The **sentometrics** package is an integrated framework for textual sentiment time series aggregation and prediction. It accounts for the intrinsic challenge that, for a given text, sentiment can be computed in many different ways, as well as for the large number of possibilities to pool sentiment across texts and time. This additional layer of manipulation does not exist in standard text mining and time series analysis packages. The package therefore integrates the fast qualification of sentiment from texts, the aggregation into different sentiment time series, and the optimized prediction based on these measures. See Ardia et al. (2020) for an overview.

See the project page, the vignette, and the paper for, respectively, a brief introduction, an extensive introduction to the package, and a real-life macroeconomic forecasting application.

To install the package from CRAN, simply do:

```r
install.packages("sentometrics")
```
The latest development version of **sentometrics** is available at https://github.com/sborms/sentometrics. To install this version (which may contain bugs!), execute:

```r
devtools::install_github("sborms/sentometrics")
```
Please cite **sentometrics** in publications: use `citation("sentometrics")`.
This software package originates from a Google Summer of Code 2017 project.
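As a minimal sketch of the workflow described above (corpus construction, lexicon preparation, and aggregation into sentiment time series), using the built-in `usnews` dataset and word lists that ship with the package; the particular aggregation options shown (`"counts"`, `"proportional"`, `"equal_weight"`) are illustrative choices, not the only ones available:

```r
library("sentometrics")

# built-in example corpus (US news articles) and built-in word lists
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

# a corpus object holding the texts, dates, and features
corpus <- sento_corpus(corpusdf = usnews)

# lexicons (here with English valence shifters) used to score the texts
lexicons <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                           list_valence_shifters[["en"]])

# how to aggregate: within documents, across documents, and across time
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = "equal_weight", by = "month", lag = 3)

# the resulting textual sentiment time series measures
measures <- sento_measures(corpus, lexicons, ctr)
```

The `measures` object can then be passed on to the package's prediction and attribution functions.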
- Modified the `peakdocs()` function and added a `peakdates()` function to properly handle the entire functionality of extracting peaks
- Added the `sentiment_bind()` and `to_sentiment()` functions
- Updates to the `sentolexicons` object
- Set `lag = 1` by default in the `ctr_agg()` function, and set weights to 1 by default for `n = 1` in the `weights_beta()` function
- Removed the `abind` package from Imports
- Removed the `zoo` package from Imports, by replacing the single occurrence of the `zoo::na.locf()` function by the `fill_NAs()` helper function (written in `Rcpp`)
- Added a `quanteda::docvars()` replacement method for a `sentocorpus` object
- Removed the `"x"` output element from a `sentomodel` object (for large samples, this became too memory consuming)
- Removed the `"howWithin"` output element from a `sentomeasures` object, and simplified a `sentiment` object into a `data.table` directly instead of a `list`
- Changed the `do.shrinkage.x` argument in the `ctr_model()` function to a vector argument
- Added a `do.lags` argument to the `attributions()` function, to be able to circumvent the most time-consuming part of the computations
- Added a check in the `sento_measures()` function on the uniqueness of the names within and across the lexicons, features and time weighting schemes
- Fixed a bug in the `measures_merge()` function that made full merging not possible
- The `n` argument in the `peakdocs()` function can now also be specified as a quantile
- Set the default of the `nCore` argument in the `compute_sentiment()` and `ctr_agg()` functions to 1
- Defined the output of the `compute_sentiment.sentocorpus()` function as a `sentiment` object, and modified the `aggregate()` function to `aggregate.sentiment()`
- Added the `weights_beta()`, `get_dates()`, `get_dimensions()`, `get_measures()`, and `get_loss_data()` functions
- Renamed `to_global()` to `measures_global()`, `perform_agg()` to `aggregate()`, `almons()` to `weights_almon()`, `exponentials()` to `weights_exponential()`, `setup_lexicons()` to `sento_lexicons()`, `retrieve_attributions()` to `attributions()`, and `plot_attributions()` to `plot.attributions()`
- Removed the `ctr_merge()` function, so that all merge parameters have to be passed on directly to the `measures_merge()` function
- Added `center` and `scale` arguments to the `scale()` function
- Added `dateBefore` and `dateAfter` arguments to the `measures_fill()` function, and dropped the `NA` option of its `fill` argument
- Added a `"beta"` time aggregation option (see the associated `weights_beta()` function)
- Added an `"attribWeights"` element to the output `sentomeasures` object in the required `measures_xyz()` functions
- Added a new dimension (`"lags"`) to the `attributions()` function, and corrected some edge cases
- Added a `lambdas` argument to the `ctr_model()` function, directly passed on to the `glmnet::glmnet()` function if used
- Dropped the `do.combine` argument in the `measures_delete()` and `measures_select()` functions to simplify
- Added `covr` to Suggests
- Sped up the `compute_sentiment()` function, by writing part of the code in `Rcpp` relying on `RcppParallel` (added to Imports); there are now three approaches to computing sentiment (unigrams, bigrams and clusters)
- Replaced the `dfm` argument in the `compute_sentiment()` and `ctr_agg()` functions by a `tokens` argument, and altered the input and behaviour of the `nCore` argument in these same two functions
- Switched from the `quanteda` package to the `stringi` package for more direct tokenisation
- Trimmed the `list_lexicons` and `list_valence_shifters` built-in word lists by keeping only unigrams, and included the same trimming procedure in the `sento_lexicons()` function
- Added `"t"` to the `list_valence_shifters` built-in word list, and reset values of the `"y"` column from 2 to 1.8 and from 0.5 to 0.2
- Updated the `epu` built-in dataset with the newest available series, up to July 2018
- Updated `list_valence_shifters[["en"]]` as used in the `compute_sentiment()` function
- Added a `print()` generic for a `sentomeasures` object
- Added a `"tf-idf"` option for within-document aggregation in the `ctr_agg()` function
- The `sento_lexicons()` function outputs a `sentolexicons` object, which the `compute_sentiment()` function specifically requires as an input; a `sentolexicons` object also includes a `"["` class-preserving extractor function
- The `attributions()` function outputs an `attributions` object; the `plot_attributions()` function is therefore replaced by the `plot()` generic
- Removed the `perform_MCS()` function, but the output of the `get_loss_data()` function can easily be used as an input to the `MCSprocedure()` function from the `MCS` package (discarded from Imports)
- Moved the `parallel` and `doParallel` packages to Suggests, as only needed (if enacted) in the `sento_model()` function
- Dropped `ggthemes` from Imports
- Added the `measures_delete()`, `nmeasures()`, `nobs()`, and `to_sentocorpus()` functions
- Renamed `xyz_measures()` to `measures_xyz()`, and `extract_peakdocs()` to `peakdocs()`
- Dropped the `do.normalizeAlm` argument in the `ctr_agg()` function, but kept it in the `almons()` function
- Modified the `almons()` function to be consistent with the Ardia et al. (2017) paper
- Renamed `lexicons` to `list_lexicons`, and `valence` to `list_valence_shifters`
- The `stats` element of a `sentomeasures` object is now also updated in `measures_fill()`
- Renamed `"_eng"` to `"_en"` in the `list_lexicons` and `list_valence_shifters` objects, to be in accordance with two-letter ISO language naming
- Changed the `"valence_language"` naming to `"language"` in the `list_valence_shifters` object
- The `compute_sentiment()` function now also accepts a `quanteda` `corpus` object and a `character` vector
- The `add_features()` function now also accepts a `quanteda` `corpus` object
- Added an `nCore` argument to the `compute_sentiment()`, `ctr_agg()`, and `ctr_model()` functions to allow for (more straightforward) parallelized computations, and omitted the `do.parallel` argument in the `ctr_model()` function
- Added a `do.difference` argument to the `ctr_model()` function and expanded the use of the already existing `oos` argument
- Added `ggplot2` and `foreach` to Imports
- Updates to the `to_global()` function
- Used the `tolower = FALSE` option of the `quanteda::dfm()` constructor in `compute_sentiment()`
- Renamed the `intercept` argument in `ctr_model()` to `do.intercept` for consistency
- Improvements to the `sento_corpus()` and `add_features()` functions
- Added the `diff()`, `extract_peakdocs()`, and `subset_measures()` functions
- Implemented valence shifting following the `sentimentr` package (see the `include_valence()` helper function)
- Added a new within-document aggregation option (`"proportionalPol"`)
- Added a `dfm` argument in `ctr_agg()`
- `select_measures()` simplified, but its `toSelect` argument expanded
- `to_global()` changed (see vignette)
- `add_features()`: regex and non-binary values (between 0 and 1) allowed