Functions to generate stopword lists in 110 languages, consistently across all supported languages. The generated lists are based on the morphological tagset of Universal Dependencies.
Authors: Silvie Cinková*, Maciej Eder
An R package containing customizable lists of stopwords in multiple languages; it attempts to follow tidy data principles.
The idea behind this package is to give the user control over the stopword selection. The core generate_stoplist() function relies on multilingual_stopwords(), a large data frame derived from the current release of the Universal Dependencies Treebanks. We have included all languages whose corpora total more than 10,000 tokens, which is large enough to cover all common closed-class words, such as prepositions, conjunctions, and auxiliary verbs. The data is encoded in UTF-8.
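To illustrate the general idea of deriving a stoplist from part-of-speech tags, here is a minimal sketch in base R. It does not call the package: the toy data frame, its column names (`word`, `upos`), and the particular set of tags are assumptions made for this example only, and the actual structure of the package's data may differ.

```r
# Toy stand-in for a tagged word table; the column names are hypothetical
# and are NOT taken from the package itself.
toy <- data.frame(
  word = c("the", "of", "and", "is", "house", "run"),
  upos = c("DET", "ADP", "CCONJ", "AUX", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# A subset of closed-class Universal Dependencies POS tags,
# chosen for this illustration.
closed_class <- c("ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ")

# Keep only the words whose tag belongs to the chosen set, dropping duplicates.
stoplist <- unique(toy$word[toy$upos %in% closed_class])
print(stoplist)
```

Changing the contents of `closed_class` changes which words end up in the list, which is the kind of control over stopword selection described above.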
Install the package directly from the GitHub repository:
library(devtools)
install_github("computationalstylistics/stopwoRds", build_vignettes = TRUE)