Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarity computation. The package provides a source-agnostic streaming API, which allows researchers to analyze collections of documents larger than available RAM. All core functions are parallelized to benefit from multicore machines.


You've just discovered text2vec!

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

Goals we aimed to achieve in developing text2vec:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces, no need to explore new interface for each task
  • Flexible - allow complex tasks to be solved easily
  • Fast - maximize efficiency per single thread, transparently scale to multiple threads on multicore machines
  • Memory efficient - use streams and iterators; do not keep all data in RAM where possible

To learn how to use this package, see the package vignettes. See also the text2vec articles on my blog.


The core functionality at the moment includes:

  1. Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
  2. GloVe word embeddings.
  3. Topic modeling with:
    • Latent Dirichlet Allocation
    • Latent Semantic Analysis
  4. Similarities/distances between two matrices.
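As a sketch of how these pieces fit together, the vocabulary-based vectorization pipeline looks roughly like this (using the movie_review dataset that ships with text2vec; exact arguments may differ slightly across package versions):

```r
library(text2vec)
data("movie_review")  # IMDB reviews bundled with the package

# iterator over tokenized documents
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = movie_review$id)

# build and prune a vocabulary of unigrams and bigrams
v <- create_vocabulary(it, ngram = c(1L, 2L))
v <- prune_vocabulary(v, term_count_min = 5L)

# document-term matrix as a sparse Matrix object
dtm <- create_dtm(it, vocab_vectorizer(v))
```

Swapping `vocab_vectorizer(v)` for `hash_vectorizer()` gives the feature-hashing variant with no vocabulary-building pass.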


The author of the package is a little bit obsessed with efficiency.

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts (such as GloVe) are fully parallelized using the excellent RcppParallel package. This means that the word embeddings are computed in parallel on OS X, Linux, Windows, and even Solaris (x86) without any additional tuning or tricks.
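A minimal GloVe sketch follows (API as of text2vec 0.6; the `rank` argument was called `word_vectors_size` in earlier versions):

```r
library(text2vec)
data("movie_review")

it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer)
v <- prune_vocabulary(create_vocabulary(it), term_count_min = 5L)

# term co-occurrence matrix with a symmetric 5-word window
tcm <- create_tcm(it, vocab_vectorizer(v), skip_grams_window = 5L)

# fit 50-dimensional embeddings; training is parallelized via RcppParallel
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 10, n_threads = 4)

# the model learns two sets of vectors; their sum usually works best
word_vectors <- wv_main + t(glove$components)
```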

Other embarrassingly parallel tasks (such as vectorization) can use any parallel backend supported by the foreach package. They can achieve near-linear scalability with the number of available cores.
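For example, registering a backend and using the `itoken_parallel` iterator is enough to parallelize DTM construction (doParallel is used here as one possible foreach backend; any other works):

```r
library(text2vec)
library(doParallel)
data("movie_review")

registerDoParallel(4)  # any foreach-compatible backend can be registered

# iterator that splits the input into chunks processed in parallel
it <- itoken_parallel(movie_review$review,
                      preprocessor = tolower,
                      tokenizer = word_tokenizer,
                      n_chunks = 4)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
```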

Finally, a streaming API means that users do not have to load all the data into RAM.
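For instance, a corpus can be streamed from disk file by file via `ifiles()`; the tiny throw-away corpus written below is purely illustrative:

```r
library(text2vec)

# create a tiny throw-away corpus on disk for the sketch
dir.create(td <- tempfile("corpus_"))
writeLines(c("the quick brown fox", "jumps over a lazy dog"),
           file.path(td, "part1.txt"))
writeLines("a second small file of text", file.path(td, "part2.txt"))

files <- list.files(td, pattern = "\\.txt$", full.names = TRUE)

# ifiles() reads the files lazily, one at a time, so the full corpus
# never has to fit in RAM
it <- itoken(ifiles(files, reader = readLines),
             preprocessor = tolower,
             tokenizer = word_tokenizer)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
```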


The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome.


GPL (>= 2)


text2vec 0.5.1 [2018-01-10]

  1. 2018-01-10
    • removed rank* columns from collocation_stat - they were never used internally, and users can easily calculate ranks themselves
  2. 2018-01-09
    • Added Bi-Normal Separation transformation, thanks to Pavel Shashkin (@pshashk)
    • Added Dunning's log-likelihood ratio for collocations, thanks to Chris Lee (@Chrisss93)
    • Early stopping for collocations learning
  3. 2017-12-18
    • fixed several bugs #219 #217 #205
    • decreased number of dependencies - no more magrittr, uuid, tokenizers
    • removed distributed LDA which didn't work correctly
  4. 2017-10-18
    • Tokenization is now based on the tokenizers and stringi packages.
    • The models API follows the mlapi package. No API changes on the text2vec side - we just moved the abstract scikit-learn-like classes to a separate package to make them more reusable.

text2vec 0.5.0

  1. 2017-06-12
    • Add additional filters to prune_vocabulary - filter by document counts
    • Clean up LSA, fixed transform method. Added option to use randomized SVD algorithm from irlba.
  2. 2017-05-17
    • API breaking change - vocabulary format changed - now a plain data.frame with meta-information in attributes (stopwords, ngram, number of docs, etc.)
  3. 2017-03-25
    • No longer relies on RcppModules
    • API breaking change - removed lda_c from formats in DTM construction
    • added ifiles_parallel, itoken_parallel high-level functions for parallel computing
    • API breaking change - chunks_number parameter renamed to n_chunks
  4. 2017-01-02
    • API breaking change - removed create_corpus from the public API, moved co-occurrence related options from vectorizers to create_tcm
    • added the ability to set custom weights for co-occurrence statistics calculations
  5. 2016-12-30
    • Noticeable speedup (1.5x) and an even more noticeable improvement in memory usage (2x less!) for create_dtm and create_tcm. The package now relies on the sparsepp library for its underlying hash maps.
  6. 2016-10-30
    • Collocations - detection of multi-word phrases using different heuristics - PMI, gensim, LFMD.
  7. 2016-10-20
    • Fixed bug in as.lda_c() function

text2vec 0.4.0

2016-10-03. See 0.4 milestone tags.

  1. Now under the GPL (>= 2) license
  2. "immutable" iterators - no need to reinitialize them
  3. unified models interface
  4. New models: LSA, LDA, GloVe with L1 regularization
  5. Fast similarity and distances calculation: Cosine, Jaccard, Relaxed Word Mover's Distance, Euclidean
  6. Better handling of UTF-8 strings, thanks to @qinwf
  7. iterators and models rely on R6 package

text2vec 0.3.0

  1. 2016-01-13 fix for #46, thanks to @buhrmann for reporting
  2. 2016-01-16 format of vocabulary changed.
    • do not keep doc_proportions. see #52.
    • add stop_words argument to prune_vocabulary. signature also was changed.
  3. 2016-01-17 fix for #51. if iterator over tokens returns list with names, these names will be:
    • stored as attr(corpus, 'ids')
    • rownames in dtm
    • names for dtm list in lda_c format
  4. 2016-02-02 high level function for corpus and vocabulary construction.
    • construction of vocabulary from list of itoken.
    • construction of dtm from list of itoken.
  5. 2016-02-10 rename transformers
    • now all transformers start with transform_* - more intuitive + simpler usage with autocompletion
  6. 2016-03-29 (accumulated since 2016-02-10)
    • rename vocabulary to create_vocabulary.
    • new functions create_dtm, create_tcm.
    • All core functions are able to benefit from multicore machines (users have to register a parallel backend themselves)
    • Fix for progress bars. Now they are able to reach 100% and ticks increased after computation.
    • ids argument to itoken. Simplifies assignment of ids to rows of the DTM
    • create_vocabulary now can handle stopwords
    • see all updates here
  7. 2016-03-30 more robust split_into() util.

text2vec 0.2.0 (2016-01-10)

First CRAN release of text2vec.

  • Fast text vectorization with stable streaming API on arbitrary n-grams.
    • Functions for vocabulary extraction and management
    • Hash vectorizer (based on digest murmurhash3)
    • Vocabulary vectorizer
  • GloVe algorithm word embeddings.
    • Fast term co-occurrence matrix factorization via parallel async AdaGrad.
  • All core functions written in C++.

Reference manual



0.6 by Dmitriy Selivanov, 2 years ago


Authors: Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph] (coherence measures for topic models), Qing Wang [aut, cph] (author of the WarpLDA C++ code)

Documentation:   PDF Manual  

Task views: Natural Language Processing

GPL (>= 2) | file LICENSE license

Imports Matrix, Rcpp, R6, data.table, rsparse, stringi, mlapi, lgr, digest

Depends on methods

Suggests magrittr, udpipe, glmnet, testthat, covr, knitr, rmarkdown, proxy

Linking to Rcpp, digest

System requirements: C++11

Imported by conText, fdm2id, oolong, text2map, textfeatures, textmineR, wactor, wordsalad.

Suggested by lime, quanteda, textrecipes.
