Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarity computation. The package provides a source-agnostic streaming API, which allows researchers to analyze collections of documents larger than available RAM. All core functions are parallelized to benefit from multicore machines.


You've just discovered text2vec!

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

Goals we aimed to achieve in developing text2vec:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces, no need to explore new interface for each task
  • Flexible - allow complex tasks to be solved easily
  • Fast - maximize efficiency per single thread, transparently scale to multiple threads on multicore machines
  • Memory efficient - use streams and iterators; do not keep all data in RAM where possible

To learn how to use this package, see the package vignettes. See also the text2vec articles on my blog.


The core functionality at the moment includes:

  1. Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
  2. GloVe word embeddings.
  3. Topic modeling with:
    • Latent Dirichlet Allocation
    • Latent Semantic Analysis
  4. Similarities/distances between two matrices.
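As a sketch of how these pieces fit together, the vocabulary-based vectorization pipeline looks roughly like this (using the movie_review dataset that ships with text2vec; exact arguments may differ slightly across package versions):

```r
library(text2vec)
data("movie_review")  # IMDB reviews bundled with the package

# iterator over tokenized documents
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = movie_review$id)

# build and prune a vocabulary of unigrams and bigrams
v <- create_vocabulary(it, ngram = c(1L, 2L))
v <- prune_vocabulary(v, term_count_min = 5L)

# document-term matrix as a sparse Matrix object
dtm <- create_dtm(it, vocab_vectorizer(v))
```

Swapping `vocab_vectorizer(v)` for `hash_vectorizer()` gives the feature-hashing variant with no vocabulary-building pass.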


The author of the package is a little bit obsessed with efficiency.

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts (such as GloVe) are fully parallelized using the excellent RcppParallel package. This means that the word embeddings are computed in parallel on OS X, Linux, Windows, and even Solaris (x86) without any additional tuning or tricks.
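A minimal GloVe sketch follows (API as of text2vec 0.6; the `rank` argument was called `word_vectors_size` in earlier versions):

```r
library(text2vec)
data("movie_review")

it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer)
v <- prune_vocabulary(create_vocabulary(it), term_count_min = 5L)

# term co-occurrence matrix with a symmetric 5-word window
tcm <- create_tcm(it, vocab_vectorizer(v), skip_grams_window = 5L)

# fit 50-dimensional embeddings; training is parallelized via RcppParallel
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 10, n_threads = 4)

# the model learns two sets of vectors; their sum usually works best
word_vectors <- wv_main + t(glove$components)
```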

Other embarrassingly parallel tasks (such as vectorization) can use any parallel backend supported by the foreach package. They can achieve near-linear scalability with the number of available cores.
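For example, registering a backend and using the `itoken_parallel` iterator is enough to parallelize DTM construction (doParallel is used here as one possible foreach backend; any other works):

```r
library(text2vec)
library(doParallel)
data("movie_review")

registerDoParallel(4)  # any foreach-compatible backend can be registered

# iterator that splits the input into chunks processed in parallel
it <- itoken_parallel(movie_review$review,
                      preprocessor = tolower,
                      tokenizer = word_tokenizer,
                      n_chunks = 4)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
```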

Finally, a streaming API means that users do not have to load all the data into RAM.
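For instance, a corpus can be streamed from disk file by file via `ifiles()`; the tiny throw-away corpus written below is purely illustrative:

```r
library(text2vec)

# create a tiny throw-away corpus on disk for the sketch
dir.create(td <- tempfile("corpus_"))
writeLines(c("the quick brown fox", "jumps over a lazy dog"),
           file.path(td, "part1.txt"))
writeLines("a second small file of text", file.path(td, "part2.txt"))

files <- list.files(td, pattern = "\\.txt$", full.names = TRUE)

# ifiles() reads the files lazily, one at a time, so the full corpus
# never has to fit in RAM
it <- itoken(ifiles(files, reader = readLines),
             preprocessor = tolower,
             tokenizer = word_tokenizer)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
```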


The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome.


GPL (>= 2)


text2vec 0.5.1 [2018-01-10]

  1. 2018-01-10
    • removed rank* columns from collocation_stat - they were never used internally, and users can easily calculate ranks themselves
  2. 2018-01-09
    • Added Bi-Normal Separation transformation, thanks to Pavel Shashkin (@pshashk)
    • Added Dunning's log-likelihood ratio for collocations, thanks to Chris Lee (@Chrisss93)
    • Early stopping for collocations learning
  3. 2017-12-18
    • fixed several bugs #219 #217 #205
    • decreased number of dependencies - no more magrittr, uuid, tokenizers
    • removed distributed LDA which didn't work correctly
  4. 2017-10-18
    • Tokenization is now based on the tokenizers and stringi packages.
    • The models API follows the mlapi package. No API changes on the text2vec side - we just moved the abstract scikit-learn-like classes to a separate package to make them more reusable.

text2vec 0.5.0

  1. 2017-06-12
    • Add additional filters to prune_vocabulary - filter by document counts
    • Clean up LSA, fixed transform method. Added option to use randomized SVD algorithm from irlba.
  2. 2017-05-17
    • API breaking change - vocabulary format changed - now a plain data.frame with meta-information in attributes (stopwords, ngram, number of docs, etc.)
  3. 2017-03-25
    • No longer relies on RcppModules
    • API breaking change - removed lda_c from formats in DTM construction
    • added ifiles_parallel, itoken_parallel high-level functions for parallel computing
    • API breaking change - chunks_number parameter renamed to n_chunks
  4. 2017-01-02
    • API breaking change - removed create_corpus from the public API, moved co-occurrence related options from vectorizers to create_tcm
    • added the ability to set custom weights for co-occurrence statistics calculations
  5. 2016-12-30
    • Noticeable speedup (1.5x) and an even more noticeable improvement in memory usage (2x less!) for create_dtm and create_tcm. The package now relies on the sparsepp library for its underlying hash maps.
  6. 2016-10-30
    • Collocations - detection of multi-word phrases using different heuristics - PMI, gensim, LFMD.
  7. 2016-10-20
    • Fixed bug in as.lda_c() function

text2vec 0.4.0

2016-10-03. See 0.4 milestone tags.

  1. Now under the GPL (>= 2) license
  2. "immutable" iterators - no need to reinitialize them
  3. unified models interface
  4. New models: LSA, LDA, GloVe with L1 regularization
  5. Fast similarity and distances calculation: Cosine, Jaccard, Relaxed Word Mover's Distance, Euclidean
  6. Better handling of UTF-8 strings, thanks to @qinwf
  7. iterators and models rely on R6 package

text2vec 0.3.0

  1. 2016-01-13 fix for #46, thanks to @buhrmann for reporting
  2. 2016-01-16 format of vocabulary changed.
    • do not keep doc_proportions. see #52.
    • add stop_words argument to prune_vocabulary. signature also was changed.
  3. 2016-01-17 fix for #51. if iterator over tokens returns list with names, these names will be:
    • stored as attr(corpus, 'ids')
    • rownames in dtm
    • names for dtm list in lda_c format
  4. 2016-02-02 high level function for corpus and vocabulary construction.
    • construction of vocabulary from list of itoken.
    • construction of dtm from list of itoken.
  5. 2016-02-10 rename transformers
    • now all transformers start with transform_* - more intuitive + simpler usage with autocompletion
  6. 2016-03-29 (accumulated since 2016-02-10)
    • rename vocabulary to create_vocabulary.
    • new functions create_dtm, create_tcm.
    • All core functions are able to benefit from multicore machines (users have to register a parallel backend themselves)
    • Fix for progress bars. Now they are able to reach 100% and ticks increased after computation.
    • ids argument to itoken. Simplifies assignment of ids to rows of the DTM
    • create_vocabulary now can handle stopwords
    • see all updates here
  7. 2016-03-30 more robust split_into() util.

text2vec 0.2.0 (2016-01-10)

First CRAN release of text2vec.

  • Fast text vectorization with stable streaming API on arbitrary n-grams.
    • Functions for vocabulary extraction and management
    • Hash vectorizer (based on digest murmurhash3)
    • Vocabulary vectorizer
  • GloVe algorithm word embeddings.
    • Fast term co-occurrence matrix factorization via parallel async AdaGrad.
  • All core functions written in C++.

Reference manual



0.6 by Dmitriy Selivanov, 2 years ago


Authors: Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph] (coherence measures for topic models), Qing Wang [aut, cph] (author of the WarpLDA C++ code)

Documentation:   PDF Manual  

Task views: Natural Language Processing

GPL (>= 2) | file LICENSE license

Imports Matrix, Rcpp, R6, data.table, rsparse, stringi, mlapi, lgr, digest

Depends on methods

Suggests magrittr, udpipe, glmnet, testthat, covr, knitr, rmarkdown, proxy

Linking to Rcpp, digest

System requirements: C++11

Imported by conText, fdm2id, oolong, text2map, textfeatures, textmineR, wactor, wordsalad.

Suggested by lime, quanteda, textrecipes.
