Text Processing for Small or Big Data Files

Efficiently processes big text data files in batches. For this purpose, it offers functions for splitting, parsing, tokenizing and creating a vocabulary. Moreover, it includes functions for building either a document-term matrix or a term-document matrix and extracting information from those (term-associations, most frequent terms). Lastly, it includes functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and functions to work with sparse matrices. The source code is written in 'C++11' and exported to R through the 'Rcpp', 'RcppArmadillo' and 'BH' packages.

The textTinyR package consists of text pre-processing functions for small or big data files. More details on the functionality of textTinyR can be found in the package Vignette. The R package can be installed on the following operating systems: Linux, Mac and Windows. However, there are some limitations:

  • there is no support for Chinese, Japanese, Korean, Thai or other languages with ambiguous word boundaries.
  • there are no support functions for the UTF locale on Windows, meaning that only English character strings or files can be input and pre-processed.

System Requirements (for Unix OSs)

Debian/Ubuntu

sudo apt-get install libboost-all-dev

sudo apt-get update

sudo apt-get install libboost-locale-dev

Fedora

yum install boost-devel

Macintosh OSX/brew

UPDATE 25-05-2017: The current CRAN version of the package can only be installed on Linux and Windows. If the boost-locale libraries are installed properly on your operating system, use the devtools::install_github(repo = 'mlampros/textTinyR', clean = TRUE) function to install the textTinyR package.

On Macintosh OSX, the boost library is installed using the Homebrew package manager.

If the boost library has already been installed using brew install boost, it must first be removed with the following command,

brew uninstall boost

Then the formula for the boost library should be modified using a text editor (TextEdit, TextMate, etc.). On Macintosh OS Sierra the formula is saved in:


The user should open the boost.rb formula and replace the following code chunk beginning from (approx.) line 71,

# layout should be synchronized with boost-python
args = ["--prefix=#{prefix}",
if build.with? "single"
  args << "threading=multi,single"
else
  args << "threading=multi"
end

with the following code chunk,

# layout should be synchronized with boost-python
args = ["--prefix=#{prefix}",
#if build.with? "single"
#  args << "threading=multi,single"
#else
#  args << "threading=multi"
#end

Then the user should save the changes, close the file and run,

brew update

to apply the changes.

Then open a new terminal (console) and type the following command, which builds and installs the boost library from source using the modified formula (note the two dashes before build-from-source),

brew install /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/Formula/boost.rb --build-from-source

That's it.
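Before installing the package, it can help to confirm that the boost-locale headers are actually visible on the system. The following is a minimal sketch, not part of the package: the find_boost_locale helper is hypothetical, and the list of include directories is an assumption about common install locations that may need adjusting for your setup. It only checks for the Boost.Locale convenience header boost/locale.hpp and does not verify that the compiled libraries are present.

```shell
#!/bin/sh
# Hypothetical helper (not part of textTinyR): print the first directory
# among the arguments that contains the header boost/locale.hpp.
find_boost_locale() {
  for dir in "$@"; do
    if [ -f "$dir/boost/locale.hpp" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Assumed include locations; adjust these for your system.
if found=$(find_boost_locale /usr/include /usr/local/include /opt/homebrew/include); then
  echo "boost-locale headers found under: $found"
else
  echo "boost-locale headers not found in the searched directories"
fi
```

If the headers are reported as missing, revisit the system requirements above before running the R installation commands.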

Installation of the textTinyR package (CRAN, GitHub)

To install the package from CRAN use,

install.packages('textTinyR', clean = TRUE)

and to install the latest version from GitHub use the install_github function of the devtools package,

devtools::install_github(repo = 'mlampros/textTinyR', clean = TRUE)

Use the following link to report bugs/issues,



textTinyR 1.0.9

I added the global_term_weights() method in the sparse_term_matrix R6 class

textTinyR 1.0.8

I removed the threads parameter from the term_associations method of the sparse_term_matrix R6-class. I modified the OpenMP clauses of the .cpp files to address the ASAN errors.

textTinyR 1.0.7

I added the triplet_data() method in the sparse_term_matrix R6 class

textTinyR 1.0.6

I removed the ngram_sequential and ngram_overlap stemmers from the vocabulary_parser function. I fixed a bug in the char_n_grams of the token_stats.h source file.

textTinyR 1.0.5

I removed the ngram_sequential and ngram_overlap stemmers from the sparse_term_matrix and tokenize_transform_vec_docs functions. I had overlooked the fact that n-gram stemming is based on the whole corpus and not on each vector of the document(s), as is the case in the sparse_term_matrix and tokenize_transform_vec_docs functions. I added a zzz.R file with a packageStartupMessage to inform users about this change in n-gram stemming. I also updated the package documentation and Vignette. I modified the secondary_n_grams function of the tokenization.h source file to fix a bug. I used the enc2utf8 function to encode the terms of the sparse matrix as UTF-8.

textTinyR 1.0.4

I modified the res_token_vector(), res_token_list() [ export_all_funcs.cpp file ] and append_2file() [ tokenization.h file ] functions, because the tokenize_transform_vec_docs() function returned incorrect output when the path_2folder parameter was not the empty string.

textTinyR 1.0.3

I corrected the UBSAN memory errors, which occurred in the adj_Sparsity() function of the term_matrix.h header file (the errors happened when passing empty vectors to the armadillo batch_insertion() function).

textTinyR 1.0.2

I included detailed installation instructions for the Macintosh OSx. I modified the source code to correct the boost-locale errors, which occurred during testing on Macintosh OSx.

textTinyR 1.0.1

I added the following system-flag in the Makevars.in file to avoid linking errors for the Mac OS: -lboost_system. I modified the term_associations and Term_Matrix_Adjust methods to avoid indexing errors. I corrected mistakes in the Vignette.

textTinyR 1.0.0



1.0.9 by Lampros Mouselimis, 2 months ago


Report a bug at https://github.com/mlampros/textTinyR/issues

Browse source code at https://github.com/cran/textTinyR

Authors: Lampros Mouselimis


GPL-3 license

Imports Rcpp, R6, data.table, utils

Depends on Matrix

Suggests testthat, covr, knitr, rmarkdown

Linking to Rcpp, RcppArmadillo, BH

System requirements: The package requires the following two components: a C++11 compiler and, on a Unix OS, the boost-locale headers and libraries (boost >= 1.55.0, www.boost.org). Debian/Ubuntu: libboost-locale-dev; Fedora: yum install boost-devel; OSX/brew: detailed installation instructions can be found in the README file.
