Diagnostics to Assess the Effects of Text Preprocessing Decisions

Functions to assess the effects of different text preprocessing decisions on the inferences drawn from the resulting document-term matrices they generate.

An R package to assess the consequences of text preprocessing decisions.

To get started, see the [getting started with preText vignette].

The paper detailing the procedure can be found at the link below:

  • Matthew J. Denny and Arthur Spirling (2017). "Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It". [ssrn.com/abstract=2849145]


The easiest way to install preText is from CRAN, via the standard install.packages command:
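A minimal sketch of that command, assuming the CRAN package name is preText (as used throughout this page) and that a CRAN mirror is configured in your R session:

```r
# Install the released version of preText from CRAN.
install.packages("preText")
```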


If you want to get the latest version from GitHub, start by checking out the Requirements for using C++ code with R section in the following tutorial: Using C++ and R code Together with Rcpp. You will likely need to install either Xcode or Rtools depending on whether you are using a Mac or Windows machine before you can install the preText package via GitHub, since it makes use of C++ code.


Now we can install from GitHub using the following line:
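A hedged sketch of the GitHub install, assuming the devtools package is used for the installation and that the development repository lives at matthewjdenny/preText. The repository path is an assumption, not something stated on this page, so substitute the correct path if it differs:

```r
# devtools provides install_github(); install it first if it is missing.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# ASSUMPTION: "matthewjdenny/preText" is the development repository.
# Replace it with the correct "owner/repo" path if it differs.
devtools::install_github("matthewjdenny/preText")
```

Because the package contains C++ code, this step compiles from source, which is why the compiler toolchain described above (Xcode or Rtools) has to be in place first.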


Once the preText package is installed, you may access its functionality as you would any other package by calling:
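For example, assuming the installation succeeded:

```r
# Attach the package so its functions are available in the current session.
library(preText)
```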


If all went well, you should be able to replicate the steps in the vignette("getting_started").

Basic Usage

The basic functionality of this package is detailed in a vignette, which is [available here]. Beyond this basic functionality, the package includes a number of additional utility and analysis functions for exploring and comparing multiple document-term matrices.
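As a rough illustration of that basic workflow, the sketch below preprocesses a small corpus under every combination of preprocessing steps, scores the results, and plots the scores. It assumes a recent quanteda release (for `data_corpus_inaugural` and `as.character()` on a corpus). The function names and arguments are as recalled from the package vignette and should be checked against the installed version's help pages. The parameter values are illustrative only.

```r
library(preText)
library(quanteda)

# Use the first 10 US inaugural addresses shipped with quanteda as a small
# example corpus. as.character() extracts the raw texts (ASSUMPTION: a
# recent quanteda release).
documents <- as.character(data_corpus_inaugural)[1:10]

# Build one document-term matrix per combination of preprocessing steps.
# The threshold value here is illustrative, not a recommendation.
preprocessed_documents <- factorial_preprocessing(
  documents,
  use_ngrams = TRUE,
  infrequent_term_threshold = 0.2,
  verbose = FALSE
)

# Compare the resulting document-term matrices and compute preText scores.
preText_results <- preText(
  preprocessed_documents,
  dataset_name = "Inaugural Speeches",
  distance_method = "cosine",
  num_comparisons = 20,
  verbose = FALSE
)

# Plot the preText score for each preprocessing specification.
preText_score_plot(preText_results)
```

Keeping the corpus small matters here: the number of document-term matrices grows with every preprocessing choice considered, so even a modest collection can take a while to process.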

Bug Reporting





0.6.2 by Matthew J. Denny, 2 years ago

Browse source code at https://github.com/cran/preText

Authors: Matthew J. Denny, Arthur Spirling

Documentation:   PDF Manual  

GPL-3 license

Imports quanteda, ggplot2, vegan, grid, parallel, topicmodels, cowplot, ecodist, proxy, reshape2

Suggests testthat, knitr, rmarkdown
