A Simple General Purpose N-Gram Tokenizer

A simple n-gram (contiguous sequence of n items from a given text) tokenizer to be used with the 'tm' package, with no 'rJava'/'RWeka' dependency. It is a fast, pure-R alternative for users who would otherwise install RWeka just for NGramTokenizer().

Usage

library(ngramrr)
library(tm)
library(magrittr)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it's less dangerous", "here we are now",
             "entertain us", "i feel stupid", "and contagious", "here we are now",
             "entertain us", "a mulatto", "an albino", "a mosquito", "my libido",
             "yeah", "hey yay")

# Word n-grams (unigrams up to trigrams) from a single string
ngramrr(nirvana[1], ngmax = 3)

# Use ngramrr as the tokenizer when building a tm TermDocumentMatrix
Corpus(VectorSource(nirvana)) %>%
  TermDocumentMatrix(control = list(tokenize = function(x) ngramrr(x, ngmax = 3))) %>%
  inspect

# Character n-grams instead of word n-grams
Corpus(VectorSource(nirvana)) %>%
  TermDocumentMatrix(control = list(tokenize = function(x) ngramrr(x, char = TRUE, ngmax = 3))) %>%
  inspect

# dtm2 wrapper: build a DocumentTermMatrix directly from a character vector
dtm2(nirvana, ngmax = 3, removePunctuation = TRUE)
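The package also exports tdm2(), the TermDocumentMatrix counterpart of dtm2(). A minimal sketch, assuming tdm2() accepts the same arguments as dtm2() (check ?tdm2 for the authoritative signature):

```r
library(ngramrr)
library(tm)

nirvana <- c("hello hello hello how low", "here we are now", "entertain us")

# tdm2: same interface as dtm2 but returns a TermDocumentMatrix
# (terms as rows, documents as columns); extra arguments are passed
# through to tm's control list.
tdm <- tdm2(nirvana, ngmax = 3, removePunctuation = TRUE)
inspect(tdm)
```

Use dtm2() when downstream code expects documents as rows (e.g. most modeling functions), and tdm2() when it expects terms as rows.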



Installation

install.packages("ngramrr")
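To try the development version from the GitHub repository linked below, one option is the remotes package (an assumption; any install_github-style helper works):

```r
# Development version from GitHub; requires the remotes package.
# install.packages("remotes")  # if not already installed
remotes::install_github("chainsawriot/ngramrr")
```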

Version 0.2.0, by Chung-hong Chan


https://github.com/chainsawriot/ngramrr


Browse source code at https://github.com/cran/ngramrr


Authors: Chung-hong Chan <[email protected]>




GPL-2 license


Imports tm, tau

Suggests testthat, magrittr


Imported by SentimentAnalysis.

