A Consistent Interface to Tokenize Natural Language Text

Convert natural language text into tokens. The tokenizers have a consistent interface and are compatible with Unicode, thanks to being built on the 'stringi' package. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions.


An R package that collects functions with a consistent interface to convert natural language text into tokens.

Author: Lincoln Mullen and Dmitriy Selivanov
License: MIT

You can install this package from CRAN:

install.packages("tokenizers")

To get the development version from GitHub, use devtools.

# install.packages("devtools")
devtools::install_github("ropensci/tokenizers")

The tokenizers in this package have a consistent interface. They all take either a character vector of any length, or a list where each element is a character vector of length one. The idea is that each element comprises a text. Then each function returns a list with the same length as the input vector, where each element in the list contains the tokens generated by the function. If the input character vector or list is named, then the names are preserved, so that the names can serve as identifiers.

library(magrittr)
library(tokenizers)
 
james <- paste0(
  "The question thus becomes a verbal one\n",
  "again; and our knowledge of all these early stages of thought and feeling\n",
  "is in any case so conjectural and imperfect that farther discussion would\n",
  "not be worth while.\n",
  "\n",
  "Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
  "for us _the feelings, acts, and experiences of individual men in their\n",
  "solitude, so far as they apprehend themselves to stand in relation to\n",
  "whatever they may consider the divine_. Since the relation may be either\n",
  "moral, physical, or ritual, it is evident that out of religion in the\n",
  "sense in which we take it, theologies, philosophies, and ecclesiastical\n",
  "organizations may secondarily grow.\n"
)
 
tokenize_characters(james)[[1]] %>% head(50)
#>  [1] "t" "h" "e" "q" "u" "e" "s" "t" "i" "o" "n" "t" "h" "u" "s" "b" "e"
#> [18] "c" "o" "m" "e" "s" "a" "v" "e" "r" "b" "a" "l" "o" "n" "e" "a" "g"
#> [35] "a" "i" "n" "a" "n" "d" "o" "u" "r" "k" "n" "o" "w" "l" "e" "d"
tokenize_character_shingles(james)[[1]] %>% head(20)
#>  [1] "the" "heq" "equ" "que" "ues" "est" "sti" "tio" "ion" "ont" "nth"
#> [12] "thu" "hus" "usb" "sbe" "bec" "eco" "com" "ome" "mes"
tokenize_words(james)[[1]] %>% head(10)
#>  [1] "the"      "question" "thus"     "becomes"  "a"        "verbal"  
#>  [7] "one"      "again"    "and"      "our"
tokenize_word_stems(james)[[1]] %>% head(10)
#>  [1] "the"      "question" "thus"     "becom"    "a"        "verbal"  
#>  [7] "one"      "again"    "and"      "our"
tokenize_sentences(james) 
#> [[1]]
#> [1] "The question thus becomes a verbal one again; and our knowledge of all these early stages of thought and feeling is in any case so conjectural and imperfect that farther discussion would not be worth while."                                               
#> [2] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean for us _the feelings, acts, and experiences of individual men in their solitude, so far as they apprehend themselves to stand in relation to whatever they may consider the divine_."
#> [3] "Since the relation may be either moral, physical, or ritual, it is evident that out of religion in the sense in which we take it, theologies, philosophies, and ecclesiastical organizations may secondarily grow."
tokenize_paragraphs(james)
#> [[1]]
#> [1] "The question thus becomes a verbal one again; and our knowledge of all these early stages of thought and feeling is in any case so conjectural and imperfect that farther discussion would not be worth while."                                                                                                                                                                                                                                                                   
#> [2] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean for us _the feelings, acts, and experiences of individual men in their solitude, so far as they apprehend themselves to stand in relation to whatever they may consider the divine_. Since the relation may be either moral, physical, or ritual, it is evident that out of religion in the sense in which we take it, theologies, philosophies, and ecclesiastical organizations may secondarily grow. "
tokenize_ngrams(james, n = 5, n_min = 2)[[1]] %>% head(10)
#>  [1] "the question"                   "the question thus"             
#>  [3] "the question thus becomes"      "the question thus becomes a"   
#>  [5] "question thus"                  "question thus becomes"         
#>  [7] "question thus becomes a"        "question thus becomes a verbal"
#>  [9] "thus becomes"                   "thus becomes a"
tokenize_skip_ngrams(james, n = 5, k = 2)[[1]] %>% head(10)
#>  [1] "the becomes one our all"          "question a again knowledge these"
#>  [3] "thus verbal and of early"         "becomes one our all stages"      
#>  [5] "a again knowledge these of"       "verbal and of early thought"     
#>  [7] "one our all stages and"           "again knowledge these of feeling"
#>  [9] "and of early thought is"          "our all stages and in"
tokenize_lines(james)[[1]] %>% head(5)
#> [1] "The question thus becomes a verbal one"                                   
#> [2] "again; and our knowledge of all these early stages of thought and feeling"
#> [3] "is in any case so conjectural and imperfect that farther discussion would"
#> [4] "not be worth while."                                                      
#> [5] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean"
tokenize_regex(james, pattern = "[,.;]")[[1]] %>% head(5)
#> [1] "The question thus becomes a verbal one\nagain"                                                                                                                     
#> [2] " and our knowledge of all these early stages of thought and feeling\nis in any case so conjectural and imperfect that farther discussion would\nnot be worth while"
#> [3] "\n\nReligion"                                                                                                                                                      
#> [4] " therefore"                                                                                                                                                        
#> [5] " as I now ask you arbitrarily to take it"

Contributions to the package are more than welcome. One way that you can help is by using this package in your R package for natural language processing. If you want to contribute a tokenization function to this package, it should follow the same conventions as the rest of the functions whenever it makes sense to do so.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


News

Tokenizers 0.1.4

  • Add the tokenize_character_shingles() tokenizer.
  • Improvements to documentation.

tokenizers 0.1.3

  • Add vignette.
  • Improvements to n-gram tokenizers.

tokenizers 0.1.2

  • Add stopwords for several languages.
  • New stopword options to tokenize_words() and tokenize_word_stems().

tokenizers 0.1.1

  • Fix failing test in non-UTF-8 locales.

tokenizers 0.1.0

  • Initial release with tokenizers for characters, words, word stems, sentences paragraphs, n-grams, skip n-grams, lines, and regular expressions.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("tokenizers")

0.1.4 by Lincoln Mullen, 10 months ago


https://github.com/ropensci/tokenizers


Report a bug at https://github.com/ropensci/tokenizers/issues


Browse source code at https://github.com/cran/tokenizers


Authors: Lincoln Mullen [aut, cre], Dmitriy Selivanov [ctb]


Documentation:   PDF Manual  


Task views: Natural Language Processing


MIT + file LICENSE license


Imports stringi, Rcpp, SnowballC

Suggests testthat, covr, knitr, rmarkdown

Linking to Rcpp


Imported by covfefe, ptstem, tidytext.

Suggested by cleanNLP.


See at CRAN