Fast, Consistent Tokenization of Natural Language Text

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, tweets, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.


tokenizers

An R package that collects functions with a consistent interface to convert natural language text into tokens.

Authors: Lincoln Mullen and Dmitriy Selivanov
License: MIT


You can install this package from CRAN:

install.packages("tokenizers")

To get the development version from GitHub, use devtools.

# install.packages("devtools")
devtools::install_github("ropensci/tokenizers")

Examples

The tokenizers in this package have a consistent interface. They all take either a character vector of any length, or a list where each element is a character vector of length one. The idea is that each element comprises a text. Then each function returns a list with the same length as the input vector, where each element in the list contains the tokens generated by the function. If the input character vector or list is named, then the names are preserved, so that the names can serve as identifiers.
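For instance, names on the input are carried through to the output, so they can serve as document identifiers. A minimal sketch:

```r
library(tokenizers)

# Names on the input vector become names on the output list,
# so they can be used as document identifiers.
docs <- c(doc1 = "One fish, two fish.", doc2 = "Red fish, blue fish.")
tokens <- tokenize_words(docs)
names(tokens)
#> [1] "doc1" "doc2"
tokens$doc1
#> [1] "one"  "fish" "two"  "fish"
```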

library(magrittr)
library(tokenizers)
 
james <- paste0(
  "The question thus becomes a verbal one\n",
  "again; and our knowledge of all these early stages of thought and feeling\n",
  "is in any case so conjectural and imperfect that farther discussion would\n",
  "not be worth while.\n",
  "\n",
  "Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
  "for us _the feelings, acts, and experiences of individual men in their\n",
  "solitude, so far as they apprehend themselves to stand in relation to\n",
  "whatever they may consider the divine_. Since the relation may be either\n",
  "moral, physical, or ritual, it is evident that out of religion in the\n",
  "sense in which we take it, theologies, philosophies, and ecclesiastical\n",
  "organizations may secondarily grow.\n"
)
 
tokenize_characters(james)[[1]] %>% head(50)
#>  [1] "t" "h" "e" "q" "u" "e" "s" "t" "i" "o" "n" "t" "h" "u" "s" "b" "e"
#> [18] "c" "o" "m" "e" "s" "a" "v" "e" "r" "b" "a" "l" "o" "n" "e" "a" "g"
#> [35] "a" "i" "n" "a" "n" "d" "o" "u" "r" "k" "n" "o" "w" "l" "e" "d"
tokenize_character_shingles(james)[[1]] %>% head(20)
#>  [1] "the" "heq" "equ" "que" "ues" "est" "sti" "tio" "ion" "ont" "nth"
#> [12] "thu" "hus" "usb" "sbe" "bec" "eco" "com" "ome" "mes"
tokenize_words(james)[[1]] %>% head(10)
#>  [1] "the"      "question" "thus"     "becomes"  "a"        "verbal"  
#>  [7] "one"      "again"    "and"      "our"
tokenize_word_stems(james)[[1]] %>% head(10)
#>  [1] "the"      "question" "thus"     "becom"    "a"        "verbal"  
#>  [7] "one"      "again"    "and"      "our"
tokenize_sentences(james) 
#> [[1]]
#> [1] "The question thus becomes a verbal one again; and our knowledge of all these early stages of thought and feeling is in any case so conjectural and imperfect that farther discussion would not be worth while."                                               
#> [2] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean for us _the feelings, acts, and experiences of individual men in their solitude, so far as they apprehend themselves to stand in relation to whatever they may consider the divine_."
#> [3] "Since the relation may be either moral, physical, or ritual, it is evident that out of religion in the sense in which we take it, theologies, philosophies, and ecclesiastical organizations may secondarily grow."
tokenize_paragraphs(james)
#> [[1]]
#> [1] "The question thus becomes a verbal one again; and our knowledge of all these early stages of thought and feeling is in any case so conjectural and imperfect that farther discussion would not be worth while."                                                                                                                                                                                                                                                                   
#> [2] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean for us _the feelings, acts, and experiences of individual men in their solitude, so far as they apprehend themselves to stand in relation to whatever they may consider the divine_. Since the relation may be either moral, physical, or ritual, it is evident that out of religion in the sense in which we take it, theologies, philosophies, and ecclesiastical organizations may secondarily grow. "
tokenize_ngrams(james, n = 5, n_min = 2)[[1]] %>% head(10)
#>  [1] "the question"                   "the question thus"             
#>  [3] "the question thus becomes"      "the question thus becomes a"   
#>  [5] "question thus"                  "question thus becomes"         
#>  [7] "question thus becomes a"        "question thus becomes a verbal"
#>  [9] "thus becomes"                   "thus becomes a"
tokenize_skip_ngrams(james, n = 5, k = 2)[[1]] %>% head(10)
#>  [1] "the becomes one our all"          "question a again knowledge these"
#>  [3] "thus verbal and of early"         "becomes one our all stages"      
#>  [5] "a again knowledge these of"       "verbal and of early thought"     
#>  [7] "one our all stages and"           "again knowledge these of feeling"
#>  [9] "and of early thought is"          "our all stages and in"
tokenize_lines(james)[[1]] %>% head(5)
#> [1] "The question thus becomes a verbal one"                                   
#> [2] "again; and our knowledge of all these early stages of thought and feeling"
#> [3] "is in any case so conjectural and imperfect that farther discussion would"
#> [4] "not be worth while."                                                      
#> [5] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean"
tokenize_regex(james, pattern = "[,.;]")[[1]] %>% head(5)
#> [1] "The question thus becomes a verbal one\nagain"                                                                                                                     
#> [2] " and our knowledge of all these early stages of thought and feeling\nis in any case so conjectural and imperfect that farther discussion would\nnot be worth while"
#> [3] "\n\nReligion"                                                                                                                                                      
#> [4] " therefore"                                                                                                                                                        
#> [5] " as I now ask you arbitrarily to take it"
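Beyond the tokenizers themselves, the package provides the counting functions and the document-chunking helper mentioned in the description. A brief sketch using a short made-up text:

```r
library(tokenizers)

text <- paste("This is a sentence. So is this one.",
              "And here, at last, is a third sentence.")

# Counting helpers return one count per input text
count_words(text)
#> [1] 16
count_sentences(text)
#> [1] 3

# chunk_text() splits a longer text into pseudo-documents,
# each containing at most `chunk_size` words
chunks <- chunk_text(text, chunk_size = 8)
length(chunks)
#> [1] 2
```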

Contributing

Contributions to the package are more than welcome. One way you can help is by using this package in your own R package for natural language processing. If you want to contribute a tokenization function to this package, it should follow the same conventions as the rest of the functions whenever it makes sense to do so.
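As a rough illustration of those conventions, a contributed tokenizer might look something like the following. Note that `tokenize_commas()` is a hypothetical example, not part of the package:

```r
library(stringi)

# Hypothetical tokenizer following the package conventions: accept a
# character vector or a list of length-one character vectors, return a
# list of the same length, and preserve names as document identifiers.
tokenize_commas <- function(x) {
  if (is.list(x)) x <- unlist(x, use.names = TRUE)
  out <- stringi::stri_split_fixed(x, ",")
  names(out) <- names(x)
  out
}

tokenize_commas(c(a = "one,two", b = "three"))
#> $a
#> [1] "one" "two"
#>
#> $b
#> [1] "three"
```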

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.



News

tokenizers 0.1.4

  • Add the tokenize_character_shingles() tokenizer.
  • Improvements to documentation.

tokenizers 0.1.3

  • Add vignette.
  • Improvements to n-gram tokenizers.

tokenizers 0.1.2

  • Add stopwords for several languages.
  • New stopword options to tokenize_words() and tokenize_word_stems().

tokenizers 0.1.1

  • Fix failing test in non-UTF-8 locales.

tokenizers 0.1.0

  • Initial release with tokenizers for characters, words, word stems, sentences, paragraphs, n-grams, skip n-grams, lines, and regular expressions.


tokenizers 0.2.1 by Lincoln Mullen, 2 months ago

Website: https://lincolnmullen.com/software/tokenizers/
Report a bug at https://github.com/ropensci/tokenizers/issues
Browse source code at https://github.com/cran/tokenizers

Authors: Lincoln Mullen [aut, cre] (<https://orcid.org/0000-0001-5103-6917>), Os Keyes [ctb] (<https://orcid.org/0000-0001-5196-609X>), Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb] (<https://orcid.org/0000-0001-9953-3904>), Kenneth Benoit [ctb] (<https://orcid.org/0000-0002-0797-564X>)

Task views: Natural Language Processing
License: MIT + file LICENSE

Imports: stringi, Rcpp, SnowballC
Suggests: covr, knitr, rmarkdown, stopwords, testthat
Linking to: Rcpp

Imported by: covfefe, proustr, ptstem, tidytext
Suggested by: text2vec