Data for Wordpiece-Style Tokenization

Provides vocabulary data used by the wordpiece algorithm to tokenize text into meaningful chunks. The included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
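A minimal usage sketch, assuming the package exports a `wordpiece_vocab()` accessor with a `cased` argument as described in its reference manual (check the documentation for the exact signature):

```r
# Load the packaged vocabulary data (hypothetical sketch of the API).
library(wordpiece.data)

# cased = FALSE would select the uncased BERT vocabulary;
# cased = TRUE the cased one.
vocab <- wordpiece_vocab(cased = FALSE)

# Inspect the first few tokens of the vocabulary.
head(vocab)
```

The vocabularies themselves are consumed by the wordpiece package (listed under "Imported by" below), which performs the actual tokenization.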



install.packages("wordpiece.data")

1.0.2 by Jon Harmon, 2 months ago


https://github.com/macmillancontentscience/wordpiece.data


Report a bug at https://github.com/macmillancontentscience/wordpiece.data/issues


Browse source code at https://github.com/cran/wordpiece.data


Authors: Jonathan Bratt [aut], Jon Harmon [aut, cre], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc. [cph] (original BERT vocabularies)


Documentation: PDF Manual


License: Apache License (>= 2)


Suggests: testthat


Imported by wordpiece.

