Unsupervised text tokenizer that performs byte pair encoding (BPE) and unigram language modelling.
Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer to split text into words and smaller subword units.
The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018).
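To illustrate the byte pair encoding idea mentioned above, here is a minimal, self-contained sketch of the merge loop: starting from characters, the most frequent adjacent symbol pair is repeatedly merged into a new subword symbol. This is an illustrative toy implementation, not the sentencepiece library's API; the function name `bpe_merges` and the example corpus are hypothetical.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Toy BPE sketch (hypothetical helper, not the sentencepiece API).
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count frequencies of adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["low", "low", "lower", "newest", "newest", "newest"], 3)
print(merges)  # learned merge operations, most frequent pair first
```

The learned merges define the subword vocabulary: applying them in order to new text reproduces the same segmentation, which is why BPE tokenization is deterministic once training is done.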