Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <http://universaldependencies.org/format.html>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <http://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf>.


This repository contains an R package which is an Rcpp wrapper around the UDPipe C++ library (http://ufal.mff.cuni.cz/udpipe, https://github.com/ufal/udpipe).

  • UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which are essential steps in natural language processing.
  • The techniques used are explained in detail in the paper: "Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe", available at http://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf. That paper also reports accuracies for different languages and processing speed (measured in words per second).

General

The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:

  • Give R users simple access in order to easily tokenize, tag, lemmatize or perform dependency parsing on text in any language
  • Provide easy access to pre-trained annotation models
  • Allow R users to easily construct their own annotation model based on data in CoNLL-U format as provided in more than 60 treebanks available at http://universaldependencies.org/#ud-treebanks
  • Don't rely on Python or Java so that R users can easily install this package without configuration hassle
  • No external R package dependencies except the strictly necessary ones (Rcpp and data.table, no tidyverse)

Installation & License

The package is available under the Mozilla Public License Version 2.0. Installation can be done as follows. Please visit the package documentation at https://bnosac.github.io/udpipe/en and look at the R package vignettes for further details.

install.packages("udpipe")
vignette("udpipe-tryitout", package = "udpipe")
vignette("udpipe-annotation", package = "udpipe")
vignette("udpipe-usecase-postagging-lemmatisation", package = "udpipe")
# An overview of keyword extraction techniques: https://bnosac.github.io/udpipe/docs/doc7.html
vignette("udpipe-usecase-topicmodelling", package = "udpipe")
vignette("udpipe-train", package = "udpipe")

For installing the development version of this package: devtools::install_github("bnosac/udpipe", build_vignettes = TRUE)

Example

Currently the package allows you to do tokenization, tagging, lemmatization and dependency parsing with one convenient function called udpipe:

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel

language                                                                             file_model
   dutch C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-alpino-ud-2.3-181115.udpipe

x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
            object = udmodel)
x
  doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos                     xpos                                                               feats head_token_id      dep_rel deps
   doc1            1           1     1   2       1        1        Ik        ik  PRON        Pron|per|1|ev|nom                          Case=Nom|Number=Sing|Person=1|PronType=Prs             2        nsubj <NA>
   doc1            1           1     4   7       2        2      ging        ga  VERB V|intrans|ovt|1of2of3|ev Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Intr|Tense=Past|VerbForm=Fin             0         root <NA>
   doc1            1           1     9  10       3        3        op        op   ADP                Prep|voor                                                        AdpType=Prep             4         case <NA>
   doc1            1           1    12  15       4        4      reis      reis  NOUN          N|soort|ev|neut                                                         Number=Sing             2          obj <NA>
   doc1            1           1    17  18       5        5        en        en CCONJ               Conj|neven                                                                <NA>             7           cc <NA>
   doc1            1           1    20  21       6        6        ik        ik  PRON        Pron|per|1|ev|nom                          Case=Nom|Number=Sing|Person=1|PronType=Prs             7        nsubj <NA>
   doc1            1           1    23  25       7        7       nam      neem  VERB   V|trans|ovt|1of2of3|ev Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Fin             2         conj <NA>
   doc1            1           1    27  29       8        8       mee       mee   ADV                Adv|deelv                                                        PartType=Vbp             7 compound:prt <NA>
   doc1            1           1    30  30       9        9         :         : PUNCT            Punc|dubbpunt                                                      PunctType=Colo             2        punct <NA>
...
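
The annotation is returned as a plain data.frame, so standard R subsetting applies. The sketch below is not package code: it rebuilds a few columns of the nine rows printed above by hand (so it runs without downloading a model) and shows three common lookups. The variable names anno, upos_freq, verb_lemmas and head_token are chosen here for illustration only.

```r
# Hand-copied subset of the annotation printed above (selected columns only).
anno <- data.frame(
  token_id      = as.character(1:9),
  token         = c("Ik", "ging", "op", "reis", "en", "ik", "nam", "mee", ":"),
  lemma         = c("ik", "ga", "op", "reis", "en", "ik", "neem", "mee", ":"),
  upos          = c("PRON", "VERB", "ADP", "NOUN", "CCONJ", "PRON", "VERB", "ADV", "PUNCT"),
  head_token_id = as.character(c(2, 0, 4, 2, 7, 7, 2, 7, 2)),
  dep_rel       = c("nsubj", "root", "case", "obj", "cc", "nsubj", "conj", "compound:prt", "punct"),
  stringsAsFactors = FALSE
)

# Frequency of the universal part-of-speech tags
upos_freq <- sort(table(anno$upos), decreasing = TRUE)

# Lemmas of all verbs
verb_lemmas <- anno$lemma[anno$upos == "VERB"]

# Look up the token each word depends on: head_token_id refers to token_id;
# the root has head 0, which matches no token and therefore yields NA.
anno$head_token <- anno$token[match(anno$head_token_id, anno$token_id)]

print(upos_freq)
print(verb_lemmas)
print(anno[, c("token", "dep_rel", "head_token")])
```

The head lookup is this simple only because the example contains a single sentence; with several documents or sentences, the match would have to be done within each doc_id and sentence_id.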

Pre-trained models

Pre-trained models built on Universal Dependencies treebanks are available for more than 60 languages, based on 90 treebanks, namely:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb.

These models are easily available to users of the package through udpipe_download_model.
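
As a rough sketch of how a downloaded model might be used step by step, rather than through the udpipe convenience function, the helper below chains udpipe_download_model, udpipe_load_model and udpipe_annotate. The helper name annotate_text and its arguments are hypothetical; it assumes the udpipe package is installed and that an internet connection is available when the model is fetched.

```r
# Hypothetical helper (not part of the package): download a model,
# load it and annotate a character vector of UTF-8 text.
annotate_text <- function(txt, language = "dutch", model_dir = tempdir()) {
  if (!requireNamespace("udpipe", quietly = TRUE)) {
    stop("The udpipe package is required: install.packages('udpipe')")
  }
  # Returns a data.frame whose file_model column holds the model path on disk
  dl <- udpipe::udpipe_download_model(language = language, model_dir = model_dir)
  model <- udpipe::udpipe_load_model(file = dl$file_model)
  # Tokenize, tag, lemmatize and parse, then convert to one row per token
  anno <- udpipe::udpipe_annotate(model, x = txt)
  as.data.frame(anno)
}

# Usage (needs network access to fetch the model):
# x <- annotate_text("Ik ging op reis en ik nam mee: mijn laptop.")
# table(x$upos)
```

Wrapping the steps in a function keeps the download from running when the file is sourced; nothing is fetched until the helper is called.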

How good are these models?

Train your own models based on CONLL-U data

The package also allows you to build your own annotation model. For this, you need to provide data in CoNLL-U format. Such data is provided for many languages at http://universaldependencies.org/#ud-treebanks, mostly under the CC-BY-SA license. How to train a model is detailed in the package vignette.

vignette("udpipe-train", package = "udpipe")
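
To make the expected training input more concrete: a CoNLL-U file holds one token per line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with #, and a blank line after each sentence. The snippet below is a hand-written illustration built from the first four tokens of the Dutch example shown earlier, not an excerpt of a real treebank, and it reads the lines with base R only. The helper parse_feats is hypothetical.

```r
# Hand-written CoNLL-U fragment (illustration only, not real treebank data).
conllu <- c(
  "# sent_id = 1",
  "# text = Ik ging op reis",
  "1\tIk\tik\tPRON\tPron|per|1|ev|nom\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t2\tnsubj\t_\t_",
  "2\tging\tga\tVERB\tV|intrans|ovt|1of2of3|ev\tAspect=Imp|Mood=Ind|Number=Sing|Subcat=Intr|Tense=Past|VerbForm=Fin\t0\troot\t_\t_",
  "3\top\top\tADP\tPrep|voor\tAdpType=Prep\t4\tcase\t_\t_",
  "4\treis\treis\tNOUN\tN|soort|ev|neut\tNumber=Sing\t2\tobj\t_\t_",
  ""
)

# Keep token lines only: drop comments and the blank sentence separator
token_lines <- conllu[nzchar(conllu) & !grepl("^#", conllu)]

# Ten tab-separated fields; "_" marks an empty value
tokens <- read.delim(
  text = token_lines, header = FALSE, sep = "\t", quote = "",
  na.strings = "_", stringsAsFactors = FALSE,
  col.names = c("id", "form", "lemma", "upos", "xpos",
                "feats", "head", "deprel", "deps", "misc")
)

# Hypothetical helper: split a FEATS string such as "Case=Nom|Number=Sing"
# into a named character vector
parse_feats <- function(feats) {
  if (is.na(feats)) return(character(0))
  pairs <- strsplit(strsplit(feats, "|", fixed = TRUE)[[1]], "=", fixed = TRUE)
  setNames(vapply(pairs, `[`, character(1), 2),
           vapply(pairs, `[`, character(1), 1))
}

print(tokens[, c("id", "form", "lemma", "upos", "head", "deprel")])
print(parse_feats(tokens$feats[1]))
```

Real treebank files can also contain multiword-token lines (IDs such as 1-2) and empty nodes (IDs such as 1.1), which this minimal reader does not handle.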

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

News

CHANGES IN udpipe VERSION 0.8.1

  • Allow to pass on a .udpipe filename in udpipe_download_model
  • Update documentation on keywords_collocation
  • Added strsplit.data.frame and paste.data.frame

CHANGES IN udpipe VERSION 0.8

  • Default of udpipe_download_model is now changed: it now downloads models built on Universal Dependencies 2.3 instead of the models built on Universal Dependencies 2.0
  • Incorporate models from Universal Dependencies 2.3 released on 2018-11-15
  • Incorporate models from conll18 shared task baseline built on Universal Dependencies 2.2
  • In case someone uses document_term_frequencies.character incorrectly with double document identifiers, make sure this is handled
  • txt_recode now returns x if the length of x is 0
  • added txt_sentiment
  • added txt_previousgram

CHANGES IN udpipe VERSION 0.7

  • Allow to reconstruct the original text + allow to add a start/end field in as.data.frame (useful but undocumented feature). Set up mainly to be used with the crfsuite R package
  • Added txt_tagsequence
  • Added 1 general function called udpipe which does annotation of data in TIF format.
  • Add option in udpipe_download_model to download the model only if it does not exist on disk
  • Loaded models are put into an environment such that users of the function udpipe do not need to care about loading

CHANGES IN udpipe VERSION 0.6.1

  • src/udpipe.cpp: at the request of CRAN: remove dynamic exception specification which g++-7 and later complain about by removing the throw statements
  • add ctb role to authors Milan and Jana in DESCRIPTION

CHANGES IN udpipe VERSION 0.6

  • Added cbind_morphological and cbind_dependencies
  • Allow to show progress in udpipe_annotate
  • txt_nextgram now does not paste NA's together in case someone would use it with missing text data
  • Add example on only doing pos tagging and dependency parsing and excluding tokenisation
  • Fix gcc8 message: warning: 'char* strncpy(char*, const char*, size_t)' specified bound 15 equals destination size [-Wstringop-truncation]

CHANGES IN udpipe VERSION 0.5

  • Added txt_recode_ngram for recoding tokens with compound multi-word expressions
  • Fix to make sure as.data.frame.udpipe_connlu also works with data.table version 1.9.6. Fixes issue #16
  • Allow keywords_rake to use in group a character vector of column names
  • Added a vignette on the use of the package to do topic modelling using the POS tags and multi-word expressions
  • Add example of correlation analysis in vignette on 'Basic Analytical Use Cases'
  • dtm_remove_lowfreq now uses minfreq as lower bound

CHANGES IN udpipe VERSION 0.4

  • Fix R CMD check on clang-UBSAN: UndefinedBehaviorSanitizer (runtime error: reference binding to misaligned address)
  • Add more documentation on required UTF-8 encoding
  • Add as_conllu
  • Add as_word2vec
  • Add as.data.table.udpipe_conllu for convenience
  • Add keywords_rake and keywords_collocation
  • Exported also keywords_collocation and keywords_phrases
  • Add document_term_frequencies_statistics
  • Add boilerplate functions dtm_rowsums and dtm_colsums
  • Make output of keywords_collocation, keywords_rake and keywords_phrases consistent
  • Allow cooccurrence.data.frame to provide a vector of groups
  • Added another vignette

CHANGES IN udpipe VERSION 0.3

  • Add docusaurus site
  • udpipe_download_model gains an extra argument called udpipe_model_repo to allow downloading models mainly released under CC-BY-SA from https://github.com/bnosac/udpipe.models.ud
  • Add udpipe_accuracy
  • Add dtm_rbind and dtm_cbind
  • Add udpipe_read_conllu to simplify creating wordvectors
  • Allow to provide several fields in document_term_frequencies to easily allow to include bigrams/trigrams/... for topic modelling purposes e.g. alongside the textrank package or alongside collocation
  • Adding Serbian + Afrikaans
  • Fixing UBSAN messages (misaligned addresses)
  • If user has R version < 3.3.0, use own startsWith function instead of base::startsWith

CHANGES IN udpipe VERSION 0.2.2

  • Another stab at fixing the Solaris compilation issue in ufal::udpipe::multiword_splitter::append_token

CHANGES IN udpipe VERSION 0.2.1

  • Added phrases to extract POS sequences more easily like noun phrases, verb phrases or any sequence of parts of speech tags and their corresponding words
  • Fix issue in txt_nextgram if n was larger than the number of elements in x
  • Fix heap-use-after-free address sanitiser issue
  • Fix runtime error: null pointer passed as argument 1, which is declared to never be null (e.g. udpipe.cpp: 3338)
  • Another stab at the Solaris compilation issue

CHANGES IN udpipe VERSION 0.2

  • Added data preparation elements for standard text mining flows, namely: cooccurrence, collocation, document_term_frequencies, document_term_matrix, dtm_tfidf, dtm_remove_terms, dtm_remove_lowfreq, dtm_remove_tfidf, dtm_reverse, dtm_cor, txt_collapse, txt_sample, txt_show, txt_highlight, txt_recode, txt_previous, txt_next, txt_nextgram, unique_identifier
  • Added predict.LDA_VEM and predict.LDA_Gibbs
  • Renamed dataset annotation_params to udpipe_annotation_params
  • Added example datasets called brussels_listings, brussels_reviews, brussels_reviews_anno
  • Use path.expand on conll-u files which are used for training
  • udpipe_download_model now downloads from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master instead of https://github.com/jwijffels/udpipe.models.ud.2.0/raw/master

CHANGES IN udpipe VERSION 0.1.2

  • Remove logic of UDPIPE_PROCESS_LOG (using Rcpp::Rout instead). This fixes issue detected with valgrind about ofstream

CHANGES IN udpipe VERSION 0.1.1

  • Fix issue on Solaris builds at CRAN, namely: error: expected primary-expression before ‘enum’
  • Use ufal::udpipe namespace directly
  • Documentation fixes

CHANGES IN udpipe VERSION 0.1

  • Initial release based on UDPipe commit a2ebb99d243546f64c95d0faf36882bb1d67a670
  • Allow to do annotation (tokenisation, POS tagging, Lemmatisation, Dependency parsing)
  • Allow to build your own UDPipe model based on data in CONLL-U format
  • Convert the output of udpipe_annotate to a data.frame
  • Allow to download models from https://github.com/jwijffels/udpipe.models.ud.2.0
  • Add vignettes

0.8.1 by Jan Wijffels, a month ago


https://bnosac.github.io/udpipe/en/index.html, https://github.com/bnosac/udpipe


Browse source code at https://github.com/cran/udpipe


Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph], Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [cph], Milan Straka [ctb, cph], Jana Straková [ctb, cph]


Task views: Natural Language Processing


MPL-2.0 license


Imports Rcpp, data.table, Matrix, methods

Suggests knitr, topicmodels, lattice

Linking to Rcpp

System requirements: C++11


Imported by TextForecast, corpustools.

Suggested by BTM, cleanNLP, crfsuite, ruimtehol, textrank.

Enhanced by NLP.

