Package for corpus analysis using the Corpus Workbench ('CWB', <http://cwb.sourceforge.net/>) as an efficient back end for indexing and querying large corpora. The package offers functionality to flexibly create subcorpora and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. Respective data structures (document-term matrices, term-co-occurrence matrices etc.) can be created based on the indexed corpora.
Purpose: The focus of the package ‘polmineR’ is the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.
Aims: Key aims for developing the package are:
To keep the original text accessible. A seamless integration of qualitative and quantitative steps in corpus analysis supports validation, based on inspecting the text behind the numbers.
To provide a library with standard tasks. It is an open source platform that will make text mining more productive, avoiding the prohibitive cost of reimplementing basics or of writing many lines of code to perform a basic task.
To create a package that makes the creation and analysis of subcorpora (‘partitions’) easy. A particular strength of the package is to support contrastive/comparative research.
To offer performance for users with a standard infrastructure. The package picks up the idea of a three-tier software design. Corpus data are managed and indexed by using the Open Corpus Workbench (CWB). The CWB is particularly efficient for storing large corpora and offers a powerful language for querying corpora, the Corpus Query Processor (CQP).
To support sharing consolidated and documented data, following the ideas of reproducible research.
Background: The polmineR-package was specifically developed to make full use of the XML annotation structure of the corpora created in the PolMine project (see polmine.sowi.uni-due.de). The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The polmineR-package is meant to help make full use of this rich annotation structure.
Upon loading polmineR, a message will report the version of the package and the location of a so-called ‘registry’-directory.
library(polmineR)
#> polmineR v0.7.10.9001
#> session registry: /private/var/folders/r6/1k6mxnbj5077980k11xvr0q40000gn/T/Rtmpe923aa/polmineR_registry
The session registry directory is populated with files that describe the corpora that are present and accessible on the user’s system.
Indexed corpora wrapped into R data packages can be installed from a (private) package repository.
install.packages("GermaParl", repos = "")
install.packages("europarl.en", repos = "")
The use()-function will activate a corpus included in a data package. The registry files describing the corpora in a package are added to the session registry directory.
use("europarl.en") # activate the corpus in the europarl-en package
#> ... activating corpus: europarl-en
An advantage of keeping corpora in data packages is that they benefit from the versioning and documentation mechanisms that are the hallmark of packages. Of course, polmineR will also work with the library of CWB indexed corpora stored on your machine. The corpora described in the registry directory defined by the environment variable CORPUS_REGISTRY will be added to the session registry directory when loading polmineR.
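As a minimal sketch, the environment variable could be set and inspected from within R before polmineR is loaded. The path below is hypothetical and needs to be replaced by the location of your own registry directory; Sys.setenv() and Sys.getenv() are base R functions.

```r
# Hypothetical location of a CWB registry directory; replace with your own path.
registry_dir <- file.path(path.expand("~"), "cwb", "registry")

# Set the variable for the current R session only, before calling library(polmineR).
Sys.setenv(CORPUS_REGISTRY = registry_dir)

# Inspect the value that is visible to the session.
Sys.getenv("CORPUS_REGISTRY")
```

Setting the variable in this way only affects the running session; a persistent setting would typically go into a startup file such as .Renviron.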
All methods can be applied to a whole corpus, as well as to partitions (i.e. subcorpora). Use the metadata of a corpus (so-called s-attributes) to define a subcorpus.
ep2005 <- partition("EUROPARL-EN", text_year = "2006")
#> ... get encoding: latin1
#> ... get cpos and strucs
size(ep2005)
#>  3100529
barroso <- partition("EUROPARL-EN", speaker_name = "Barroso", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
size(barroso)
#>  98142
Partitions can be bundled into partition_bundle objects, and most methods can be applied to a whole corpus, a partition, or a partition_bundle object alike. Consult the package vignette to learn more.
Counting occurrences of a feature in a corpus, a partition or in the partitions of a partition_bundle is a basic operation. By offering access to the Corpus Query Processor (CQP), the polmineR package exposes a query syntax that goes far beyond regular expressions. See the CQP documentation to learn more.
count("EUROPARL-EN", "France")
#>     query count         freq
#> 1: France  5517 0.0001399122
count("EUROPARL-EN", c("France", "Germany", "Britain", "Spain", "Italy", "Denmark", "Poland"))
#>      query count         freq
#> 1:  France  5517 1.399122e-04
#> 2: Germany  4196 1.064114e-04
#> 3: Britain  1708 4.331523e-05
#> 4:   Spain  3378 8.566676e-05
#> 5:   Italy  3209 8.138089e-05
#> 6: Denmark  1615 4.095673e-05
#> 7:  Poland  1820 4.615557e-05
count("EUROPARL-EN", '"[pP]opulism"')
#>            query count         freq
#> 1: "[pP]opulism"   107 2.713542e-06
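The freq column can be read as a relative frequency. The following base R sketch assumes (this is an assumption, not a statement about the internals of polmineR) that freq is the count divided by the size of the corpus. Under that assumption, each row of the output above should imply roughly the same corpus size.

```r
# Counts and relative frequencies copied from the output above.
counts <- c(France = 5517, Germany = 4196, Britain = 1708)
freqs  <- c(France = 1.399122e-04, Germany = 1.064114e-04, Britain = 4.331523e-05)

# Assumption: freq = count / corpus size, so count / freq recovers the size.
implied_size <- counts / freqs
round(implied_size)

# Relative spread of the implied sizes; small values indicate that the rows
# agree up to the rounding of the printed frequencies.
spread <- (max(implied_size) - min(implied_size)) / mean(implied_size)
spread
```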
The dispersion method is there to analyse the dispersion of a query, or a set of queries across one or two dimensions (absolute and relative frequencies). The CQP syntax can be used.
populism <- dispersion("EUROPARL-EN", "populism", s_attribute = "text_year", progress = FALSE)
popRegex <- dispersion("EUROPARL-EN", '"[pP]opulism"', s_attribute = "text_year", cqp = TRUE, progress = FALSE)
The cooccurrences method is used to analyse the context of a query (including some statistics).
islam <- cooccurrences("EUROPARL-EN", query = 'Islam', left = 10, right = 10)
islam <- subset(islam, rank_ll <= 100)
dotplot(islam)
islam
Compare partitions to identify features / keywords (using statistical tests such as chi square).
ep2002 <- partition("EUROPARL-EN", text_year = "2002", p_attribute = "word")
epPre911 <- partition("EUROPARL-EN", text_year = 1997:2001, p_attribute = "word")
y <- features(ep2002, epPre911, included = FALSE)
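To make the idea behind such a comparison more tangible, here is a base R sketch of a chi-squared test for a single term. All counts are hypothetical and chosen for illustration only; the sketch uses chisq.test() from the stats package and is not meant to reproduce how features() computes its statistics.

```r
# Hypothetical counts, for illustration only (not taken from EUROPARL-EN).
count_coi <- 120      # occurrences of a term in the corpus of interest
size_coi  <- 100000   # total number of tokens in the corpus of interest
count_ref <- 150      # occurrences of the same term in the reference corpus
size_ref  <- 500000   # total number of tokens in the reference corpus

# 2 x 2 contingency table: the term vs. all other tokens, by corpus.
tab <- matrix(
  c(count_coi, size_coi - count_coi,
    count_ref, size_ref - count_ref),
  nrow = 2, byrow = TRUE,
  dimnames = list(corpus = c("coi", "ref"), token = c("term", "other"))
)

# Pearson chi-squared test without continuity correction.
result <- chisq.test(tab, correct = FALSE)
result$statistic
result$p.value
```

A large test statistic indicates that the term is distributed unevenly across the two corpora, which is what makes it a candidate keyword.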
So what happens in the context of a word, or a CQP query? To attain valid research results, reading will often be necessary. The kwic method will help, and uses the conveniences of DataTables, outputted in the Viewer pane of RStudio.
kwic("EUROPARL-EN", "Islam", meta = c("text_date", "speaker_name"))
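What a keyword-in-context display does can be sketched in a few lines of base R. The token vector and the helper kwic_sketch() are hypothetical and serve illustration only; the kwic method of the package works on indexed corpora and offers far more.

```r
# Toy token stream (hypothetical), standing in for an indexed corpus.
tokens <- c("the", "debate", "on", "Islam", "in", "Europe", "and",
            "the", "role", "of", "Islam", "today")

# Hypothetical helper: one row per match, with up to `left` / `right`
# tokens of context, truncated at the boundaries of the token stream.
kwic_sketch <- function(tokens, query, left = 2, right = 2) {
  hits <- which(tokens == query)
  rows <- lapply(hits, function(i) {
    left_idx  <- seq(max(1, i - left), length.out = min(left, i - 1))
    right_idx <- seq(i + 1, length.out = min(right, length(tokens) - i))
    data.frame(
      left  = paste(tokens[left_idx], collapse = " "),
      node  = tokens[i],
      right = paste(tokens[right_idx], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

concordance <- kwic_sketch(tokens, "Islam")
concordance
```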
Corpus analysis involves moving from text to numbers, and back again. Use the read method to inspect the full text of a partition (a speech given by Chancellor Angela Merkel in this case).
use("GermaParl")
merkel <- partition("GERMAPARL", speaker = "Angela Merkel", date = "2013-09-03")
read(merkel)
Many advanced methods in text mining require term document matrices as input. Based on the metadata of a corpus, these data structures can be obtained in a fast and flexible manner, for performing topic modelling, machine learning etc.
use("europarl.en")
speakers <- partition_bundle(
  "EUROPARL-EN", s_attribute = "speaker_id",
  progress = FALSE, verbose = FALSE
)
speakers_count <- count(speakers, p_attribute = "word", progress = TRUE)
tdm <- as.TermDocumentMatrix(speakers_count, col = "count")
dim(tdm)
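For readers unfamiliar with the data structure, the following base R sketch shows what a term-document matrix holds: terms in rows, documents in columns, counts in cells. The documents are hypothetical toy data, and the approach with table() is for illustration only; it is not how the package builds its matrices.

```r
# Hypothetical tokenised documents, one character vector per speaker.
docs <- list(
  speaker_a = c("europe", "union", "europe", "budget"),
  speaker_b = c("budget", "reform", "union")
)

# Long format: one row per token, labelled with the document it belongs to.
long <- data.frame(
  term = unlist(docs, use.names = FALSE),
  doc  = rep(names(docs), times = lengths(docs))
)

# Cross-tabulate terms against documents to obtain a term-document matrix.
tdm_sketch <- unclass(table(long$term, long$doc))
dim(tdm_sketch)
tdm_sketch
```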
The CRAN release of polmineR can be installed using install.packages(). All dependencies will be installed, too.
To install the most recent development version that is hosted in a GitHub repository, use the installation mechanism offered by the devtools package.
install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")
Check the installation by loading polmineR and activating the corpora included in the package.
The following instructions for Mac users assume that R is installed on your system. Binaries are available from the Homepage of the R Project. An installation of RStudio is highly recommended. Get the Open Source License version of RStudio Desktop.
At this stage, the RcppCWB dependency is not available as a pre-compiled binary and needs to be compiled. A set of system requirements needs to be fulfilled to do this.
First, you will need an installation of Xcode, which you can get via the Mac App Store. You will also need the Command Line Tools for Xcode. They can be installed from a terminal with:
Please make sure that you agree to the license.
Second, an installation of XQuartz is required. It can be obtained from www.xquartz.org.
Third, to fulfill the system requirements of the RcppCWB package, the Glib and pcre libraries need to be installed. Using a package manager makes things considerably easier. We recommend using ‘Homebrew’. To install Homebrew, follow the instructions on the Homebrew Homepage. The following commands then need to be executed from a terminal window. They will install the C libraries that the RcppCWB package relies on:
brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre --universal
brew -v install readline
The latest release of polmineR can be installed from CRAN using the
The development version of polmineR can be installed using devtools:
install.packages("devtools") # unless devtools is already installed
devtools::install_github("PolMine/polmineR", ref = "dev")
Check whether everything works by loading polmineR, and activating the demo corpora included in the package.
If you have not yet installed R on your Ubuntu machine, good instructions are available at ubuntuuser. To install base R, enter the following in the terminal:
sudo apt-get install r-base r-recommended
Make sure that you have installed the latest version of R. The following commands will add the R repository to the package sources and run an update. The second line assumes that you are using Ubuntu 16.04.
sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com E084DAB9
sudo add-apt-repository 'deb xenial/'
sudo apt-get update
sudo apt-get upgrade
It is highly recommended to install RStudio, a powerful IDE for R. Output of polmineR methods is generally optimized to be displayed using RStudio facilities. If you are working on a remote server, running RStudio Server may be an interesting option to consider.
The RcppCWB package, the interface used by polmineR to query CWB corpora, will require the pcre, glib and pkg-config libraries. They can be installed as follows. In addition libxml2 is installed, a dependency of the R package xml2 that is used for manipulating html output.
sudo apt-get install libglib2.0-dev libssl-dev libcurl4-openssl-dev
sudo apt-get install libxml2-dev
sudo apt-get install libprotobuf-dev
The system requirements will now be fulfilled. From R, install dependencies for rcqp/polmineR first, and then rcqp and polmineR.
Use devtools to install the development version of polmineR from GitHub.
install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")
You may want to install packaged corpora to run the examples in the vignette and the man pages.
To have access to all package functions and to run all package tests, the installation of further system requirements and packages is required. The xlsx dependency requires that rJava is installed and configured for R. That is done on the shell:
sudo apt-get install openjdk-8-jre
sudo R CMD javareconf
To run package tests including (re-)building the manual and vignettes, a working installation of LaTeX is required, too. Be aware that this may be a time-consuming operation.
sudo apt-get install texlive-full texlive-xetex
Now install the remaining packages from within R.
install.packages(pkgs = c("rJava", "xlsx", "tidytext"))
A Cooccurrences()-method and a Cooccurrences-class have been migrated from the (experimental) polmineR.graph package to polmineR to generate and manage all cooccurrences in a corpus.
The cooccurrences()-method produces a subset of the Cooccurrences-class object and is the basis for ensuring that results are identical.
data_dir() will return this temporary data directory. The use()-function will now check for non-ASCII characters in the path to binary corpus data and move the corpus data to the temporary data directory (a subdirectory of the directory returned by data_dir()), if necessary. An argument of use() will force using a temporary directory. The temporary files are removed when the package is detached.
zoom()-method. See documentation for (new) ?"corpus-class" and extended documentation for ?"partition-class". A new corpus()-method for character vector serves as a constructor. This is the beginning of a rearrangement of the class structure: The regions-class now inherits from the new corpus-class, and a new subcorpus-class inherits from the
check_cqp_query() offers a preliminary check whether a CQP query may be faulty. It is used by the cpos()-method, if the new argument check is TRUE. All higher-level functions calling cpos() also include this new argument. Faulty queries may still cause a crash of the R session, but the most common source is now prevented, hopefully.
A format()-method is defined for features, moving the formatting of tables out of the print()-methods. This will be useful when including tables in Rmarkdown documents.
data_dir() now accept an argument pkg. The functions will return the path to the registry directory / the data directory within a package, if the argument is used.
The data.table-package used to be imported entirely; now the package is imported selectively. To avoid namespace conflicts, the former S4 method as.data.table() is now an S3 method. Warnings appearing if the data.table package is loaded after polmineR are now omitted.
coerce()-methods to turn kwic objects into htmlwidgets now set a
An argument p_attribute has been added to the kwic()-methods, and methods to process kwic-objects are now able to use the attribute thus indicated, and not just the p-attribute "word".
context-objects will return the size of the corpus of interest (coi) and the reference corpus (ref).
encoding()-method for character vector.
name()-method for character vector.
context-objects will return the
stat-slot with the counts for the tokens in the window.
A decode()-function replaces a decode()-method and can be applied to partitions. The return value is a data.table which can be coerced to a tibble, serving as an interface to tidytext (#37).
ngrams()-method will work for corpora, and a new
textstat-object generates a proper output (#27).
tempdir() is wrapped into normalizePath(..., winslash = "/"), to avoid a mixture of file separators in a path, which may cause problems on Windows systems.
kwic()-method for corpora returned one surplus token to the left and to the right of the query. The excess tokens are now removed.
character-objects method did not include the correct position of matches in the
as.speeches()-method, an error could occur when an empty partition had been generated accidentally. Has been removed. (#50)
The as.VCorpus()-method is not available if the tm-package has been loaded previously. A coerce method (as(OBJECT, "VCorpus")) solves the issue. The as.VCorpus()-method is still around, but serves as a wrapper for the formal coerce-method (#55).
The argument verbose as used by the use()-method did not have any effect. Now, messages are not reported, as would be expected, if verbose is FALSE. On this occasion, we took care that corpora that are activated are now reported in capital letters, which is consistent with the uppercase logic you need to follow when using corpora. (#47)
context()-method would occur at the very beginning or very end of a corpus and the window would transgress the beginning / end of the corpus without being checked (#44).
as.speeches()-function caused an error when the type of the partition was not defined. Solved (#57).
partition_bundle if the partitions in the partition_bundle were not named. The fix is to assign integer numbers as names to the partitions (#58).
chisquare()-methods to make the statistical procedure used transparent.
cooccurrences()-method to explain subsetting results vs applying positivelist/negativelist (#28).
textstat-objects that will show up in documentation of
decode()-function, using the REUTERS corpus replaces the usage of the GERMAPARLMINI corpus, to reduce time consumed when checking the package.
weigh()-method has been implemented for the classes
count_bundle. Via inheritance, it will also be available for the
partition_bundle-classes. Then, a new
partition-class objects is introduced. If the object has been weighed, the list that is returned will include a report on weights. There is an example that explains the workflow.
context-objects has been reworked entirely (and is working again); a new
context-objects has been introduced. Both steps are intended for workflows for dictionary-based sentiment analysis.
The highlight()-method is now implemented for class kwic. You can highlight words in the neighborhood of a node that are part of a dictionary.
kwic-objects offers a seamless inclusion of analyses in Rmarkdown documents.
coerce()-method to turn a
kwic-object into a htmlwidget has been singled out from the
kwic-objects. Now it is possible to generate a htmlwidget from a kwic object, and to include the widget into a Rmarkdown document.
coerce()-method to turn
textstat-objects into an htmlwidget (DataTable), very useful for Rmarkdown documents such as slides.
html()-method will allow to define a scroll box. Useful to embed a fulltext output to a Rmarkdown document.
partition_bundle-class, rather than inheriting from
bundle-class directly, will now inherit from the
use()-function is limited now to activating the corpus in data packages. Having introduced the session registry, switching registry directories is not needed any more.
The as.regions()-function has been turned into an as.regions()-method to have a more generic tool.
context-method, so that full use of data.table speeds up things.
highlight()-method allows definitions of terms to be highlighted to be passed in via three dots (...); no explicit list necessary.
as.character()-method for kwic-class objects is introduced.
size_coi-slot (coi for corpus of interest) of the
context-object included the node; the node (i.e. matches for queries) is excluded now from the count of size_coi.
use(), the registry directory is reset for CQP, so that the corpora in the package that have been activated can be used with CQP syntax.
partition-objects: "fast track" was activated without preconditions.
kwic-output after highlighting.
meta has been renamed to
context-objects, and for the
s_attribute to check for integrity within a struc has been renamed into
kwic-objects has been reworked thoroughly.