'Rcpp' Bindings for the 'Corpus Workbench' ('CWB')

'Rcpp' Bindings for the C code of the 'Corpus Workbench' ('CWB'), an indexing and query engine to efficiently analyze large corpora (<http://cwb.sourceforge.net>). 'RcppCWB' is licensed under the GNU GPL-3, in line with the GPL-3 license of the 'CWB' (<https://www.r-project.org/Licenses/GPL-3>). The 'CWB' relies on 'pcre' (BSD license, see <https://www.pcre.org/licence.txt>) and 'GLib' (LGPL license, see <https://www.gnu.org/licenses/lgpl-3.0.en.html>). See the file LICENSE.note for further information. The package includes modified code of the 'rcqp' package (GPL-2, see <https://cran.r-project.org/package=rcqp>). The original work of the authors of the 'rcqp' package is acknowledged with great respect, and they are listed as authors of this package. To achieve cross-platform portability (including Windows), 'RcppCWB' uses 'Rcpp' for its wrapper code.



The package exposes functions of the Corpus Workbench (CWB) by way of Rcpp wrappers. Furthermore, the package includes Rcpp code for performance-critical operations. The main purpose of the package is to serve as an interface to the CWB for the package polmineR.

There is a huge intellectual debt to the developers of the R-package ‘rcqp’, Bernard Desgraupes and Sylvain Loiseau. The main impetus for developing RcppCWB is that using Rcpp makes it easier to maintain the package, to expand the CWB functionality exposed, and, most importantly, to make it portable to Windows systems.

Installation on Windows

Pre-compiled binaries of the package ‘RcppCWB’ can be obtained from CRAN.

install.packages("RcppCWB")

If you want to get the development version, you need to compile RcppCWB yourself. Having Rtools installed on your system is necessary. Using the mechanism offered by the devtools package, you can install RcppCWB from GitHub.

if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB")

During the installation, cross-compiled versions of the corpus library (CL) are downloaded from the GitHub repository PolMine/libcl. It is not necessary to install dependencies.

Installation on Ubuntu

The package includes the source code of the Corpus Workbench (CWB), slightly modified to make it compatible with R requirements. Compiling the CWB requires the pcre and GLib libraries to be present. Using the APT package manager (Ubuntu/Debian), running the following command from the shell will fulfill these dependencies.

sudo apt-get install libpcre3-dev libglib2.0-dev

Then, use the conventional R installation mechanism to install the R dependencies and the CRAN release of RcppCWB.

install.packages(pkgs = c("Rcpp", "knitr", "testthat"))
install.packages("RcppCWB")

To install the development version, using the mechanism offered by the devtools package is recommended.

if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB")

Installation on macOS

On macOS, the pcre and GLib libraries need to be present. We recommend using ‘Homebrew’ as a package manager for macOS. To install Homebrew, follow the instructions on the Homebrew website. It may also be necessary to install Xcode and XQuartz.

The following commands then need to be executed from a terminal window. They will install the C libraries the CWB relies on:

brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre --universal
brew -v install readline

Then open R and use the conventional R installation mechanism to install the R dependencies and the CRAN release of RcppCWB.

install.packages(pkgs = c("Rcpp", "knitr", "testthat"))
install.packages("RcppCWB")

To install the development version, using the mechanism offered by the devtools package is recommended.

if (!"devtools" %in% installed.packages()[,"Package"]) install.packages("devtools")
devtools::install_github("PolMine/RcppCWB")

Usage

The package offers low-level access to CWB-indexed corpora. Using RcppCWB may not be intuitive at the outset: it is designed to serve as an efficient backend for packages offering higher-level functionality, such as polmineR.

RcppCWB includes a small sample corpus called ‘REUTERS’. After loading the package, we need to determine whether we can use the registry describing the corpus within the package, or whether we need to work with a temporary registry.

library(RcppCWB)
if (!check_pkg_registry_files()){
  registry <- use_tmp_registry()
} else {
  registry <- get_pkg_registry()
}

To start with, we get the number of tokens of the corpus.

cpos_total <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = registry
)
cpos_total
## [1] 4050

To decode the token stream of the corpus.

token_stream_str <- cl_cpos2str(
  corpus = "REUTERS", p_attribute = "word",
  cpos = seq.int(from = 0, to = cpos_total - 1),
  registry = registry
)
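
Corpus positions are zero-based, which is why the sequence above runs from 0 to cpos_total - 1, while R vectors are one-based. The following is a minimal sketch of how one might look up tokens by corpus position in the decoded vector. The helper tokens_at() is hypothetical and not part of RcppCWB, and the demo vector is made-up stand-in data so that the snippet runs without a corpus.

```r
# Hypothetical helper (not part of RcppCWB): look up tokens by zero-based
# corpus position in a vector that was decoded from position 0 onwards.
tokens_at <- function(token_stream, cpos) {
  # R indexes from 1, corpus positions start at 0, hence the offset.
  token_stream[cpos + 1L]
}

# Made-up stand-in data; with the sample corpus, use token_stream_str instead.
demo_stream <- c("the", "price", "of", "oil", "rose")
tokens_at(demo_stream, cpos = c(0L, 3L))
```

With real data, a call such as tokens_at(token_stream_str, cpos = c(14, 15)) would return the tokens at those two corpus positions.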

To get the corpus positions of a token.

token_to_get <- "oil"
id_oil <- cl_str2id(corpus = "REUTERS", p_attribute = "word", str = token_to_get)
cpos_oil <- cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id_oil)

To get the frequency of the token.

oil_freq <- cl_id2freq(corpus = "REUTERS", p_attribute = "word", id = id_oil)

Using regular expressions.

ids <- cl_regex2id(corpus = "REUTERS", p_attribute = "word", regex = "M.*")
m_words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", id = ids)
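
The lookups above can be combined, for instance to rank the words matching a regular expression by frequency. The following is a sketch under assumptions: rank_by_freq() is a hypothetical helper, not part of RcppCWB; the first call uses made-up numbers; and the guarded part assumes the RcppCWB 0.2.8 functions shown in this README, including a registry argument for each cl_ function.

```r
# Hypothetical helper (not part of RcppCWB): pair words with their
# frequencies and sort from most to least frequent.
rank_by_freq <- function(words, freqs) {
  sort(stats::setNames(freqs, words), decreasing = TRUE)
}

# Made-up stand-in data so the helper can be tried without a corpus.
rank_by_freq(words = c("May", "Mexico", "Mr"), freqs = c(4L, 9L, 2L))

# With the package and its sample corpus available, feed the helper from
# cl_regex2id(), cl_id2str() and cl_id2freq() (assumed 0.2.8 API).
if (requireNamespace("RcppCWB", quietly = TRUE)) {
  registry <- if (RcppCWB::check_pkg_registry_files()) {
    RcppCWB::get_pkg_registry()
  } else {
    RcppCWB::use_tmp_registry()
  }
  ids <- RcppCWB::cl_regex2id(
    corpus = "REUTERS", p_attribute = "word",
    regex = "M.*", registry = registry
  )
  words <- RcppCWB::cl_id2str(
    corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry
  )
  freqs <- RcppCWB::cl_id2freq(
    corpus = "REUTERS", p_attribute = "word", id = ids, registry = registry
  )
  print(head(rank_by_freq(words, freqs)))
}
```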

To use the CQP syntax, we need to initialize CQP first.

cqp_initialize(registry = registry)
## Warning in cqp_initialize(registry = registry): CQP has already been
## initialized. Re-initialization is not possible. Only resetting registry.

## [1] TRUE
cqp_query(corpus = "REUTERS", query = '"crude" "oil"')
## NULL
cpos <- cqp_dump_subcorpus(corpus = "REUTERS")
cpos
##       [,1] [,2]
##  [1,]   14   15
##  [2,]   56   57
##  [3,]  548  549
##  [4,]  584  585
##  [5,]  607  608
##  [6,] 2497 2498
##  [7,] 2842 2843
##  [8,] 2891 2892
##  [9,] 2928 2929
## [10,] 3644 3645
## [11,] 3709 3710
## [12,] 3998 3999
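
Each row of this matrix holds the first and the last corpus position of one match. The following is a minimal sketch of how such regions might be turned back into text with cl_cpos2str(). The helper expand_regions() is hypothetical and not part of RcppCWB, the three regions are copied by hand from the output above, and the guarded part assumes the RcppCWB 0.2.8 functions used earlier in this README.

```r
# Hypothetical helper (not part of RcppCWB): expand each [start, end] row of
# a match matrix into the full sequence of corpus positions it covers.
expand_regions <- function(regions) {
  lapply(
    seq_len(nrow(regions)),
    function(i) seq.int(from = regions[i, 1], to = regions[i, 2])
  )
}

# First three rows of the matrix printed above, typed in by hand.
regions <- matrix(c(14L, 15L, 56L, 57L, 548L, 549L), ncol = 2, byrow = TRUE)
cpos_list <- expand_regions(regions)

# Decoding needs the package and its sample corpus, so this part is guarded.
if (requireNamespace("RcppCWB", quietly = TRUE)) {
  registry <- if (RcppCWB::check_pkg_registry_files()) {
    RcppCWB::get_pkg_registry()
  } else {
    RcppCWB::use_tmp_registry()
  }
  matches <- vapply(
    cpos_list,
    function(cpos) {
      tokens <- RcppCWB::cl_cpos2str(
        corpus = "REUTERS", p_attribute = "word",
        cpos = cpos, registry = registry
      )
      paste(tokens, collapse = " ")
    },
    character(1)
  )
  print(matches)
}
```

Working on the full matrix returned by cqp_dump_subcorpus() only requires passing cpos instead of the hand-typed regions.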

License

The package is licensed under the GNU General Public License 3. For the copyrights for the ‘Corpus Workbench’ (CWB) and acknowledgement of authorship, see the file COPYRIGHTS.

Acknowledgements

There is a huge intellectual debt to the developers of the R-package ‘rcqp’, Bernard Desgraupes and Sylvain Loiseau. Developing RcppCWB would have been unthinkable without their original work to wrap the CWB into an R package.

The CWB is a classic and mature tool: The work of the CWB developers, Oliver Christ, Bruno Maximilian Schulze, Arne Fitschen and Stefan Evert is gratefully acknowledged.

News

RcppCWB 0.2.8

  • There have been (minor) modifications of the C code of the CWB so that compilation succeeds on Solaris.
  • Using the '-C' flag in the CWB Makefiles has been replaced by 'cd cl' / 'cd cqp' to reduce dependence on GNU make. GNU make is still required, however, because of 'include' statements in the Makefiles.
  • Removed an action on 'depend.mk' from 'cleanup' script to avoid error messages that depend.mk is not present when Makefiles are first loaded.
  • Dummy depend.mk files will satisfy include statement in Makefiles when running 'make clean' (depend.mk files are created only when running depend.mk)
  • For creating the index of the static archives (libcl, libcqp, libcwb), a call to 'ranlib' has been replaced by an equivalent 'ar -s' in the Makefiles, but commented out.
  • In the platform-specific config files of the CWB, the '-march'-option has been taken out, to safeguard portability.
  • To meet the requirements of the upcoming changes in the CRAN check process to use staged installs, the procedure to reset the paths in the test data within the package has been replaced throughout by using a temporary registry directory. The function get_tmp_registry() returns the location of this directory.

RcppCWB 0.2.7

  • If glib-2.0 is not present on macOS, binaries of the static library and header files are downloaded from a GitHub repo. This prepares RcppCWB to pass the macOS checks on CRAN machines.
  • A slight modification of the C code will now prevent previous crashes resulting from a faulty CQP syntax. The solution will not yet be effective for Windows systems until we have recompiled the libcqp static library that is downloaded during the installation process.
  • A new C++-level function 'check_corpus' checks whether a given corpus is available and is used by the check_corpus()-function. Problems with the previous implementation, which relied on files in the registry directory to ensure the presence of a corpus, should no longer occur.
  • Calling the 'find_readline.perl' utility script is omitted on macOS, so previous warning messages when running the makefile do not show up any more.

RcppCWB 0.2.6

  • Function cl_charset_name() is exposed. It returns the charset of a corpus, which is faster than parsing the registry file again and again.
  • A new cl_delete_corpus()-function can remove loaded corpora from memory.

RcppCWB 0.2.5

  • In Makevars.win, libiconv is explicitly linked, to make RcppCWB compatible with new release of Rtools.
  • regex in check_s_attribute() for parsing registry file improved so that it does not produce an error if '# [attribute]' follows after declaration of s_attribute

RcppCWB 0.2.4

  • for linux and macOS, CWB 3.4.14 included, so that UTF-8 support is realized
  • bug removed in check_cqp_query that would prevent special characters from working in CQP queries
  • check_strucs, check_cpos and check_id are checking for NAs now to avoid crashes
  • cwb command line tools cwb-makeall, cwb-huffcode and cwb-compress-rdx exposed as cwb_makeall, cwb_huffcode and cwb_compress_rdx

RcppCWB 0.2.3

  • when loading the package, a check is performed to make sure that paths in the registry files point to the data files of the sample data (issues may occur when installing binaries)
  • auxiliary functions to check whether input to Rcpp-wrappers/C functions is valid are now exported and documented
  • more consistent validity checks of input to functions for structural attributes

RcppCWB 0.2.2

  • Compiling RcppCWB on unix-like systems (macOS, Linux) will work now without the presence of glib (on Windows, the dependency persists).
  • The presence of the bison parser is not required any more. The package includes the C source generated by the bison parser along with the original input files.
  • Functionality to generate CWB-indexed corpora and to generate and manipulate the registry file describing a corpus has been moved to a new package 'cwbtools' (see https://www.github.com/PolMine/cwbtools) in order to maintain a clearly defined scope of RcppCWB to expose functionality of the C code of the CWB.
  • Minor intervention in function 'valid_subcorpus_name' to omit a -Wtautological-pointer-compare warning leading to a WARNING when checking package for R 3.5.0 with option --as-cran

RcppCWB 0.2.1

  • In previous versions the drive of the working directory and of the registry/data directory had to be identical on Windows; this limitation does not persist;
  • Some utility functions could be removed that were necessary to check the identity of the drives of the working directory and the data.

RcppCWB 0.2.0

  • In addition to low-level functionality of the corpus library (CL), functions of the Corpus Query Processor (CQP) are exposed, building on C wrappers in the rcqp package;
  • The authors of the rcqp package (Bernard Desgraupes and Sylvain Loiseau) are mentioned as package authors and as authors of functions using CQP, as the code used to expose CQP functionality is a modified version of rcqp code;
  • Extended package description explaining the rationale for developing the RcppCWB package;
  • Documentation of functions has been rearranged, many examples have been included;
  • Renaming of exposed functions of corpus library from cwb_... to cl_...;
  • sanity checks in R wrappers for Rcpp functions.

RcppCWB 0.1.7

  • CWB source code included in package to be GPL compliant
  • template to adjust HOME and INFO in registry file used (tools/setpaths.R)
  • using VignetteBuilder has been removed
  • definition of Rprintf in cwb/cl/macros.c

RcppCWB 0.1.6

  • now using configure/configure.win script in combination with setpaths.R

RcppCWB 0.1.1

  • vignette included that explains cross-compiling CWB for Windows
  • check in struc2str to ensure that structure has attributes

RcppCWB 0.1.0

  • Windows compatibility (potentially still limited)


0.2.8 by Andreas Blaette, 7 months ago


https://www.github.com/PolMine/RcppCWB


Report a bug at https://github.com/PolMine/RcppCWB/issues


Browse source code at https://github.com/cran/RcppCWB


Authors: Andreas Blaette [aut, cre] , Bernard Desgraupes [aut] , Sylvain Loiseau [aut] , Oliver Christ [ctb] , Bruno Maximilian Schulze [ctb] , Stefan Evert [ctb] , Arne Fitschen [ctb]




GPL-3 license


Imports Rcpp

Suggests knitr, testthat

Linking to Rcpp

System requirements: GNU make, pcre (>= 7), GLib (>= 2.0.0). On Windows, no prior installations are necessary, as pre-built (i.e. cross-compiled) binaries of required libraries are downloaded from a GitHub repository (<https://github.com/PolMine/libcl>) during installation. On macOS, static libraries of GLib are downloaded (<https://github.com/PolMine/libglib>) if GLib is not present.


Imported by polmineR.

