Statistics and Data Sets for Corpus Frequency Data

Utility functions for the statistical analysis of corpus frequency data. This package is a companion to the open-source course "Statistical Inference: A Gentle Introduction for Computational Linguists and Similar Creatures" ('SIGIL').


Version 0.5:

  • consolidated and revised example data sets included in the package
  • various small but convenient utility functions added
  • maintainer information updated

Version 0.4-9:

  • transitional version in which the data sets and utility functions for the SIGIL course were moved into a separate package
  • since the new SIGIL package was not accepted by CRAN, the data sets have been moved back into corpora
  • the intended redesign of the corpora package was cancelled
  • in future, the corpora package will be used to collect miscellaneous utility functions for analyzing corpus frequency data

Version 0.4-3:

  • interim release to ensure compatibility with stricter CRAN checks
  • added data set with Biber register features for all BNC texts (from Gasthaus 2007)
  • some minor corrections

Version 0.4-1:

  • large simulated census data set for examples and illustrations in the SIGIL course
  • simulated type-token statistics from English Wikipedia (based on Wackypedia corpus)
  • convenience function for random samples of rows from a data frame (sample.df)

Version 0.4-0:

  • re-launch of the corpora package on R-Forge
  • the first version, 0.4-0, has only minor changes compared to the previous release 0.3-2



0.5 by Stefan Evert, 2 years ago


Authors: Stefan Evert

Documentation: PDF Manual

Task views: Natural Language Processing

GPL-3 license

Imports methods, stats, utils, grDevices

See at CRAN