Distributional Semantic Models in R

An interactive laboratory for research on distributional semantic models ('DSM', see <https://en.wikipedia.org/wiki/Distributional_semantics> for more information).


Version 0.2-0:

  • the first public release of the wordspace package \o/

Version 0.1-24:

  • read.dsm.triplet can now load marginal frequencies from separate files, so it is a full-fledged replacement for read.dsm.ucs (though less memory-efficient than the native UCS format for very large models)
  • RC 2 for public release

Version 0.1-23:

  • completed vignette with tutorial introduction
  • RC 1 for public release

Version 0.1-22:

  • clean up & complete documentation in preparation for first public release
  • reduce size of example data (verb-noun triples, pre-compiled DSM vectors)

Version 0.1-21:

  • replace readr::read_delim() with iotools::read.delim.raw(), which is slightly faster and leaner; also avoids expensive dependencies of readr such as BH (Boost libraries)
  • implemented work-arounds to support compressed files and different character encodings with iotools
  • package test for file input (triplet and UCS format, different encodings) with suitable sample files in extdata/

Version 0.1-20:

  • new sample data: DSM objects for small illustrative term-term and term-context matrices

Version 0.1-19:

  • complete basic documentation for all functions and data sets
  • data set DSM_VerbNounTriples_DESC removed to reduce package size
  • dsm.projection() now supports power-scaling for SVD-based projection methods
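
Power-scaling in an SVD-based projection can be illustrated with base R alone. This is a minimal sketch on an invented toy matrix, not the package's implementation: the variable names (M, n, power) are assumptions made for illustration, and the exact argument name used by dsm.projection() should be checked in the reference manual. The idea is to replace the usual latent coordinates U * Sigma by U * Sigma^power, so that power = 1 leaves the projection unchanged and smaller exponents reduce the influence of the first few dimensions.

```r
# Toy data (invented): 20 row vectors in 8 dimensions
set.seed(42)
M <- matrix(rnorm(20 * 8), nrow = 20)

n <- 3        # number of latent dimensions to keep
power <- 0.5  # hypothetical scaling exponent for the singular values

dec <- svd(M, nu = n, nv = n)   # truncated factors of M = U Sigma V^T
sigma <- dec$d[1:n]             # leading singular values

proj_plain  <- dec$u %*% diag(sigma, nrow = n)        # standard projection U * Sigma
proj_scaled <- dec$u %*% diag(sigma^power, nrow = n)  # power-scaled  U * Sigma^power

# Squared column norms show the effect: sigma^2 vs. sigma^(2 * power)
rbind(plain = colSums(proj_plain^2), scaled = colSums(proj_scaled^2))
```

With power = 0 every latent dimension would get the same weight, which is one common reason for experimenting with this parameter.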

Version 0.1-18:

  • efficient truncated SVD of sparse matrix using SVDLIBC code from 'sparsesvd' package
  • faster reading of triplet files with 'readr' package (though not very memory-efficient)

Version 0.1-17:

  • Minkowski distance and length measures generalized to 0 <= p < 1 (but not homogeneous for p < 1, hence not a proper mathematical norm)
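
The remark about homogeneity can be checked numerically. The sketch below uses a hypothetical helper, minkowski_length(), which is not part of wordspace. It assumes that for p < 1 the 1/p root is dropped (i.e. the length is the plain sum of |x_i|^p), which is consistent with the loss of homogeneity noted above but is not spelled out in this changelog.

```r
# Hypothetical helper, not part of wordspace.
# Assumption: for 0 < p < 1 the 1/p root is dropped.
minkowski_length <- function(x, p) {
  stopifnot(p > 0)
  s <- sum(abs(x)^p)
  if (p >= 1) s^(1 / p) else s
}

x <- c(3, -4, 0)

# A norm is homogeneous: doubling the vector doubles its length
ratio_p2  <- minkowski_length(2 * x, p = 2)   / minkowski_length(x, p = 2)    # 2
# For p = 0.5 the length only grows by a factor of 2^0.5
ratio_p05 <- minkowski_length(2 * x, p = 0.5) / minkowski_length(x, p = 0.5)  # sqrt(2)

c(p2 = ratio_p2, p05 = ratio_p05)
```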

Version 0.1-16:

  • plot() method for dist.matrix for easy visualization of neighbourhood graphs
  • head() methods to extract top left corner of DSM object (dsm) or distance matrix (dist.matrix)
  • print() method for DSM objects, so users don't accidentally print a large co-occurrence matrix
  • new sample data set: DSM_Vectors with 100-dimensional pre-compiled representations for selected words
  • new sample data: typical singular values from term-context matrix
  • new sample data: "goods" example illustrating dimensionality reduction based on correlations

Version 0.1-14:

  • new evaluation task: SemCorWSD (preliminary version)
  • CITATION entry with official reference (Evert 2014)
  • enhanced functionality in nearest.neighbours(): support for cross-distance setting, targets can be given as vectors or by name, neighbour search in pre-computed distance or similarity matrix, optionally return distance matrix for target and its neighbours

Version 0.1-13:

  • Rcpp implementation of scaleMargins() further reduces memory overhead (with in-place operation for internal use)
  • as.dsm() method converts term-document and document-term matrices from tm package into DSM objects
  • added support functions for evaluation of DSMs in standard tasks (multiple choice, similarity correlation and clustering)
  • new sample data sets: tables of verb-noun cooccurrences from BNC and DESC corpora
  • new evaluation tasks: RG65, WordSim353, ESSLLI08_Nouns

Version 0.1-10:

  • use Rcpp instead of deprecated .C() native code interface
  • for performance reasons, .C() was used with DUP=FALSE, which is no longer allowed as of R 3.1.0
  • in addition, some package tests for dsm.score(), dist.matrix() and dsm.projection() were added
  • the package now depends on Rcpp (>= 0.11.0) and R (>= 3.0.0)

Version 0.1:

  • partial re-design of DSM objects and basic functions
  • handling of sparse and non-negative co-occurrence matrices has been re-thought
  • not fully compatible with v0.0 series (but basic usage should not be affected)

Version 0.0-25:

  • randomized SVD available as separate function rsvd()
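
For readers unfamiliar with the technique, the general idea of a randomized SVD can be sketched in base R. The function rsvd_sketch() and its arguments below are hypothetical and written for illustration only; the package's rsvd() may differ in its interface and details. The sketch projects the matrix onto a small random subspace, sharpens that subspace with a few power iterations, and then computes an ordinary SVD of the much smaller projected matrix.

```r
# Hypothetical illustration of randomized SVD; not the wordspace implementation.
rsvd_sketch <- function(M, n, oversampling = 2, q = 2) {
  k <- min(dim(M), ceiling(oversampling * n))   # size of the random subspace
  # Random range finder: orthonormal basis for M %*% (random matrix)
  Q <- qr.Q(qr(M %*% matrix(rnorm(ncol(M) * k), ncol = k)))
  # Power iterations improve the basis when singular values decay slowly
  for (i in seq_len(q)) {
    Q <- qr.Q(qr(crossprod(M, Q)))   # orthonormalize t(M) %*% Q
    Q <- qr.Q(qr(M %*% Q))
  }
  B <- crossprod(Q, M)               # small k x ncol(M) matrix
  dec <- svd(B, nu = n, nv = n)
  list(u = Q %*% dec$u, d = dec$d[1:n], v = dec$v)
}

# Toy example (invented data): a 30 x 12 matrix of exact rank 3
set.seed(7)
A <- matrix(rnorm(30 * 3), 30) %*% matrix(rnorm(3 * 12), 3)
res <- rsvd_sketch(A, n = 3)

# Compare with the leading singular values from the exact SVD
rbind(randomized = res$d, exact = svd(A)$d[1:3])
```

The approximation is only close to exact when the matrix is well captured by its first n singular dimensions, as in this constructed low-rank example.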

Version 0.0-24:

  • OpenMP no longer activated by default
  • wordspace.openmp() to check for OpenMP support and select the number of parallel threads

Version 0.0-23:

  • further performance improvements
  • dist.matrix() uses less memory and is considerably faster for cosine/angle distance measure
  • new function pair.distances() computes distances or neighbour ranks for a list of word pairs efficiently
  • nearest.neighbours() automatically processes a long list of lookup terms in moderately sized batches
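
The motivation for batching can be shown with a small base-R sketch. Everything here is invented for illustration (toy matrix, term names, batch size) and it does not call or reproduce nearest.neighbours(): the point is only that each batch needs a distance block of size length(batch) x nrow(M) instead of the full nrow(M) x nrow(M) matrix.

```r
# Toy data (invented): 6 named row vectors in 4 dimensions
set.seed(1)
M <- matrix(rnorm(6 * 4), nrow = 6, dimnames = list(paste0("w", 1:6), NULL))
terms <- rownames(M)
batch_size <- 4   # hypothetical value; the package's batch size is not stated here

# Split the lookup terms into consecutive batches
batches <- split(terms, ceiling(seq_along(terms) / batch_size))

nearest <- unlist(lapply(batches, function(batch) {
  X <- M[batch, , drop = FALSE]
  # Euclidean distances between the batch rows and all rows of M
  D2 <- outer(rowSums(X^2), rowSums(M^2), "+") - 2 * tcrossprod(X, M)
  D <- sqrt(pmax(D2, 0))
  # A term must not be returned as its own nearest neighbour
  D[cbind(seq_along(batch), match(batch, rownames(M)))] <- Inf
  rownames(M)[apply(D, 1, which.min)]
}), use.names = FALSE)
names(nearest) <- terms

nearest
```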

Version 0.0-21:

  • experimental support for OpenMP on appropriate platforms
  • not available on Mac OS X in the default R installation (but achieves a speed-up if expressly activated)
  • parallelization only used if more than 100 M operations have to be carried out (purely heuristic limit)
  • first experiments suggest that using more than 4 or 8 threads brings little benefit but enormous overhead
  • setting OMP_NUM_THREADS is strongly recommended but may also affect BLAS matrix operations (e.g. with OpenBLAS)



0.2-0 by Stefan Evert, 6 months ago


Browse source code at https://github.com/cran/wordspace

Authors: Stefan Evert [http://www.stefan-evert.de/]


GPL-3 license

Imports Rcpp, sparsesvd, iotools, methods, stats, utils, graphics, grDevices, cluster, MASS

Depends on Matrix

Suggests knitr, rmarkdown, tm

Linking to Rcpp
