Record Linkage Functions for Linking and Deduplicating Data Sets

Provides functions for linking and deduplicating data sets. Methods based on a stochastic approach are implemented as well as classification algorithms from the machine learning domain. For details, see our paper "The RecordLinkage Package: Detecting Errors in Data" Sariyar M / Borg A (2010) .


Changes in version 0.4-11 -added missing S3 method exports -fixed broken/non-canonical URLs -fixed some logical expressions to deal with non-scalar values (contributed by Kurt Hornik) Changes in version 0.4-10 -added SQLite header files to be compatible with upcoming RSQLite package, which does not export them anymore Changes in version 0.4-9 -added import declarations to pass R 3.3.0 package checks Changes in version 0.4-8 -fixed import declaration for ffbase imports -translated source code omments -removed non-ASCII characters -added S3method-declarations in NAMESPACE Changes in version 0.4-7 -added unit tests (inst/unitTests) to .Rbuildignore -accelerated example for epiWeights and epiClassify

Changes in version 0.4-6 -updated maintainer, author and contact information -fixed notes "no visible binding" in R >= 2.15.1 -moved packages in "Depends" to "Imports" where appropriate and added import statement in NAMESPACE

Changes in version 0.4-5 -fixed truncated lines in PDF manual

Changes in version 0.4-4 -fixed memory access problem in C code

Changes in version 0.4-3 -added check if bringToTop() is availabe in getParetoThreshold -avoid warning in countpattern

Changes in version 0.4-2 -changed the way keys for data tables are set in order to be compatible to data.table 1.7.8

Changes in version 0.4-1 -Removed printf statements from compiled code to prevent NOTE in R CMD check

Changes in version 0.4-0 The classes for linkage of big data sets have been redesigned. The mechanism of creating comparison patterns on the fly from a database is replaced by saving all comparison patterns in a ffdf object (disk-based data frame, see package ff). This approach limits the amount of comparison patterns with respect to available disk space (the database approach was designed to handle even more patterns, at least in theory), but is more efficient regarding execution speed. In addition, some methods that were only usable with the S3 class RecLinkData, can now be used with RLBigData objects, most notably getMinimalTrain().

For the structure of the redesigned classes, see the documention entries for classes "RLBigData", "RLBigDataDedup" and "RLBigDataLinkage". :wq

Changes in version 0.3-5 -bug fixes: -getExpectedSize failed in rare cases Changes in version 0.3-4 -added summary method for RLResult objects -bug fixes: -summary.RecLinkResult failed for object without known matching status Changes in version 0.3-3 -added new classification method "bumping", see ?trainSupv -bug fixes: - string comparators crashed for character(0) input (reported by Dominik Reusser) - getPairs could fail when the result is emtpy -various performance improvements -renamed vignette "Fellegi-Sunter Deduplication" to "Weight-based deduplication" -package is now in a namespace

Changes in version 0.3-2 - performance improvements in various functions - bug fixes: - epiClassify failed for datasets with only one column (reported by Richard Herron) - getErrorMeasures returned wrong values in rare cases

Changes in version 0.3-1 - fixed bug in emClassify that threw an error during classification with on-the-fly calculation of weights - added checks in unit tests - fixed bug in optimalThreshold which caused wrong results in some cases when ny was supplied - added status messages and progress bars to various classification functions - fixed numerical problem that produced wrong thresholds in some cases for emClassify (with my / ny supplied)

Changes in version 0.3-0

- A framework of S4 methods and classes tailored for processing large
  data sets has been added. See the vignette 'BigData' for details.
- The output format of getPairs (existing method for RecLinkData and
  RecLinkResult objects) has changed: Column "Weight" is now last instead
  of first. Also, for single.rows=FALSE, record pairs are speparated by
  a blank line.

Changes in version 0.2-6

- added subset operator [ for RecLinkData and RecLinkPairs objects
- improved running time of epiWeights significantly

Changes in version 0.2-5

- Fixed a bug in getMinimalTrain which caused the function to fail

Changes in version 0.2-4

- Fixed a bug in getMinimalTrain that caused pairs with unknown
	matching status to become non-matches in the training set.
- Some examples were modified to reduce execution time.
- Unit tests have been added in inst/unitTests.
- Checking of arguments improved in some functions.
- Small documentation changes.

Changes in version 0.2-3

  • getParetoThreshold now prompts a message if no data lies in the interactively selected interval and asks the user to choose again
  • Corrected maintainer e-mail address
  • Several bug fixes

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.


0.4-12.1 by Murat Sariyar, a year ago

Browse source code at

Authors: Murat Sariyar [aut, cre] , Andreas Borg [aut]

Documentation:   PDF Manual  

Task views:

GPL (>= 2) license

Imports e1071, rpart, ada, ipred, stats, evd, methods, data.table, nnet, xtable

Depends on DBI, RSQLite, ff

Suggests RUnit, knitr

Enhanced by SoundexBR.

See at CRAN