These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool OpenRefine <http://openrefine.org/>. More info on key collision and ngram fingerprint can be found at <https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth>.
refinr is designed to cluster and merge similar values within a character vector. It features two functions that are implementations of clustering algorithms from the open source software OpenRefine. The cluster methods used are key collision and ngram fingerprint (more info on these at https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth).
In addition, a few add-on features are included to make the clustering/merging functions more useful. These include approximate string matching to allow for merging despite minor misspellings, the option to pass a dictionary vector to dictate edit values, and the option to pass a vector of strings to ignore during the clustering process.
Please report issues, comments, or feature requests.
Install from CRAN:
Or install the dev version from this repo:
x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
key_collision_merge(x)
#> [1] "Acme Pizza, Inc." "Acme Pizza, Inc." "Acme Pizza, Inc." "Acme Pizza, Inc."
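To give a feel for why values like these end up in one cluster, here is a minimal base R sketch of the key collision ("fingerprint") idea as described in the OpenRefine clustering documentation. The function name `fingerprint_key` is hypothetical and this is not refinr's internal code; in particular, refinr's handling of business suffixes is not reproduced here.

```r
# Hypothetical helper: a simplified fingerprint key (not refinr's implementation).
fingerprint_key <- function(x) {
  x <- tolower(trimws(x))               # normalise case and outer whitespace
  x <- gsub("[[:punct:]]", "", x)       # drop punctuation
  tokens <- strsplit(x, "\\s+")         # split into whitespace-separated tokens
  vapply(tokens, function(tok) {
    paste(sort(unique(tok)), collapse = " ")  # sort and de-duplicate tokens
  }, character(1))
}

x <- c("Acme Pizza, Inc.", "inc. acme PIZZA", "ACME pizza inc")
keys <- fingerprint_key(x)
keys

# Values that share a key fall into the same cluster.
split(x, keys)
```

All three inputs reduce to the same key, so they would be treated as one cluster and merged to a single value.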
A dictionary character vector can be passed to key_collision_merge, which will dictate merge values when a cluster has a match within the dict vector.
x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
key_collision_merge(x, dict = c("Nicks Pizza", "acme PIZZA inc"))
#> [1] "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc"
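The dictionary behaviour can be pictured roughly as follows. This is only a sketch under the assumption that dictionary entries are compared to input values through the same kind of normalised key; `simple_key` and `apply_dict` are hypothetical names, and the real function may resolve matches differently.

```r
# Hypothetical helpers illustrating dictionary-driven merge values.
simple_key <- function(x) {
  tokens <- strsplit(tolower(gsub("[[:punct:]]", "", trimws(x))), "\\s+")
  vapply(tokens, function(tok) paste(sort(unique(tok)), collapse = " "),
         character(1))
}

apply_dict <- function(x, dict) {
  hit <- match(simple_key(x), simple_key(dict))  # index of matching dict entry
  ifelse(is.na(hit), x, dict[hit])               # keep x where nothing matches
}

res <- apply_dict(
  c("Acme Pizza, Inc.", "ACME PIZZA INC", "Joes Tacos"),
  dict = c("Nicks Pizza", "acme PIZZA inc")
)
res
```

Values whose key matches a dictionary entry take the dictionary spelling; everything else is left unchanged.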
n_gram_merge can be used to merge similar values that contain slight spelling differences. The stringdist package is used for calculating edit distance between strings.
refinr links to the stringdist C API to improve the speed of the functions.
x <- c("Acmme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(x, weight = c(d = 0.2, i = 0.2, s = 1, t = 1))
#> [1] "ACME PIZA COMPANY" "ACME PIZA COMPANY" "ACME PIZA COMPANY"

# The performance of the approximate string matching can be adjusted using parameters
# "weight" and/or "edit_threshold".
n_gram_merge(x, weight = c(d = 1, i = 1, s = 0.1, t = 0.1))
#> [1] "Acme Pizzazza LLC" "ACME PIZA COMPANY" "Acme Pizzazza LLC"
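As a rough illustration of the two ideas involved, the sketch below builds an n-gram fingerprint in base R (following the description in the OpenRefine clustering documentation) and then measures the edit distance between two fingerprints. `ngram_fingerprint` is a hypothetical name, and base R's `utils::adist()` with default unit costs is used here as a stand-in for the weighted distances from the stringdist package; refinr's own internals may differ.

```r
# Hypothetical helper: simplified n-gram fingerprint (not refinr's implementation).
ngram_fingerprint <- function(x, n = 2) {
  x <- gsub("[[:punct:][:space:]]", "", tolower(x))  # strip punctuation and spaces
  vapply(x, function(s) {
    if (nchar(s) < n) return(s)
    starts <- seq_len(nchar(s) - n + 1)
    grams <- substring(s, starts, starts + n - 1)    # all overlapping n-grams
    paste(sort(unique(grams)), collapse = "")        # sort, de-duplicate, join
  }, character(1), USE.NAMES = FALSE)
}

ngram_fingerprint("Paris", n = 2)
#> [1] "arispari"

x <- c("Acmme Pizza", "ACME PIZA")

# With 1-grams the two misspelled values already collide.
fp1 <- ngram_fingerprint(x, n = 1)
fp1[1] == fp1[2]
#> [1] TRUE

# With 2-grams they differ slightly, so an edit distance is needed to relate them.
fp2 <- ngram_fingerprint(x, n = 2)
d <- drop(utils::adist(fp2[1], fp2[2]))
d
```

A small distance between fingerprints suggests the underlying strings are near-duplicates; how small is "small enough" is what a threshold such as edit_threshold controls.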
Both key_collision_merge and n_gram_merge have an optional argument, ignore_strings, which takes a character vector of strings to be ignored during the merging of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"))
#> [1] "BAKERSFIELD high" "BAKERSFIELD high" "BAKERSFIELD high"
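One way to think about ignore_strings is as a token filter applied before values are compared. The base R sketch below assumes exactly that; `drop_tokens` is a hypothetical helper and not part of refinr.

```r
# Hypothetical helper: remove ignored tokens before comparing values.
drop_tokens <- function(x, ignore) {
  x <- gsub("[[:punct:]]", "", tolower(trimws(x)))
  tokens <- strsplit(x, "\\s+")
  vapply(tokens, function(tok) {
    paste(tok[!tok %in% ignore], collapse = " ")  # keep only non-ignored tokens
  }, character(1))
}

x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
stripped <- drop_tokens(x, ignore = c("high", "school", "highschool"))
stripped
#> [1] "bakersfield" "bakersfield" "bakersfield"
```

Once the ignored tokens are gone, the three values are identical, which is why they can be merged into a single cluster.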
The clustering is designed to be insensitive to common business name suffixes, e.g. "inc", "llc", "co", etc. This feature can be turned on/off using function parameter bus_suffix.
library(dplyr)
library(knitr)

x <- c(
  "Clemsson University",
  "university-of-clemson",
  "CLEMSON",
  "Clem son, U.",
  "college, clemson u",
  "M.I.T.",
  "Technology, Massachusetts' Institute of",
  "Massachusetts Inst of Technology",
  "UNIVERSITY: mit"
)

ignores <- c("university", "college", "u", "of", "institute", "inst")

x_refin <- x %>%
  refinr::key_collision_merge(ignore_strings = ignores) %>%
  refinr::n_gram_merge(ignore_strings = ignores)

# Create df for comparing the original values to the edited values.
# This is especially useful for larger input vectors.
# (tibble() replaces the deprecated data_frame().)
inspect_results <- tibble(original_values = x, edited_values = x_refin) %>%
  mutate(equal = original_values == edited_values)

# Display only the values that were edited by refinr.
knitr::kable(
  inspect_results[!inspect_results$equal, c("original_values", "edited_values")]
)
#> |original_values                         |edited_values                    |
#> |:---------------------------------------|:--------------------------------|
#> |Clemsson University                     |CLEMSON                          |
#> |university-of-clemson                   |CLEMSON                          |
#> |Clem son, U.                            |CLEMSON                          |
#> |college, clemson u                      |CLEMSON                          |
#> |Technology, Massachusetts' Institute of |Massachusetts Inst of Technology |
#> |UNIVERSITY: mit                         |M.I.T.                           |
refinr now links to the stringdist C API, calling C functions in place of using stringdist::stringdistmatrix(). This change results in speed improvements in function n_gram_merge(), and requires that stringdist v0.9.5.1 or greater be installed.
In n_gram_merge(), renamed an arg to weight. The only purpose of this arg is to be passed along to function stringdistmatrix from the stringdist package (which uses the name weight, so this change is simply to match that).
Fixed issue in n_gram_merge() in which incorrect values were being returned when input arg ignore_strings was not NULL, and arg bus_suffix = FALSE (#7).
Fixed issue in which input strings containing punctuation that was NOT surrounded by spaces returned incorrect values (#6).
Fixed issue in which the edit value assigned to a cluster was sometimes not the most frequent string in that cluster (#5).
std::unordered_map(), resulting in a substantial speed improvement when passing large character vectors (length 100,000+) to either of the exported functions (#8).
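One of the fixes above concerns the edit value assigned to a cluster being its most frequent string. A minimal base R sketch of that rule, assuming each value already has a cluster key, could look like this; `merge_by_key` is a hypothetical name and ties are broken arbitrarily by `which.max()`, which may not match refinr's behaviour.

```r
# Hypothetical helper: replace each value with the most frequent string
# in its cluster (clusters are identified by `keys`).
merge_by_key <- function(x, keys) {
  most_frequent <- function(v) names(which.max(table(v)))
  reps <- vapply(split(x, keys), most_frequent, character(1))
  unname(reps[keys])  # look up each value's cluster representative
}

x    <- c("Acme Pizza", "ACME PIZZA", "Acme Pizza", "Nicks Pizza")
keys <- c("acme pizza", "acme pizza", "acme pizza", "nicks pizza")
merged <- merge_by_key(x, keys)
merged
#> [1] "Acme Pizza"  "Acme Pizza"  "Acme Pizza"  "Nicks Pizza"
```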