Distance based bipartite matching using the RELAX-IV minimum cost flow solver, oriented to matching of treatment and control groups in observational studies. Routines are provided to generate distances from generalised linear models (propensity score matching), formulas giving variables on which to limit matched distances, stratified or exact matching directives, or calipers, alone or in combination.
The optmatch
package implements the optimal full matching algorithm for
bipartite matching problems. Given a matrix describing the distances between
two groups (where one group is represented by row entries, and the other by
column entries), the algorithm finds a matching between units that
minimizes the average within grouped distances. This algorithm is a popular
choice for covariate balancing applications (e.g. propensity score matching),
but it also can be useful for design stage applications such as blocking. For
more on the application and its implementation, see:
Hansen, B.B. and Klopfer, S.O. (2006) Optimal full matching and
related designs via network flows, JCGS 15 609-627.
optmatch
is available on CRAN:
> library("optmatch")
In addition to the optimal full matching algorithm, the package contains useful functions for generating distance specifications, combining and editing distance specifications, and summarizing and displaying matches. This walk through shows how to use these tools in your matching workflow.
Before we start, let's generate some simulated data. We will have two groups,
the "treated" and "control" groups. Without our knowledge, nature assigned
units from a pool into one of these two groups. The probability of being a
treated unit depends on some covariates. In the vector Z
, let a 1 denote
treated units and 0 denote control units
set.seed(20120111) # set this to get the exact same answers as I do
n <- 26 # chosen so we can divide the alphabet in half
W <- data.frame(w1 = rbeta(n, 4, 2), w2 = rbinom(n, 1, p = .33))
# nature assigns to treatment
tmp <- numeric(n)
tmp[sample(1:n, prob = W$w1^(1 + W$w2), size = n/2)] <- 1
W$z <- tmp
# for convenience, let's give the treated units capital letter names
tmp <- character(n)
tmp[W$z == 1] <- LETTERS[1:(n/2)]
tmp[W$z == 0] <- letters[(26 - n/2 + 1):26]
rownames(W) <- tmp
As we can see with a simple table and plot, these groups are not balanced on the covariates, as they would be (in expectation) with a randomly assigned treatment.
table(W$w2, W$z)
library(lattice) ; densityplot(W$w1, groups = W$z)
The next steps use the covariates to pair up similar treated and control
units. For more on assessing the amount and severity of imbalance between
groups on observed covariates, see the
RItools R
package.
These two groups are different, but how different are individual treated units
from individual control units? In answering this question, we will produce
several distance specifications: matrices of treated units (rows) by control
units (columns) with entries denoting distances. optmatch
provides several
ways of generating these matrices so that you don't have to do it by hand.
Let's begin with a simple Euclidean distance on the space defined by W
:
distances <- list()
distances$euclid <- match_on(z ~ w1 + w2, data = W, method = "euclidean")
The method
argument tells the match_on
function how to compute the
distances over the space defined by the formula. The default method extends the
simple Euclidean distance by rescaling the distances by the covariance of the
variables, the Mahalanobis
distance:
distances$mahal <- match_on(z ~ w1 + w2, data = W)
You can write additional distance computation functions. See the documentation
for match_on
for more details on how to create these functions.
To create distances, we could also try regressing the treatment indicator on
the covariates and computing the difference distance for each treated and
control pair. To make this process easier, match_on
has methods for glm
objects (and for big data problems, bigglm
objects):
propensity.model <- glm(z ~ w1 + w2, data = W, family =
binomial())
distances$propensity <- match_on(propensity.model)
The glm
method is a wrapper around the numeric
method for match_on
. The
numeric
method takes a vector of scores (for example, the linear prediction
for each unit from the model) and a vector indicating treatment status (z
)
for each unit. This method returns the absolute difference between each
treated and control pair on their scores (additionally,
the glm
method rescales the data before invoking the numeric
method). If
you wish to fit a "caliper" to your distance matrix, a hard limit on allowed
distances between treated and control units, you can pass a caliper
argument, a scalar numeric value. Any treated and control pair that is larger
than the caliper value will be replaced by Inf
, an unmatchable value. The
caliper
argument also applies to glm
method. Calipers are covered in more
detail in the next section.
The final convenience method of match_on
is using an arbitrary function. This
function is probably most useful for advanced users of optmatch
. See the
documentation of the match_on
function for more details on how to write your
own arbitrary computation functions.
We have created several representations of the matching problem, using Euclidean distance, Mahalanobis distance, the estimated propensity score, and an arbitrary function. We can combine these distances into single metric using standard arithmetic functions:
distances$all <- with(distances, euclid + mahal + propensity)
You may find it convenient to work in smaller pieces at first and then stitch
the results together into a bigger distance. The rbind
and cbind
functions let us
add additional treated and control entries to a distance specification for
each of the existing control and treated units, respectively. For example, we
might want to combine a Mahalanobis score for units n
through s
with a
propensity score for units t
through z
:
W.n.to.s <- W[c(LETTERS[1:13], letters[14:19]),]
W.t.to.z <- W[c(LETTERS[1:13], letters[20:26]),]
mahal.n.to.s <- match_on(z ~ w1 + w2, data = W.n.to.s)
ps.t.to.z <- match_on(glm(z ~ w1 + w2, data = W.t.to.z, family = binomial()))
distances$combined <- cbind(mahal.n.to.s, ps.t.to.z)
The exactMatch
function creates "stratified" matching problems, in which
there are subgroups that are completely separate. Such matching problems are
often much easier to solve than problems where a treated unit could be
connected to any control unit.
There is another method for creating reduced matching problems. The caliper
function compares each entry in an existing distance specification and
disallows any that are larger than a specified value. For example, we can trim
our previous combined distance to anything smaller than the median value:
distances$median.caliper <- caliper(distances$all, median(distances$all))
distances$all.trimmed <- with(distances, all + median.caliper)
Like the exactMatch
function, the results of caliper
used the sparse
matrix representation mentioned above, so can be very efficient for large,
sparse problems. As noted previously, if using the glm
or numeric
methods
of match_on
, passing the caliper's width in the caliper
argument can be more
efficient.
In addition to the space advantages of only storing the finite entries in a
sparse matrix, the results of exactMatch
and caliper
can be used to speed
up computation of new distances. The match_on
function that we saw earlier
has an argument called within
that helps filter the resulting
computation to only the finite entries in the within
matrix. Since exactMatch
and caliper
use finite entries denote valid pairs, they make excellent sources of
the within
argument.
Instead of creating the entire Euclidean distance matrix and then filtering
out cross-strata matches, we use the results of exactMatch
to compute only
the interesting cases:
tmp <- exactMatch(z ~ w2, data = W)
distances$exact <- match_on(z ~ w1, data = W, within = tmp)
Users of previous versions of optmatch
may notice that the within
argument is similar to the old structure.formula
argument. Like
within
, structure.formula
focused distance on within strata pairs.
Unlike structure.formula
, the within
argument allows using any
distance specification as an argument, including those created with caliper
. For
example, here is the Mahalanobis distance computed only for units that differ
by less than one on the propensity score.
distances$mahal.trimmed <- match_on(z ~ w1 + w2, data = W,
within = match_on(propensity.model, caliper = 1))
Now that we have generated several distances specifications, let's put them to use. Here is the simplest way to evaluate all distances specifications:
matches <- lapply(distances, function(x) { fullmatch(x, data = W) })
The result of the matching process is a named factor, where the names
correspond to the units (both treated and control) and the levels of the
factors are the matched groups. Including the data
argument is highly
recommended. This argument will make sure that the result of fullmatch
will
be in the same order as the original data.frame
that was used to build the
distance specification. This will make appending the results of fullmatch
on to the original data.frame
much more convenient.
The fullmatch
function as several arguments for fine tuning the allowed
ratio of treatment to control units in a match, and how much of the pool to
throw away as unmatchable. One common pattern for these arguments are pairs:
one treated to one control unit. Not every distance specification is amendable
to this pattern (e.g. when there are more treated units than control units in
exactMatch
created stratum). However, it can be done with the Mahalanobis
distance matrix we created earlier:
mahal.match <- pairmatch(distances$mahal, data = W)
Like fullmatch
, pairmatch
also allows fine tuning the ratio of matches to
allow larger groupings. It is can be helpful as it computes what percentage of
the group to throw away, giving better odds of successfully finding a matching
solution.
Once one has generated a match, you may wish to view the results. The results
of calls to fullmatch
or pairmatch
produce optmatch
objects (specialized
factors). This object has a special option to the print
method which groups
the units by factor level:
print(mahal.match, grouped = T)
If you wish to join the match factor back to the original data.frame
:
W.matched <- cbind(W, matches = mahal.match)
Make sure to include the data
argument to fullmatch
or pairmatch
,
otherwise results are not guaranteed to be in the same order as your original
data.frame
or matrix
.
This section will help you get the latest development version of optmatch
and
start using the latest features. Before starting, you should know which branch
you wish to install. Currently, the "master" branch is the main code base.
Additional features are added in their own branches. A list of branches is
available at (the optmatch project
page)[https://github.com/markmfredrickson/optmatch].
You must have the Fortran extensions for package building included. These can be had from CRAN: OS X, Windows.
We recommend using dev_mode
from the devtools
package to install
in-development version of the package so that you can keep the current CRAN
version as the primary package. Activating dev_mode
creates a secondary
library of packages which can only be accessed while in dev_mode
. Packages
normally installed can still be used, but if different versions are installed
normally and in dev_mode
, the dev_mode
version takes precedent if in
dev_mode
.
Install and load the devtools
package:
> install.packages("devtools")
> library("devtools")
Activate dev_mode
:
> dev_mode()
d>
Note that the prompt changes from >
to d>
to let you know you're in
dev_mode
. Now choose the development branch you want to use. To install
master
:
d> install_github("markmfredrickson/optmatch")
Either way, the package is then loaded in the usual fashion, provided you're
still in dev_mode
:
d> library(optmatch)
Once you've done this you can disable dev_mode
as follows
d> dev_mode()
>
The development version of the package remains loaded.
Note that if you load the package -- ie, enter library(optmatch)
(when the
package hasn't already been loaded otherwise) -- while not in dev_mode
, then
you'll get whatever version of the package may be installed in your library
tree, not this development version.
If you want to switch between versions of RItools
, we suggest re-starting R.
You may use RStudio to develop for Optmatch, by opening the optmatch.Rproj
file.
We suggest you ensure all required dependencies are installed by running
devtools::install_deps(dependencies = TRUE)
We prefer changes that include unit tests demonstrating the problem or showing
how the new feature should be added. The test suite uses the
testthat package to write and run tests.
(Please ensure you have the latest version of testthat (or at least v0.11.0),
as older versions stored the tests in a different directory, and may not
test properly.) See the tests/testthat
directory for examples. You can run
the test suite via Build -> Test Package.
New features should include inline Roxygen documentation.
You can generate all .Rd
documents from the Roxygen
code using Build ->
Document.
Finally, you can use Build -> Build and Reload or Build -> Clean and Rebuild to
load an updated version of optmatch
in your current RStudio session.
Alternatively, to install the developed version permanently, use Build -> Build
Binary Version, followed by
install.packages("../optmatch_VERSION.tgz", repo=NULL)
You can revert back to the current CRAN version by
remove.packages("optmatch")install.packages("optmatch")
If you prefer not to use RStudio, you can develop using Make.
make test
: Run the full test suite.make document
: Update all documentation from Roxygen inline comments.make interactive
: Start up an interactive session with optmatch
loaded.make check
: Run R CMD check
on the packagemake build
: Build a binary package.make vignette
: Builds any vignettes in vignettes/
directorymake clean
: Removes files built by make vignette
, make document
or make check
.
Should not be generally necessary, but can be useful for debugging.When your change is ready, make a pull request on github.
CHANGES IN OPTMATCH VERSION 0.9-10
CHANGES IN OPTMATCH VERSION 0.9-9
summary.optmatch
.if(vectorOfThings)
usage that will give an error in upcoming R release.CHANGES IN OPTMATCH VERSION 0.9-8
controls
times the number of treatments, it now attempts to match in that
stratum by leaving out some of the treatment units. (#116)treatment_new = treatment == "T"
.data
argument is excluded from fullmatch
or pairmatch
and num_NA
> 0
entries in the treatment status vector are NA, then the length of the vector
produced by fullmatch
or pairmatch
won't match the length of the treatment status
vector, having num_NA
fewer observations. Don't forget to pass a data
argument!CHANGES IN OPTMATCH VERSION 0.9-7
CHANGES IN OPTMATCH VERSION 0.9-6
summary
methods for InfinitySparseMatrix, BlockedInfinitySparseMatrix
and DenseMatrix. I.e., you can call summary
on the result of a call to
match_on
or caliper
. The information this returns may be useful for
selecting caliper widths, and for managing computational burdens with large
matching problems.pairmatch()
,
fullmatch()
or match_on
, then the factor "fac" will both serve as
an independent variable for the propensity model and an exact matching
variable (#101). See the examples on the help documentation for fullmatch
.pairmatch
and fullmatch
no longer generate "matched.distances" attributes
for their results. To get this information, use matched.distances
.fill.NAs
directly to glm
or
similar. Use the traditional formula and data
argument version. See help
documentation for fill.NAs
for examples.boxplot
method for fitted propensities ignoring varwidth argument (#113); various
minor issues affecting package development and deployment (#110,...).CHANGES IN OPTMATCH VERSION 0.9-5
stratumStructure
.CHANGES IN OPTMATCH VERSION 0.9-4
contr.match_on
, a new default contrasts function for
making Mahalanobis and Euclidean distances. Previously we used R
defaults, which (a) generated different answers for the same factor
depending on the ordering of the levels and (b) led to different
distances for {0,1}-valued numeric variables and two level
factors. (#80)fullmatch
with feasible combinations of
min.controls
, mean.controls
/max.controls
and max.controls
(#92)CHANGES IN OPTMATCH VERSION 0.9-3
fullmatch
or pairmatch
to create distance specifications
directly.glm
method for match_on
that caused
observations with fixable NAs to be dropped too often.distUnion
allows combining arbitrary distance
specifications.antiExactMatch
provides for matches that may only
occur between treated and control units with different values on a
factor variable. This is the opposite of exactMatch
, which ensures
matches occur within factor levels.data
argument in more cases when using the summary
method when the RItools package is present.CHANGES IN OPTMATCH VERSION 0.9-2
omit.fraction
argument
when there are unmatched controls.minExactMatch
function.fullmatch
.caliper
function that allows returning values
that fit the caliper instead of just indicators of which entries fit
the caliper width.match_on
.CHANGES IN OPTMATCH VERSION 0.9-1
CHANGES IN OPTMATCH VERSION 0.9-0
NEW FEATURES
Solver limits now depend on machine limits, not arbitrary constants defined by the optmatch maintainers. For large problems, users will see a warning, but the solver will attempt to solve.
fullmatch() and pairmatch() can now take distance generating arguments directly, instead of having to first call match_on(). See the documentation for these two functions for more details.
Infeasibility recovery in fullmatch(). When passing a combination of constraints (e.g. max.controls) that would make the matching infeasible, fullmatch() will now attempt to find a feasible match that respects those constraints, which will likely result in omitting some controls units.
An additional argument to fullmatch(), mean.controls, is an alternative to the previous omit.fraction. (Only one of the two arguments can be presented.) The match will attempt to average mean.controls number of controls per treatment.
Each optmatch object now carries with it the constraints used to generate it (e.g. max.controls) as well as a hashed version of the distance it matched up, to help with some debugging/error checking but avoiding having to carry the entire distance matrix around.
Creating a distance matrix prior to matching is now optional. fullmatch() now accepts arguments from which match_on() would create a distance, and create the match behind the scenes.
Performance enhancements for distance calculations.
Several new utility functions, including subdim(), optmatch_restrictions(), optmatch_same_distance(), num_eligible_matches(). See their help documentation for additional details.
Arithmetic operations between InfinitySparseMatrices and vectors are supported. The operation is carried out as column by vector steps.
scores() function allows including model predictions (such as propensity scores) in formulas directly (such as combining multiple propensity scores). The scores() function is preferred to predict() as it makes several smart choices to avoid dropping observations due to partial missingness and other useful preparations for matching.
BUG FIXES
match_on is now a S3 generic function, which solves several bugs using propensity models from other packages.
summary() method was giving overly pessimistic warnings about failures.
fixed bug in how optmatch objects were printing.
DEPRECATED AND DEFUNCT
CHANGES IN OPTMATCH VERSION 0.8-3
CHANGES IN OPTMATCH VERSION 0.8-2
full() and pair() are now aliases to fullmatch() and pairmatch()
All match_on() methods take caliper
arguments (formerly just the numeric
method and derived methods had this argument).
boxplot methods for fitted propensity score methods (glm and bigglm)
fill.NAs now takes contrasts.arg
argument to mimic model.matrix()
Several bug fixes in examples, documentation
The methods pscore.dist() and mahal.dist() are now deprecated, with useful error messages pointing users to replacements.
Significant performance improvements for sparse matching problems.
Functions umatched() and matched() were backwards. Corrected.
CHANGES IN OPTMATCH VERSION 0.8-1
CHANGES IN OPTMATCH VERSION 0.8-0
NEW FEATURES
More efficient data structure for sparse matching problems, those with
relatively few allowed (finite) distances between units. Sparse problems
often arise when calipers are employed. The new data structure
(InfinitySparseMatrix
) behaves like a simple matrix, allowing cbind
,
rbind
, and subset
operations, making it easier to work with the older
optmatch.dlist
data structure.
match_on: A series of methods to generate matching problems using the new
data structure when appropriate, or using a standard matrix when the problem
is dense. This function is being deployed along side the mdist
function to
provide complete backward compatibility. New development will focus on this
function for distance creation, and users are encouraged to use it right
away. One difference for mdist
users is the within
argument. This
argument takes an existing distance specification and limits the new
comparisons to only those pairs that have finite distances in the within
argument. See the match_on
, exactMatch
, and caliper
documentation for
more details.
exactMatch: A new function to create stratified matching problems (in which
cross strata matches are forbidden). Users can specify the strata using
either a factor vector or a convenient formula interface. The results can be
used in calls match_on
to limit distance calculations to only with-in
strata treatment-control pairs.
New data
argument to fullmatch
and pairmatch
: This argument will set
the order of the match to that of the row.names
, names
, or contents of
the passed data.frame
or vector
. This avoids potential bugs caused when
the optmatch
objects were in a different order than users' data.
Test suite expanded and now uses the testthat library.
fill.NAs allows (optionally) filling in all columns (previously, the first column was assumed to be an outcome or treatment indicator and was not filled in).
New tools to find minimum feasible constraints: Large matching problems could
exceed the upper limit for a matching problem. The functions minExactmatch
and maxCaliper
find the smallest interaction of potential factors for
stratified matchings or the largest (most generous) caliper, respectively,
that make the problem small enough to fit under the maximum problem size
limit. See the help pages for these functions for more information.
BUG FIXES
fullmatch
to
other functions.FOR A DETAILED CHANGELOG, SEE https://github.com/markmfredrickson/optmatch
CHANGES IN OPTMATCH VERSION 0.7-1
NEW FEATURES
pairmatch() has a new option, "remove.unmatchables," that may be useful in conjunction with caliper matching. With "remove.unmatchables=TRUE", prior to matching any units with no counterparts within caliper distance are removed. Pair matching can still fail, if for example for two distinct treatment units only a single control, the same one, is available for matching to them; but remove.unmatchables eliminates one simple and common reason for pair matching to fail.
Applying summary() to an optmatch object now creates a "summary.optmatch" containing the summary information, in addition to reporting it to the console (via a summary.optmatch method for print() ).
mdist.formula() no longer requires an explicit data argument. I.e., you can get away with a call like "mdist(Treat~X1+X2|S)" if the variables Treat, X1, X2 and S are available in the environment you're working from (or in one of its parent environments). Previously you would have had to do "mdist(Treat~X1+X2|S, data=mydata)". (The latter formulation is still to be preferred, however, in part because with it mdist() gets to use data's row names, whereas otherwise it would have to make up row names.)
CHANGES IN OPTMATCH VERSION 0.7
NEW FEATURES
BUG FIXES
mdist.function method now properly returns an optmatch.dlist object for use in summary.optmatch, etc.
mdist.function maintains label on grouping factor.
CHANGES IN OPTMATCH VERSION 0.6-1
NEW FEATURES
New mdist method to extract propensity scores from models fitted using bigglm in package "biglm".
mdist's formula method now understands grouping factors indicated with a pipe ("|")
informative error message for mdist called on numeric vectors
updated mdist documentation
CHANGES IN OPTMATCH VERSION 0.6
NEW FEATURES
There is a new generic function, mdist(), for creating matching distances. It accepts: fitted glm's, which it uses to extract propensity distances; formulas, which it uses to construct squared Mahalanobis distances; and functions, with which a user can construct his or her own type of distance. The function method is more intuitive to work with than the older makedist() function.
A new function, caliper(), builds on the mdist() structure to provide a convenient way to add calipers to a distance. In contrast to earlier ways of adding calipers, caliper() has an optional argument specify observations to be excluded from the caliper requirement --- this permits one to relax it for just a few observations, for instance.
summary.optmatch() now removes strata in which matching failed (b/c the matching problem was found to be infeasible) before summarizing. It also indicates when such strata are present, and how many observations fall in them.
Demo has been updated to reflect changes as of version 0.4, 0.5, 0.6.
DEPRECATED & DEFUNCT
BUG FIXES
subsetting of objects of class optmatch now preserves matched.distances attribute.
fixed bug in maxControlsCap/minControlsCap whereby they behaved unreliably on subclasses within which some subjects had no permissible matches.
Removed unnecessary panic in fullmatch when it was given a min.controls argument with attributes other than names (as when it is created by tapply()).
fixed bug wherein summary.optmatch fails to retrieve balance tests if given a propensity model that had function calls in its formula.
Documentation pages for fullmatch, pairmatch filled out a bit.
CHANGES IN OPTMATCH VERSION 0.5
NEW FEATURES: