A flexible tool for simulating complex longitudinal data using structural equations, with emphasis on problems in causal inference. Specify interventions and simulate from intervened data generating distributions. Define and evaluate treatment-specific means, the average treatment effects and coefficients from working marginal structural models. User interface designed to facilitate the conduct of transparent and reproducible simulation studies, and allows concise expression of complex functional dependencies for a large number of time-varying nodes. See the package vignette for more information, documentation and examples.
The simcausal
R package is a tool for specification and simulation of complex longitudinal data structures that are based on structural equation models (SEMs). The emphasis is on the types of simulations frequently encountered in causal inference problems, such as, observational data with time-dependent confounding, selection bias, and random monitoring processes. The interface allows for quick expression of dependencies between a large number of time-varying nodes.
To install the CRAN release version of simcausal
:
install.packages('simcausal')
To install the development version (requires the devtools
package):
devtools::install_github('osofr/simcausal', build_vignettes = FALSE)
Once the package is installed, see the vignette, consult the internal package documentation and examples.
vignette("simcausal_vignette", package="simcausal")
?simcausalhelp(package = 'simcausal')
news(package = "simcausal")
Below is an example simulating data with 4 covariates specified by 4 structural equations (nodes). New equations are added by using successive calls to + node()
function and data are simulated by calling sim
function:
library("simcausal")D <- DAG.empty() + node("CVD", distr="rcat.b1", probs = c(0.5, 0.25, 0.25)) + node("A1C", distr="rnorm", mean = 5 + (CVD > 1)*10 + (CVD > 2)*5) + node("TI", distr="rbern", prob = plogis(-0.5 - 0.3*CVD + 0.2*A1C)) + node("Y", distr="rbern", prob = plogis(-3 + 1.2*TI + 0.1*CVD + 0.3*A1C))D <- set.DAG(D)dat <- sim(D,n=200)
To display the above SEM object as a directed acyclic graph:
plotDAG(D)
To allow the above nodes A1C
, TI
and Y
to change over time, for time points t = 0,...,7, and keeping CVD
the same, simply add t
argument to node
function and use the square bracket [...]
vector indexing to reference time-varying nodes inside the node
function expressions:
library("simcausal")D <- DAG.empty() + node("CVD", distr="rcat.b1", probs = c(0.5, 0.25, 0.25)) + node("A1C", t=0, distr="rnorm", mean=5 + (CVD > 1)*10 + (CVD > 2)*5) + node("TI", t=0, distr="rbern", prob=plogis(-5 - 0.3*CVD + 0.5*A1C[t])) + node("A1C", t=1:7, distr="rnorm", mean=-TI[t-1]*10 + 5 + (CVD > 1)*10 + (CVD > 2)*5) + node("TI", t=1:7, distr="rbern", prob=plogis(-5 - 0.3*CVD + 0.5*A1C[t] + 1.5*TI[t-1])) + node("Y", t=0:7, distr="rbern", prob=plogis(-6 - 1.2*TI[t] + 0.1*CVD + 0.3*A1C[t]), EFU=TRUE)D <- set.DAG(D)dat.long <- sim(D,n=200)
The + action
function allows defining counterfactual data under various interventions (e.g., static, dynamic, deterministic, or stochastic), which can be then simulated by calling sim
function. In particular, the interventions may represent exposures to treatment regimens, the occurrence or non-occurrence of right-censoring events, or of clinical monitoring events.
In addition, the functions set.targetE
, set.targetMSM
and eval.target
provide tools for defining and computing a few selected features of the distribution of the counterfactual data that represent common causal quantities of interest, such as, treatment-specific means, the average treatment effects and coefficients from working marginal structural models.
Function network
provies support for networks simulations, in particular it enables defining and simulating SEM for dependent data. For example, a network sampling function like rnet.gnm
(provided by the package, see ?rnet.gnm
) can be used to specify and simulate dependent data from a network-based SEM. Start defining a SEM that uses the this network, with a +network
syntax and providing "rnet.gnm
" as a "netfun
" argument to network
function:
library("simcausal")library("magrittr")D <- DAG.empty() + network("ER.net", netfun = "rnet.gnm", m_pn = 50)
First define two IDD nodes W1
(categorical) and W2
(Bernoulli):
D <- D + node("W1", distr = "rcat.b1", probs = c(0.0494, 0.1823, 0.2806, 0.2680, 0.1651, 0.0546)) + node("W2", distr = "rbern", prob = plogis(-0.2 + W1/3))
New nodes (structural equations) can now be specified conditional on the past node values of observations connected to each unit i
(friends of i
). The friends are defined by the network matrix that is returned by the above network generator rnet.gnm
. Double square bracket syntax "[[...]]
" allows referencing the node values of connected friends. Two special variables, "Kmax
" and "nF
" can be used along-side indexing "[[...]]
". Kmax
defines the maximal number of friends (maximal friend index) for all observation. When kth
friend referenced in "Var[[k]]
" doesn't exist, the default is to set that value to "NA
". Adding the argument "replaceNAw0=TRUE
" to node
function changes such values from NA
to 0
. nF
is another special variable, which is a vector of length n
and each nF[i]
is equal to the current number of friends for unit i
. Any kind of summary function that can be applied to multiple time-varying nodes can be similarly applied to network-indexed nodes. For additional details, see the package documentation for the network function (?network
) and the package vignette on conducting network simulations.
Define network variable "netW1
" as the W1
values of the first friend and define binary exposure "A
" so that probability of success for each unit 'i' for A
is a logit-linear function of:
W1[i]
,W1
values among all friends of i
,W2
among all friends of i
.dat.net <- { D + node("netW1.F1", distr = "rconst", const = W1[[1]]) + node("A", distr = "rbern", prob = plogis(2 + -0.5 * W1 + -0.1 * sum(W1[[1:Kmax]]) + -0.7 * ifelse(nF > 0, sum(W2[[1:Kmax]])/nF, 0)), replaceNAw0 = TRUE)} %>%set.DAG() %>%sim(n=1000)
The simulated data frame returned by sim()
also contains the simulated network object, saved as a separate attribute. The network is saved as an R6
object of class NetIndClass
, under attribute called "netind_cl
". The field "NetInd
" contains the network matrix, the field "Kmax
" contains the maximum number of friends (number of columns in NetInd
) and the field "nF
" contains the vector for total number of friends for each observation (see ?NetIndClass
for more information).
(Kmax <- attributes(dat.net)$netind_cl$Kmax)NetInd_mat <- attributes(dat.net)$netind_cl$NetIndhead(NetInd_mat)nF <- attributes(dat.net)$netind_cl$nFhead(nF)
To cite simcausal
in publications, please use:
The development of this package was partially funded through internal operational funds provided by the Kaiser Permanente Center for Effectiveness & Safety Research (CESR). This work was also partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1403-12506) and an NIH grant (R01 AI074345-07).
This software is distributed under the GPL-2 license.
Development version update.
Added vignette; the node.depr function is no longer supported; updated the documentation for ?simcausal, doLTCF, sim, simobs and simfull, add.action, add.nodes functions; changed examples for node and set.DAG; added .onAttach message;
New CRAN version release.
Adding vignette; improved documentation for ?simcausal; changing the documentation for sim, simobs and simfull functions; updated documentation add.nodes/+node and add.action/+action syntax; for node.depr is removed from the user access.
New CRAN version release.
Major updates:
Addition of network() function that allows simulating networks;
Support for network-based sampling of dependent data. New syntax Variable[[netindx]] allows one to to index Variable values of friends (connections) (assuming a network has been already generated with the network() function);
User calling environment is now captured by each call to node() function; this is used as an enclosing environment when evaluating node() function arguments;
Moving to R6 class node call parser;
Minor updates:
Adding global option options(simcausal.verbose) that can be set to FALSE to suppress message printing; default is TRUE;
Adding network simulation example to /tests/Runit/gen.net.example.R, simulating networks using igraph package; Examples for simulating networks based on igraph ER model;
Adding conversion utility function to/from sparse adjacency network matrix to simcausal internet network storage;
Switching from warning to message when replacing/modifying existing DAG node object;
New CRAN version release.
Fixing a warning on v.0.3.0 for generic "melt" being exported by data.table and reshape2;
Minor updates to the vignette;
Fixing an issue with NetIndClass not calling self$make.nF() when outside network matrix is assigned;
Making nF (vector of the number of friends across observations) available as a special variable that can be used inside node expressions (as distributional parameters);
Major changes since last version 0.4.0.
Added support for time-varying nodes over network: can index friend nodes with syntactic sugar expressions like Node[[F_indx]], even when this Node is a time-varying random variable. For such nodes, start by using the syntactic indexing by t, i.e., Node[t], then add friend indexing with expressions like Node[t][[F_indx]]. See the forthcoming vignette on networks for examples.
Added support for latent variables, set.DAG has a new argument v.latent, any node names specified to this argument will be hidden from the simulated data. This allows one to explicitly define errors (U’s) for each node. This will be also plotted differently by plotDAG.
set.DAG has a new argument n.test, used for changing the default sample size of the simulation test performed by set.DAG (default n.test=100). Set n.test=0 to skip the DAG object test completely.
Node argument EFU can be a logical expression (any function of previously defined node names), which must evaluate to TRUE/FALSE. Right-censoring for a specific observation then occurs only if: 1) The node value evaluated to 1 & 2) EFU evaluates to TRUE
Added support for multivariate nodes, such as multivariate normal, copulas, etc. See a separate section on multivariate distributions in ?node.
A specially reserved variable name "Nsamp" can be used in any node expression, always evaluates to the currently simulated sample size (a column of constant values). Works similarly to the network-specific reserved variables "Kmax" and "nF" (see ?network).
Changed the format of plotDAG output. The internal (endogenous) nodes are no longer surrounded by circles. All latent nodes are surrounded by circles, all children of latent nodes are shown with dashed arrow.
Switched internal storage of simulated data to data.table. The output is always a data.frame.
More consistent control over non-standard evaluation. Can wrap any R expression supplied as the node argument in .() or eval() to skip all of simcausal non-standard evaluation and checks, this will evaluate the expression in the environment of the data + user calling environment. The default R subsetting operators for '[...]' and '[[...]]' will be applied to all expressions wrapped in .() and hence the simcausal operators for time-varying nodes and networks will not longer be applied. See a separate section in ?node and examples 7,8 for details. One can apply the operator .() to only part of the expression, to avoid simcausal parsing on that specific part as shown in Example 8 with .(coefAi[t]) * A[t-1]
.
Switching to new naming convention for categorical distributions (rcat.b0, rcat.b1 and rcat.factor). Old rcategor.int and rcategor are still available and are deprecated.