Streamlined Estimation of Survival for Static, Dynamic and Stochastic Treatment and Monitoring Regimes

Analysis of longitudinal time-to-event or time-to-failure data. Estimates the counterfactual discrete survival curve under static, dynamic and stochastic interventions on treatment (exposure) and monitoring events over time. Estimators (IPW, MSM-IPW, GCOMP, longitudinal TMLE) adjust for measured time-varying confounding and informative right-censoring. Model fitting can be performed either with GLM or H2O-3 machine learning libraries. The exposure, monitoring and censoring variables can be coded as either binary, categorical or continuous. Each can be multivariate (e.g., can use more than one column of dummy indicators for different censoring events). The input data needs to be in long format.

Streamlined analysis of longitudinal time-to-event or time-to-failure data. Estimates the counterfactual discrete survival curve under static, dynamic and stochastic interventions on treatment (exposure) and monitoring events over time. Estimators (IPW, GCOMP, TMLE) adjust for measured time-varying confounding and informative right-censoring. Model fitting can be performed either with glm or H2O-3machine learning libraries, including Ensemble Learning (SuperLearner).

Currently implemented estimators include:

  • Kaplan-Meier Estimator. No adjustment for time-varying confounding or informative right-censoring.
  • Inverse Probability Weighted (IPW) Kaplan-Meier. Also known as the Adjusted Kaplan Meier (AKME). Also known as the saturated (non-parametric) IPW-MSM estimator of the survival hazard. This estimator inverse weights each observation based on the exposure/censoring model fits (propensity scores).
  • Bounded Inverse Probability Weighted (B-IPW) Estimator of Survival. Estimates the survival directly (without hazard), also based on the exposure/censoring model fit (propensity scores).
  • Inverse Probability Weighted Marginal Structural Model (IPW-MSM) for the hazard function, mapped into survival. Currently only logistic regression is allowed where covariates are time-points and regime/rule indicators. This estimator is also based on the exposure/censoring model fit (propensity scores), but allows additional smoothing over multiple time-points and includes optional weight stabilization.
  • Sequential G-Computation (GCOMP). Also known as the recursive G-Computation formula or Q-learning. Directly estimates the outcome model while adjusting for time-varying confounding. Estimation can be stratified by rule/regime followed or pooled across all rules/regimes.
  • Targeted Maximum Likelihood Estimator (TMLE) for longitudinal data. Also known as the Targeted Minimum Loss-based Estimator. Doubly robust and semi-parametrically efficient estimator that targets the initial outcome model fits (GCOMP) with IPW.
  • Iterative Targeted Maximum Likelihood Estimator (I-TMLE) for longitudinal data. Fits sequential G-Computation and then iteratively performs targeting for all pooled Q's until convergence.

Input data:

  • Time-to-event (possibly) right-censored data has to be in long format.
  • Each row must contain a subject identifier (ID) and the integer indicator of the current time (t), e.g., day, week, month, year.
  • The package assumes that the temporal ordering of covariates in each row is fixed according to (ID, t, L,C,A,N,Y), where
    • L -- Time-varying and baseline covariates.
    • C -- Indicators of right censoring events at time t; this can be either a single categorical or several binary columns.
    • A -- Exposure (treatment) at time t; this can be multivariate (more than one column) and each column can be binary, categorical or continuous.
    • N -- Indicator of being monitored at time point t+1 (binary).
    • Y -- Time-to-event outcome (binary).
  • Note that the follow-up is assumed to end when either the outcome of interest (Y[t]=1) or right-censoring events are observed.
  • Categorical censoring can be useful for representing all of the censoring events with a single column (variable).

Model fitting:

  • Separate models are fit for the observed censoring, exposure and monitoring mechanisms.
  • Each model can be stratified (separate model is fit) by time or any other user-specified stratification criteria. Each strata is defined with by a single logical expression that selects specific observations/rows in the observed data (strata).
  • By default, all models are fit using GLM with binomial family (logistic regression).
  • Alternatively, model fitting can be also performed with any machine learning algorithm implemented in H2O-3 (faster distributed penalized GLM, Random Forest, Gradient Boosting Machines and Deep Neural Network).
  • Finally, one can select the best model from an ensemble of H2O learners via cross-validation. Grid search (h2o.grid) allows for user-friendly model specification and fitting over multi-dimensional parameter space with various stopping criteria (random, discrete, max number of models, max time allocated, etc).
  • The ensemble of many models can be combined into a single (more powerful) model with SuperLearner (h2oEmsemble).


To install the development version (requires the devtools package):

devtools::install_github('osofr/stremr', build_vignettes = FALSE)

For optimal performance, we also recommend installing the development version of data.table:


For modeling with H2O-3 machine learning libraries we recommend directly installing the latest version of the h2o R package (can also see the instructions here):

if ("package:h2o" %in% search()) detach("package:h2o", unload=TRUE)
if ("h2o" %in% rownames(installed.packages())) remove.packages("h2o")
# Next, download H2O package dependencies:
pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
new.pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(new.pkgs)) install.packages(new.pkgs)
# Download and install the H2O package for R:
install.packages("h2o", type="source", repos=(c("")))

For ensemble learning with SuperLearner we recommend installing the latest development version of the h2oEnsemble R package (can also see the instructions here):


Documentation with general overview of the package functions and datasets:


To obtain documentation for specific relevant functions in stremr package:


The following is an example of a function call that produces an automated html report shown below. For a pdf report just set the argument format = "pdf".

  make_report_rmd(OData, NPMSM = list(surv1, surv2), 
                  MSM = MSM.IPAW, 
                  GCOMP = list(gcomp_est1, gcomp_est2), 
                  TMLE = list(tmle_est_par1, tmle_est_par2),
                  AddFUPtables = TRUE, RDtables = get_MSM_RDs(MSM.IPAW, t.periods.RDs = c(12, 15), getSEs = TRUE),
                  WTtables = get_wtsummary(MSM.IPAW$wts_data, cutoffs = c(0, 0.5, 1, 10, 20, 30, 40, 50, 100, 150), by.rule = TRUE),
         = "", title = "Custom Report Title", author = "Author Name", y_legend = 0.99, x_legend = 9.5)

Load the data:

OdataDT <-, key=c(ID, t))

Define some summaries (lags):

OdataDT[, ("N.tminus1") := shift(get("N"), n = 1L, type = "lag", fill = 1L), by = ID]
OdataDT[, ("TI.tminus1") := shift(get("TI"), n = 1L, type = "lag", fill = 1L), by = ID]

Import input data into stremr object DataStorageClass and define relevant covariates:

OData <- importData(OdataDT, ID = "ID", t = "t", covars = c("highA1c", "lastNat1", "N.tminus1"), CENS = "C", TRT = "TI", MONITOR = "N", OUTCOME = "Y.tplus1")

Define counterfactual exposures. In this example we define one intervention as always treated and another as never treated. Such intervention can be defined conditionally on other variables (dynamic intervention). Similarly, one can define the intervention as a probability that the counterfactual exposure is 1 at each time-point t (for stochastic interventions).

OdataDT[, ("TI.set1") := 1L]
OdataDT[, ("TI.set0") := 0L]

Regressions for modeling the propensity scores for censoring (CENS), exposure (TRT) and monitoring (MONITOR). By default, each of these propensity scores is fit with a common model that pools across all available time points (smoothing over time).

gform_CENS <- "C + TI + N ~ highA1c + lastNat1"
gform_TRT <- "TI ~ CVD + highA1c + N.tminus1"
gform_MONITOR <- "N ~ 1"

Stratification, that is, fitting separate models for different time-points, is enabled with logical expressions in arguments stratify_... (see ?fitPropensity). For example, the logical expression below states that we want to fit the censoring mechanism with a separate model for time point 16, while pooling with a common model fit over time-points 0 to 15. Any logical expression can be used to define such stratified modeling. This can be similarly applied to modeling the exposure mechanism (stratify_TRT) and the monitoring mechanism (stratify_MONITOR).

stratify_CENS <- list(C=c("t < 16", "t == 16"))

Fit the propensity scores for censoring, exposure and monitoring:

OData <- fitPropensity(OData, gform_CENS = gform_CENS, gform_TRT = gform_TRT, gform_MONITOR = gform_MONITOR, stratify_CENS = stratify_CENS)

Estimate survival based on non-parametric MSM (IPTW-ADJUSTED KM):

AKME.St.1 <- getIPWeights(OData, intervened_TRT = "TI.set1") %>%
             survNPMSM(OData) %$%

Estimate survival with bounded IPW:

IPW.St.1 <- getIPWeights(OData, intervened_TRT = "TI.set1") %>%

Estimate hazard with IPW-MSM then map into survival estimate. Using two regimens and smoothing over two intervals of time-points:

wts.DT.1 <- getIPWeights(OData = OData, intervened_TRT = "TI.set1", rule_name = "TI1")
wts.DT.0 <- getIPWeights(OData = OData, intervened_TRT = "TI.set0", rule_name = "TI0")
survMSM_res <- survMSM(list(wts.DT.1, wts.DT.0), OData, t_breaks = c(1:8,12,16)-1,)

Define time-points of interest, regression formulas and software to be used for fitting the sequential outcome models:

t.surv <- c(0:15)
Qforms <-"Q.kplus1 ~ CVD + highA1c + N + lastNat1 + TI + TI.tminus1", (max(t.surv)+1))
params = list(fit.package = "speedglm", fit.algorithm = "glm")

G-Computation (pooled):

gcomp_est <- fitSeqGcomp(OData, t_periods = t.surv, intervened_TRT = "TI.set1", Qforms = Qforms, params_Q = params, stratifyQ_by_rule = FALSE)

Targeted Maximum Likelihood Estimation (TMLE) (stratified):

tmle_est <- fitTMLE(OData, t_periods = t.surv, intervened_TRT = "TI.set1", Qforms = Qforms, params_Q = params, stratifyQ_by_rule = TRUE)

To parallelize estimation over several time-points (t.surv) for either GCOMP or TMLE use argument parallel = TRUE:

registerDoParallel(cores = 40)
tmle_est <- fitTMLE(OData, t_periods = t.surv, intervened_TRT = "TI.set1", Qforms = Qforms, params_Q = params, stratifyQ_by_rule = TRUE, parallel = TRUE)

To perform all modeling with H2O-3 distributed Random Forest algorithm just set the global package options fit.package = "h2o" and fit.algorithm = "randomForest" prior to calling any fitting function:

set_all_stremr_options(fit.package = "h2o", fit.algorithm = "randomForest")
h2o::h2o.init(nthreads = -1)
OData <- fitPropensity(OData, gform_CENS = gform_CENS, gform_TRT = gform_TRT, gform_MONITOR = gform_MONITOR, stratify_CENS = stratify_CENS)

Other available algorithms are H2O-3 Gradient Boosting Machines (fit.algorithm = "gbm"), distributed GLM (including LASSO and Ridge) (fit.algorithm = "glm") and Deep Neural Nets (fit.algorithm = "deeplearning").

Use arguments params_... in fitPropensity() and params_Q in fitSeqGcomp() and fitTMLE() to pass various tuning parameters and select different algorithms for different models:

params_TRT = list(fit.package = "h2o", fit.algorithm = "gbm", ntrees = 50, learn_rate = 0.05, sample_rate = 0.8, col_sample_rate = 0.8, balance_classes = TRUE)
params_CENS = list(fit.package = "speedglm", fit.algorithm = "glm")
params_MONITOR = list(fit.package = "speedglm", fit.algorithm = "glm")
OData <- fitPropensity(OData,
          gform_CENS = gform_CENS, stratify_CENS = stratify_CENS, params_CENS = params_CENS,
          gform_TRT = gform_TRT, params_TRT = params_TRT,
          gform_MONITOR = gform_MONITOR, params_MONITOR = params_MONITOR)

Running TMLE based on the previous fit of the propensity scores. Also applying Random Forest to estimate the sequential outcome model:

params_Q = list(fit.package = "h2o", fit.algorithm = "randomForest", ntrees = 100, learn_rate = 0.05, sample_rate = 0.8, col_sample_rate = 0.8, balance_classes = TRUE)
tmle_est <- fitTMLE(OData, t_periods = t.surv, intervened_TRT = "TI.set1", Qforms = Qforms, params_Q = params_Q, stratifyQ_by_rule = TRUE)

###Ensemble Learning with SuperLearner (based on h2oEnsemble R package)


Easy specification of large ensembles with grid search:

  1. Define a learning algorithm (e.g., glm)
  2. Define the search criteria (e.g., 120 second maximum). Increase parameters max_runtime_secs or max_models to cover larger number of models from tuning parameter space.
  3. Define the space of tuning parameters (hyper-parameters) by specifying their learner-specific names and values for grid search (e.g., alpha and lambda for glm).

When running the SuperLearner with grid search, stremr calls the following outside functions:

  1. Runs h2o.grid in the background for each individual learner and saves cross-validated risks.
  2. Calls h2o.stack from h2oEnsemble package to evaluate the final SuperLearner fit on a combination of all learners returned by different grid searches and individually specified learners.

Here is an example defining the grid search criteria and search space of tuning parameters for h2o glm (h2o.glm):

GLM_hyper_params <- list(search_criteria = list(strategy = "RandomDiscrete", max_models = 5),
                         alpha = c(0,1,seq(0.1,0.9,0.1)),
                         lambda = c(0,1e-7,1e-5,1e-3,1e-1))

Another example with grid search for Random Forest (h2o.randomForest) (will be combined with above in a single SuperLearner ensemble):

search_criteria <- list(strategy = "RandomDiscrete", max_models = 5, max_runtime_secs = 60*60)
RF_hyper_params <- list(search_criteria = search_criteria,
                        ntrees = c(100, 200, 300, 500),
                        mtries = 1:4,
                        max_depth = c(5, 10, 15, 20, 25),
                        sample_rate = c(0.7, 0.8, 0.9, 1.0),
                        col_sample_rate_per_tree = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8),
                        balance_classes = c(TRUE, FALSE))

Final example with grid search for Gradient Boosting Machines (h2o.gbm) (will be also combined with above grid searches):

GBM_hyper_params <- list(search_criteria = search_criteria,
                         ntrees = c(100, 200, 300, 500),
                         learn_rate = c(0.005, 0.01, 0.03, 0.06),
                         max_depth = c(3, 4, 5, 6, 9),
                         sample_rate = c(0.7, 0.8, 0.9, 1.0),
                         col_sample_rate = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8),
                         balance_classes = c(TRUE, FALSE))

In addition, we can specify individual learners that we may want to include in the SuperLearner library:

h2o.glm.1 <- function(..., alpha = 0.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.2 <- function(..., x = "highA1c", alpha = 0.0) h2o.glm.wrapper(..., x = x, alpha = alpha)
h2o.glm.3 <- function(..., alpha = 1.0) h2o.glm.wrapper(..., alpha = alpha)

The SuperLearner ensemble is now defined with a single list of parameters that includes the above models. We also define additional SuperLearner-specific parameters here (such as, nfolds - number of folds for cross-validation, metalearner and seed):

SLparams = list(fit.package = "h2o", fit.algorithm = "SuperLearner",
                 grid.algorithm = c("glm", "randomForest", "gbm"),
                 learner = c("h2o.glm.1", "h2o.glm.2", "h2o.glm.3"),
                 metalearner = "h2o.glm_nn",
                 nfolds = 10,
                 seed = 23,
                 glm = GLM_hyper_params,
                 randomForest = RF_hyper_params,
                 gbm = GBM_hyper_params)

We can also save the SuperLearner fits by adding parameters save.ensemble and ensemble.dir.path. This will save the entire ensemble of models that were used by the SuperLearner. Separate directories are required for different SuperLearner models (for example a separate directory for censoring model and a separate directory for treatment model). These pre-saved fits can be loaded at a later time to avoid the lengthy refitting process by using the argument load.ensemble = TRUE.

params_TRT = c(SLparams, save.ensemble = TRUE, ensemble.dir.path = "./h2o-ensemble-model-TRT")

The following example fits the propensity score using above SuperLearner to model the exposure mechanism and using speedglm logistic regressions for censoring and monitoring:

params_CENS = list(fit.package = "speedglm", fit.algorithm = "glm")
params_MONITOR = list(fit.package = "speedglm", fit.algorithm = "glm")
OData <- fitPropensity(OData,
            gform_CENS = gform_CENS, stratify_CENS = stratify_CENS, params_CENS = params_CENS,
            gform_TRT = gform_TRT, params_TRT = params_TRT,
            gform_MONITOR = gform_MONITOR, params_MONITOR = params_MONITOR)

The following example loads the previously saved fits of the SuperLearner for the exposure. The only models fit during this call to fitPropensity are for the monitoring and censoring.

params_TRT = c(SLparams, load.ensemble = TRUE, ensemble.dir.path = "./h2o-ensemble-model-TRT")
OData <- fitPropensity(OData,
            gform_CENS = gform_CENS, stratify_CENS = stratify_CENS, params_CENS = params_CENS,
            gform_TRT = gform_TRT, params_TRT = params_TRT,
            gform_MONITOR = gform_MONITOR, params_MONITOR = params_MONITOR)

The SuperLearner for TMLE and GCOMP is specified in an identical fashion. One needs to specify the relevant parameters and the ensemble models as part of the params_Q argument. However, its currently not possible to save the individual SuperLearner fits of the outcome (Q) model.



This software is distributed under the GPL-2 license.


stremr 0.1

  • Initial version release to CRAN

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.


0.4 by Oleg Sofrygin, 2 months ago

Report a bug at

Browse source code at

Authors: Oleg Sofrygin [aut, cre], Mark J. van der Laan [aut], Romain Neugebauer [aut]

Documentation:   PDF Manual  

GPL (>= 2) license

Imports assertthat, data.table, methods, R6, Rcpp, rmarkdown, pander, speedglm, stats, stringr, zoo

Suggests devtools, h2o, knitr, magrittr, RUnit, foreach, doParallel

Linking to Rcpp

System requirements: pandoc ( for generating and exporting markdown reports to other formats.

See at CRAN