# Simulation of Correlated Systems of Equations with Multiple Variable Types

Generate correlated systems of statistical equations which represent repeated measurements or clustered data. These systems contain either: a) continuous normal, non-normal, and mixture variables based on the techniques of Headrick and Beasley (2004) or b) continuous (normal, non-normal and mixture), ordinal, and count (regular or zero-inflated, Poisson and Negative Binomial) variables based on the hierarchical linear models (HLM) approach. Headrick and Beasley's method for continuous variables calculates the beta (slope) coefficients based on the target correlations between independent variables and between outcomes and independent variables. The package provides functions to calculate the expected correlations between outcomes, between outcomes and error terms, and between outcomes and independent variables, extending Headrick and Beasley's equations to include mixture variables. These theoretical values can be compared to the simulated correlations. The HLM approach requires specification of the beta coefficients, but permits group and subject-level independent variables, interactions among independent variables, and fixed and random effects, providing more flexibility in the system of equations. Both methods permit simulation of data sets that mimic real-world clinical or genetic data sets (i.e. plasmodes, as in Vaughan et al., 2009, <10.1016/j.csda.2008.02.032>). The techniques extend those found in the 'SimMultiCorrData' and 'SimCorrMix' packages. Standard normal variables with an imposed intermediate correlation matrix are transformed to generate the desired distributions. Continuous variables are simulated using either Fleishman's third-order () or Headrick's fifth-order () power method transformation (PMT). Simulation occurs at the component-level for continuous mixture distributions. These components are transformed into the desired mixture variables using random multinomial variables based on the mixing probabilities. The target correlation matrices are specified in terms of correlations with components of continuous mixture variables. Binary and ordinal variables are simulated by discretizing the normal variables at quantiles defined by the marginal distributions. Count variables are simulated using the inverse CDF method. There are two simulation pathways for the multi-variable type systems which differ by intermediate correlations involving count variables. Correlation Method 1 adapts Yahav and Shmueli's 2012 method and performs best with large count variable means and positive correlations or small means and negative correlations. Correlation Method 2 adapts Barbiero and Ferrari's 2015 modification of the 'GenOrd' package and performs best under the opposite scenarios. There are three methods available for correcting non-positive definite correlation matrices. The optional error loop may be used to improve the accuracy of the final correlation matrices. The package also provides function to check parameter inputs and summarize the simulated systems of equations.

# SimRepeat

The goal of SimRepeat is to generate correlated systems of statistical equations which represent repeated measurements or clustered data. These systems contain either: a) continuous normal, non-normal, and mixture variables based on the techniques of Headrick and Beasley (Headrick and Beasley 2004) or b) continuous (normal, non-normal and mixture), ordinal, and count (regular or zero-inflated, Poisson and Negative Binomial) variables based on the hierarchical linear models (HLM) approach. Headrick and Beasley's method for continuous variables calculates the beta (slope) coefficients based on the target correlations between independent variables and between outcomes and independent variables. The package provides functions to calculate the expected correlations between outcomes, between outcomes and error terms, and between outcomes and independent variables, extending Headrick and Beasley's equations to include mixture variables. These theoretical values can be compared to the simulated correlations. The HLM approach requires specification of the beta coefficients, but permits group and subject-level independent variables, interactions among independent variables, and fixed and random effects, providing more flexibility in the system of equations. Both methods permit simulation of data sets that mimic real-world clinical or genetic data sets (i.e. plasmodes, as in Vaughan et al. (2009)).

The techniques extend those found in the SimMultiCorrData (Fialkowski 2017) and SimCorrMix (Fialkowski 2018) packages. Standard normal variables with an imposed intermediate correlation matrix are transformed to generate the desired distributions. Continuous variables are simulated using either Fleishman's third-order (Fleishman 1978) or Headrick's fifth-order (Headrick 2002) power method transformation (PMT). Simulation occurs at the component-level for continuous mixture distributions. These components are transformed into the desired mixture variables using random multinomial variables based on the mixing probabilities. The target correlation matrices are specified in terms of correlations with components of continuous mixture variables. Binary and ordinal variables are simulated by discretizing the normal variables at quantiles defined by the marginal distributions. Count variables are simulated using the inverse CDF method.

There are two simulation pathways for the multi-variable type systems which differ by intermediate correlations involving count variables. Correlation Method 1 adapts Yahav and Shmueli's 2012 method (Yahav and Shmueli 2012) and performs best with large count variable means and positive correlations or small means and negative correlations. Correlation Method 2 adapts Barbiero and Ferrari's 2015 modification of the GenOrd package (A. Barbiero and Ferrari 2015; Ferrari and Barbiero 2012; Barbiero and Ferrari 2015) and performs best under the opposite scenarios. There are three methods available for correcting non-positive definite correlation matrices. The optional error loop may be used to improve the accuracy of the final correlation matrices. The package also provides function to check parameter inputs and summarize the generated systems of equations.

There are vignettes which accompany this package that may help the user understand the simulation and analysis methods.

1. Theory and Equations for Correlated Systems of Continuous Variables describes the system of continuous variables generated with `nonnormsys` and derives the equations used in `calc_betas`, `calc_corr_y`, `calc_corr_ye`, and `calc_corr_yx`.

2. Correlated Systems of Statistical Equations with Non-Mixture and Mixture Continuous Variables provides examples of using `nonnormsys`.

3. The Hierarchical Linear Models Approach for a System of Correlated Equations with Multiple Variable Types describes the system of ordinal, continuous, and count variables generated with `corrsys` and `corrsys2`.

4. Correlated Systems of Statistical Equations with Multiple Variable Types provides examples of using `corrsys` and `corrsys2`.

## Installation instructions

SimRepeat can be installed using the following code:

## Example 1: System of three equations for 5 independent variables with no random effects

### Description of Variables

1. Ordinal variable: X_ord(1) has 3 categories (i.e., drug treatment) and is the same in each equation
2. Continuous variables:
1. X_cont(1) is a time-varying covariate (subject-level term) with an AR(1, p = 0.5) correlation structure
1. X_cont(11) has a Chisq(df = 2) distribution
2. X_cont(21) has a Chisq(df = 4) distribution
3. X_cont(31) has a Chisq(df = 8) distribution
1. X_mix(1) is a normal mixture time-varying covariate (subject-level term), components have an AR(1, p = 0.4) correlation structure across Y
1. Poisson variable: X_pois(1) is a zero-inflated Poisson variable with mean = 15, the probability of a structural zero set at 0.10, and is the same in each equation
2. Negative Binomial variable: X_nb(1) is a regular NB time-varying covariate (subject-level term) with an AR(1, p = 0.3) correlation structure and increasing mean and variance
1. X_nb(11) has a size of 10 and mean of 3
2. X_nb(21) has a size of 10 and mean of 4
3. X_nb(31) has a size of 10 and mean of 5
1. Error terms have a Beta(4, 1.5) distribution with an AR(1, p = 0.4) correlation structure. These require a sixth cumulant correction of 0.03.

There is an interaction between X_ord(1) and X_pois(1) for each Y. Since they are both group-level covariates, the interaction is also a group-level covariate that will interact with the subject-level covariates X_cont(1), X_mix(1) and X_nb(1). However, only X_ord(1) and X_pois(1) interact with time in this example. Normally their interaction would also interact with time. A description of this HLM model may be found in the package vignettes.

### Step 1: Set up parameter inputs

This is the most time-consuming part of the simulation process. It is important to read the function documentation carefully to understand the formats for each parameter input. Incorrect formatting will lead to errors. Most of these can be prevented by using the `checkpar` function in Step 2.

### Step 3: Generate system

Note that `use.nearPD = FALSE` and `adjgrad = FALSE` so that negative eigen-values will be replaced with `eigmin` (default 0) instead of using the nearest positive-definite matrix (found with Bates and Maechler (2017)'s `Matrix::nearPD` function by Higham (2002)'s algorithm) or the adjusted gradient updating method via `adj_grad` (Yin and Zhang 2013; Zhang and Yin Year not provided; Maree 2012).

c0 c1 c2 c3 c4 c5
-0.3077396 0.8005605 0.3187640 0.0335001 -0.0036748 0.0001587
0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
0.1629657 1.0899841 -0.1873287 -0.0449503 0.0081210 0.0014454

### Step 4: Describe results

Outcome N Mean SD Median Min Max Skew Skurtosis Fifth Sixth
Y1 1 10000 246.918 266.564 159.396 -109.731 2629.224 1.916 5.318 19.638 88.913
Y2 2 10000 337.357 326.326 238.730 -110.941 3104.254 1.676 3.841 11.058 39.062
Y3 3 10000 457.657 397.096 346.616 -25.791 3096.069 1.419 2.483 4.456 5.776
Outcome Mean SD Skew Skurtosis Fifth Sixth
E1 1 0 0.175 -0.694 -0.069 1.828 -3.379
E2 2 0 0.175 -0.694 -0.069 1.828 -3.379
E3 3 0 0.175 -0.694 -0.069 1.828 -3.379
Outcome N Mean SD Median Min Max Skew Skurtosis Fifth Sixth
E1 1 10000 0 0.174 0.027 -0.643 0.389 -0.683 -0.089 1.780 -3.071
E2 2 10000 0 0.174 0.028 -0.646 0.375 -0.698 -0.032 1.704 -3.365
E3 3 10000 0 0.175 0.028 -0.636 0.299 -0.706 -0.092 2.032 -3.845
Outcome X Mean SD Skew Skurtosis Fifth Sixth
cont1_1 1 1 2 2.000 2.000 6.0 24.000 120.0
cont1_2 1 2 -2 1.000 0.000 0.0 0.000 0.0
cont1_3 1 3 2 1.000 0.000 0.0 0.000 0.0
cont2_1 2 1 4 2.828 1.414 3.0 8.485 30.0
cont2_2 2 2 -2 1.000 0.000 0.0 0.000 0.0
cont2_3 2 3 2 1.000 0.000 0.0 0.000 0.0
cont3_1 3 1 8 4.000 1.000 1.5 3.000 7.5
cont3_2 3 2 -2 1.000 0.000 0.0 0.000 0.0
cont3_3 3 3 2 1.000 0.000 0.0 0.000 0.0
Outcome X N Mean SD Median Min Max Skew Skurtosis Fifth Sixth
cont1_1 1 1 10000 2.004 2.037 1.376 -0.448 21.396 2.067 6.262 24.655 118.059
cont1_2 1 2 10000 -2.000 1.001 -1.998 -5.736 1.813 0.009 0.028 -0.097 -0.326
cont1_3 1 3 10000 2.000 1.001 1.996 -2.759 6.091 -0.005 0.042 -0.215 0.721
cont2_1 2 1 10000 4.001 2.842 3.349 -0.274 25.813 1.427 3.031 8.367 26.853
cont2_2 2 2 10000 -2.000 1.002 -1.996 -5.927 2.091 0.019 -0.003 0.146 0.324
cont2_3 2 3 10000 2.000 1.002 2.008 -1.891 6.096 -0.013 -0.045 0.043 0.265
cont3_1 3 1 10000 7.999 3.994 7.389 0.272 36.025 1.016 1.637 3.531 8.915
cont3_2 3 2 10000 -2.000 1.001 -1.997 -6.062 1.377 -0.008 -0.025 -0.006 -0.212
cont3_3 3 3 10000 2.000 1.001 1.996 -1.830 5.986 0.017 0.075 -0.027 -0.201
Outcome X Mean SD Skew Skurtosis Fifth Sixth
mix1_1 1 1 0.4 2.2 -0.289 -1.154 1.793 6.173
mix2_1 2 1 0.4 2.2 -0.289 -1.154 1.793 6.173
mix3_1 3 1 0.4 2.2 -0.289 -1.154 1.793 6.173
Outcome X N Mean SD Median Min Max Skew Skurtosis Fifth Sixth
mix1_1 1 1 10000 0.4 2.2 1.052 -5.613 5.759 -0.293 -1.140 1.814 6.011
mix2_1 2 1 10000 0.4 2.2 1.032 -5.580 5.586 -0.304 -1.152 1.869 6.045
mix3_1 3 1 10000 0.4 2.2 1.038 -6.119 6.015 -0.279 -1.154 1.747 6.235

Summary of Ordinal Variable: (for Y_1)

Outcome Support Target Simulated
1 0 0.333 0.33
1 1 0.667 0.67

Summary of Poisson Variable:

Outcome X N P0 Exp_P0 Mean Exp_Mean Var Exp_Var Median Min Max Skew Skurtosis
1 1 10000 0.096 0.1 13.53 13.5 33.195 40 14 0 32 -0.832 0.755
2 1 10000 0.096 0.1 13.53 13.5 33.195 40 14 0 32 -0.832 0.755
3 1 10000 0.096 0.1 13.53 13.5 33.195 40 14 0 32 -0.832 0.755

Summary of Negative Binomial Variables X_nb(11), X_nb(21), and X_nb(31):

Outcome X N P0 Exp_P0 Prob Mean Exp_Mean Var Exp_Var Median Min Max Skew Skurtosis
1 1 10000 0.074 0.073 0.769 2.999 3 3.923 3.9 3 0 14 0.840 1.026
2 1 10000 0.036 0.035 0.714 4.002 4 5.592 5.6 4 0 18 0.762 0.838
3 1 10000 0.016 0.017 0.667 5.001 5 7.550 7.5 5 0 21 0.768 0.885

Maximum Correlation Errors for X Variables by Outcome:

Y1 Y2 Y3
Y1 0.02037 0.01822 0.01582
Y2 0.01822 0.00754 0.00773
Y3 0.01582 0.00773 0.00773

### Linear model

A linear model will be fit to the data using `glm` in order to see if the slope coefficients can be recovered (R Core Team 2017). First, the data is reshaped into long format using `reshape2::melt` (Wickham 2007). Note that since X_ord(1) and X_pois(1) are the same for each outcome, they will be used as factors (`id.vars`) and are only needed once.

Errors E_1, E_2, and E_3 modeled as having Normal distributions:

Each effect in the model was found to be statistically significant at the alpha = 0.001 level. Now, compare betas used in simulation to those returned by `glm`:

(Intercept) ord1_1 cont1 mix1 pois1_1 nb1
Simulated 0.000 0.500 0.750 1.000 1.25 1.5
Estimated 0.003 0.498 0.751 0.999 1.25 1.5
ord1_1:pois1_1 Time ord1_1:cont1 cont1:pois1_1 ord1_1:cont1:pois1_1 ord1_1:mix1
Simulated 0.5 1.000 0.500 0.6 0.7 0.800
Estimated 0.5 0.998 0.501 0.6 0.7 0.799
mix1:pois1_1 ord1_1:mix1:pois1_1 ord1_1:nb1 pois1_1:nb1 ord1_1:pois1_1:nb1 ord1_1:Time pois1_1:Time
Simulated 0.9 1 1.1 1.2 1.3 0.250 0.5
Estimated 0.9 1 1.1 1.2 1.3 0.249 0.5

All of the slope coefficients are estimated well.

## Example 2: System with 4 equations plus random intercept and random slope for time

### Description of Variables

1. Ordinal variable XO1, where Pr[XO1 = 0]=0.2, Pr[XO1 = 1]=0.35, and Pr[XO1 = 2]=0.45, is a group-level variable and is static across equations.
2. Continuous non-mixture variable XC1 is a subject-level variable with a Logistic(0, 1) distribution, which requires a sixth cumulant correction of 1.75.
3. X terms are correlated at 0.1 within an equation and have an AR(1) structure across equations. The correlations for the static variable are held constant across equations.
4. Random intercept U0 and time slope U1 with Normal(0, 1) distributions. Correlation between random effects is 0.3.
5. The error terms have t(10) distributions (mean 0, variance 1) and an AR(1, 0.4) correlation structure.

In this example, the random intercept and time slope have continuous non-mixture distributions for all Y. However, the functions `corrsys` and `corrsys2` permit a combination of none, non-mixture, and mixture distributions across the Y (i.e., if `rand.int = c("non_mix", "mix", "none")` then the random intercept for Y_1 has a non-mixture, and the random intercept for Y_2 has a mixture distribution; there is no random intercept for Y_3). In addition, the distributions themselves can vary across outcomes. This is also true for random effects assigned to independent variables as specified in `rand.var`.

### Step 4: Describe results

Outcome N Mean SD Median Min Max Skew Skurtosis Fifth Sixth
Y1 1 10000 3.269 3.926 3.047 -14.527 25.868 0.374 0.829 1.476 5.402
Y2 2 10000 5.475 4.969 5.266 -10.926 29.478 0.254 -0.026 -0.052 0.751
Y3 3 10000 7.238 5.585 7.056 -13.822 31.864 0.151 -0.029 0.003 0.585
Y4 4 10000 9.099 6.403 8.950 -12.072 38.790 0.132 0.027 0.177 0.609
Outcome U Mean SD Skew Skurtosis Fifth Sixth
cont1_1 1 1 0 1 0 0 0 0
cont1_2 1 2 0 1 0 0 0 0
Outcome U N Mean SD Median Min Max Skew Skurtosis Fifth Sixth
U_int 1 1 10000 0 1 -0.004 -3.834 3.564 -0.016 0.040 -0.096 -0.343
U_T1 1 2 10000 0 1 0.014 -3.415 3.933 -0.003 -0.044 0.239 -0.171

Maximum Correlation Error for Random Effects:

### Linear mixed model

A linear mixed model will be fit to the data using `lme` from package nlme in order to see if the random effects are estimated according to the simulation parameters (Pinheiro et al. 2017). The data is again reshaped into long format using `reshape2::melt`.

Errors modeled as having Gaussian distributions with an AR(1) correlation structure:

Each effect in the model was again found to be statistically significant at the α = 0.001 level.

Now, compare betas used in simulation to those returned by `lme`:

Simulated Value Std.Error DF t-value p-value
(Intercept) 0.00 -0.003 0.032 29996 -0.108 0.914
ord1_1 1.00 1.001 0.022 9998 46.435 0.000
cont1 1.00 0.997 0.007 29996 151.820 0.000
Time 1.00 1.012 0.021 29996 48.222 0.000
ord1_1:cont1 0.50 0.502 0.004 29996 111.984 0.000
ord1_1:Time 0.75 0.741 0.014 29996 51.844 0.000

Estimated standard deviation and AR(1) parameter for error terms:

Summary of estimated random effects:

Cor SD_int SD_Tsl
0.309 0.991 0.999

## References

Barbiero, A, and P A Ferrari. 2015. “Simulation of Correlated Poisson Variables.” Applied Stochastic Models in Business and Industry 31: 669–80. https://doi.org/10.1002/asmb.2072.

Barbiero, Alessandro, and Pier Alda Ferrari. 2015. GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions. https://CRAN.R-project.org/package=GenOrd.

Bates, Douglas, and Martin Maechler. 2017. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.

Ferrari, P A, and A Barbiero. 2012. “Simulating Ordinal Data.” Multivariate Behavioral Research 47 (4): 566–89. https://doi.org/10.1080/00273171.2012.692630.

Fialkowski, A C. 2017. SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. https://CRAN.R-project.org/package=SimMultiCorrData.

———. 2018. SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions. https://CRAN.R-project.org/package=SimCorrMix.

Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” Psychometrika 43: 521–32. https://doi.org/10.1007/BF02293811.

Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” Computational Statistics and Data Analysis 40 (4): 685–711. https://doi.org/10.1016/S0167-9473(02)00072-5.

Headrick, T C, and T M Beasley. 2004. “A Method for Simulating Correlated Non-Normal Systems of Statistical Equations.” Communications in Statistics - Simulation and Computation 33 (1): 19–33. http://dx.doi.org/10.1081/SAC-120028431.

Higham, N. 2002. “Computing the Nearest Correlation Matrix - a Problem from Finance.” IMA Journal of Numerical Analysis 22 (3): 329–43. https://doi.org/10.1093/imanum/22.3.329.

Maree, S. 2012. “Correcting Non Positive Definite Correlation Matrices.” BSc Thesis Applied Mathematics, Delft University of Technology. https://repository.tudelft.nl/islandora/object/uuid:2175c274-ab03-4fd5.../download.

Pinheiro, Jose, Douglas Bates, Saikat DebRoy, Deepayan Sarkar, and R Core Team. 2017. nlme: Linear and Nonlinear Mixed Effects Models. https://CRAN.R-project.org/package=nlme.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Vaughan, L K, J Divers, M Padilla, D T Redden, H K Tiwari, D Pomp, and D B Allison. 2009. “The Use of Plasmodes as a Supplement to Simulations: A Simple Example Evaluating Individual Admixture Estimation Methodologies.” Computational Statistics and Data Analysis 53 (5): 1755–66. https://doi.org/10.1016/j.csda.2008.02.032.

Wickham, H. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. http://www.jstatsoft.org/v21/i12/.

Yahav, I, and G Shmueli. 2012. “On Generating Multivariate Poisson Data in Management Science Applications.” Applied Stochastic Models in Business and Industry 28 (1): 91–102. https://doi.org/10.1002/asmb.901.

Yin, JF, and Y Zhang. 2013. “Alternative Gradient Algorithms for Computing the Nearest Correlation Matrix.” Applied Mathematics and Computation 219 (14): 7591–9. https://doi.org/10.1016/j.amc.2013.01.045.

Zhang, Y, and JF Yin. Year not provided. “Modified Alternative Gradients Algorithm for Computing the Nearest Correlation Matrix.” Internal paper. Shanghai: Tongji University.

# SimRepeat 0.1.0

Initial package release.

# Reference manual

install.packages("SimRepeat")

0.1.0 by Allison Cynthia Fialkowski, 10 months ago

https://github.com/AFialkowski/SimRepeat

Browse source code at https://github.com/cran/SimRepeat

Authors: Allison Cynthia Fialkowski

Documentation:   PDF Manual