# Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

Fits linear models with endogenous regressors using latent instrumental variable approaches. The methods included in the package are Lewbel's (1997) higher moments approach, Lewbel's (2012) heteroscedasticity approach, Park and Gupta's (2012) joint estimation method that uses a Gaussian copula, and Kim and Frees's (2007) multilevel generalized method of moments approach that deals with endogeneity in a multilevel setting. These statistical techniques address the endogeneity problem without requiring external instrumental variables. Note that version 2.0.0 introduced sweeping changes that greatly improve functionality and usability but break backwards compatibility.

Endogeneity arises when the independence assumption between an explanatory variable and the error in a statistical model is violated. Among its most common causes are omitted variable bias (e.g. ability in the estimation of returns to education), measurement error (e.g. survey response bias), and simultaneity (e.g. advertising and sales).
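As a rough illustration of the omitted variable case, the following base R sketch simulates the returns-to-education example. The variable names, sample size and coefficient values are all made up for illustration and are not taken from the package or from any real dataset.

```r
# Simulated illustration of omitted variable bias (all numbers are hypothetical).
set.seed(42)
n <- 5000

ability   <- rnorm(n)                    # unobserved in practice
education <- 0.8 * ability + rnorm(n)    # regressor correlated with ability
wage      <- 1 + 0.5 * education + 1.0 * ability + rnorm(n)

# Omitting ability pushes it into the error term, so education becomes endogenous.
b_short <- coef(lm(wage ~ education))["education"]

# Including ability (infeasible in practice) recovers an estimate near the true 0.5.
b_long <- coef(lm(wage ~ education + ability))["education"]

round(c(omitting_ability  = unname(b_short),
        including_ability = unname(b_long)), 3)
```

Under these simulated values, the regression that omits ability converges to roughly 0.5 + 0.8/1.64 ≈ 0.99 rather than the true 0.5, which is the kind of bias the methods in this package aim to correct when no external instrument is available.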

Instrumental variable estimation is a common treatment when endogeneity is of concern. However, valid and strong external instruments are difficult to find. Consequently, statistical methods that correct for endogeneity without external instruments have been advanced. They are called internal instrumental variable (IIV) models.

REndo implements the following instrument-free methods:

1. latent instrumental variables approach (Ebbes, Wedel, Boeckenholt, and Steerneman 2005)

2. higher moments estimation (Lewbel 1997)

3. heteroskedastic error approach (Lewbel 2012)

4. joint estimation using copula (Park and Gupta 2012)

5. multilevel GMM (Kim and Frees 2007)

## The new version - REndo 2.0.0

The new version of REndo brings many code optimizations as well as a different syntax for all functions.

Below, we present the new syntax for each function call:

### Latent Instrumental Variables

latentIV(y ~ P, data, start.params=c())

The first argument is the formula of the model to be estimated, y ~ P, where y is the response and P is the endogenous regressor. The second argument is the name of the dataset used and the last one, start.params=c(), which is optional, is a vector with the initial parameter values. When not indicated, the initial parameter values are taken to be the coefficients returned by the OLS estimator of y on P.
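A minimal usage sketch follows. It assumes that the example dataset dataLatentIV shipped with REndo contains the columns y and P, and that start.params is given as a vector named after the model coefficients. The start values themselves are arbitrary and chosen only for illustration.

```r
library(REndo)

# dataLatentIV is an example dataset shipped with REndo
# (assumed here to contain the response y and the endogenous regressor P).
data("dataLatentIV")

# Without start.params, the OLS coefficients of y on P serve as start values.
fit_default <- latentIV(y ~ P, data = dataLatentIV)

# User-supplied start values, named after the model coefficients.
# The numbers are arbitrary and for illustration only.
fit_start <- latentIV(y ~ P, data = dataLatentIV,
                      start.params = c("(Intercept)" = 2.5, P = -0.5))

summary(fit_start)
```

Because the model is estimated by optimizing a log-likelihood, comparing the two fits is a simple way to check whether the result is sensitive to the start values.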

### Copula Correction

copulaCorrection(y ~ X1 + X2 + P1 + P2 | continuous(P1) + discrete(P2), data, start.params=c(), num.boots)

The first argument is a two-part formula of the model to be estimated, with the second part of the RHS defining the endogenous regressors, here continuous(P1) + discrete(P2). The second argument is the name of the data. The third argument, start.params, is optional and holds the initial parameter values supplied by the user (when missing, the OLS estimates are used). The fourth argument, num.boots, is also optional and sets the number of bootstraps to be performed (the default is 1000). How the endogenous regressors are specified depends on their number and their assumed distribution. Transformations of the explanatory variables, such as I(X) or log(X), are supported.
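The sketch below shows one way such a call might look for a single continuous endogenous regressor. It assumes that the example dataset dataCopCont shipped with REndo contains the columns y, X1, X2 and P. The number of bootstraps is deliberately set far below the default only to keep the example quick, and the package may warn about this.

```r
library(REndo)

# dataCopCont is an example dataset shipped with REndo
# (assumed columns: y, X1, X2 and the continuous endogenous regressor P).
data("dataCopCont")

# num.boots is kept small only so that the sketch runs quickly;
# for real analyses the default of 1000 is the safer choice.
fit_cop <- copulaCorrection(y ~ X1 + X2 + P | continuous(P),
                            data = dataCopCont, num.boots = 50)

summary(fit_cop)
```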

### Higher Moments

higherMomentsIV(y ~ X1 + X2 + P | P | IIV(iiv = gp, g = x2, X1, X2) + IIV(iiv = yp) | Z1, data)

Here, y is the response; the first RHS of the formula, X1 + X2 + P, is the model to be estimated; the second part, P, specifies the endogenous regressors; the third part, IIV(), specifies the format of the internal instruments; the fourth part, Z1, is optional, allowing the user to add any external instruments available.

Regarding the third part of the formula, IIV(), it has a set of three arguments:

• iiv - specifies the form of the instrument,
• g - specifies the transformation to be done on the exogenous regressors,
• the set of exogenous variables from which the internal instruments should be built (it can be one or all of the exogenous variables).

A set of six instruments can be constructed, which should be specified in the iiv argument of IIV():

• g - for $(G_{t} - \bar{G})$,
• gp - for $(G_{t} - \bar{G})(P_{t}-\bar{P})$,
• gy - for $(G_{t} - \bar{G})(Y_{t}-\bar{Y})$,
• yp - for $(Y_{t} - \bar{Y})(P_{t}-\bar{P})$,
• p2 - for $(P_{t} - \bar{P})^2$,
• y2 - for $(Y_{t} - \bar{Y})^2$.

where $G=G(X_{t})$ can be either $x^2$, $x^3$, $\ln(x)$ or $\frac{1}{x}$ and should be specified in the g argument of the third RHS part of the formula as x2, x3, lnx or 1/x. For internal instruments built only from the endogenous regressor (e.g. p2) or from the response and the endogenous regressor (e.g. yp), there is no need to specify g or the set of exogenous regressors in the IIV() part of the formula. The function returns a set of tests for checking the validity of the instruments and the endogeneity assumption.
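To make the six instrument definitions concrete, the following base R sketch computes them by hand on simulated data with g = x2. It is only a direct transcription of the formulas above, not the code used inside higherMomentsIV, and the data and the helper center() are hypothetical.

```r
# Hypothetical data: exogenous X1, endogenous P, response y.
set.seed(1)
n  <- 200
X1 <- runif(n, 1, 3)
P  <- rnorm(n)
y  <- 1 + X1 - P + rnorm(n)

# Helper (hypothetical): subtract the sample mean.
center <- function(v) v - mean(v)

# g = x2 corresponds to G(X) = X^2.
G <- X1^2

# One column per instrument type accepted by the iiv argument.
iiv <- cbind(
  g  = center(G),
  gp = center(G) * center(P),
  gy = center(G) * center(y),
  yp = center(y) * center(P),
  p2 = center(P)^2,
  y2 = center(y)^2
)

head(iiv, 3)
```

Only g, gp and gy depend on the exogenous regressor, which is why yp, p2 and y2 need neither g nor a list of exogenous variables in IIV().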

### Heteroskedastic Errors

hetErrorsIV(y ~ X1 + X2 + X3 + P | P | IIV(X1,X2) | Z1, data)

Here, y is the response variable and X1 + X2 + X3 + P represents the model to be estimated. The second part, P, specifies the endogenous regressors. The third part, IIV(X1, X2), specifies the exogenous heteroskedastic variables from which the instruments are derived. The final part, Z1, is optional and allows the user to include additional external instrumental variables. As in the higher moments approach, allowing the inclusion of additional external variables is a convenient feature of the function, since it increases the efficiency of the estimates. Transformations of the explanatory variables, such as I(X) or log(X), are possible both in the model specification and in the IIV() specification.
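A minimal usage sketch without external instruments might look as follows. It assumes that the example dataset dataHetIV shipped with REndo contains the columns y, X1, X2 and P. The choice of X2 as the heteroskedastic variable is for illustration only.

```r
library(REndo)

# dataHetIV is an example dataset shipped with REndo
# (assumed columns: y, X1, X2 and the endogenous regressor P).
data("dataHetIV")

# Instruments are built from X2 only; the optional fourth formula part
# (external instruments) is left out.
fit_het <- hetErrorsIV(y ~ X1 + X2 + P | P | IIV(X2), data = dataHetIV)

summary(fit_het)
```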

### Multilevel GMM

multilevelIV(y ~ X11 + X12 + X21 + X22 + X23 + X31 + X33 + X34 + (1|CID) + (1|SID) | endo(X12), data)

The function call takes a two-part formula and an argument for data specification. In the formula, the first part is the model specification, including the fixed and random parameters, and the second part specifies the regressors assumed to be endogenous, here endo(X12). The function returns the parameter estimates obtained with fixed effects, random effects and the GMM estimator proposed by Kim and Frees (2007), so that the models can be compared.
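The sketch below shows how such a call might look with the example data shipped with the package. The column names (X11 to X15, X21 to X24, X31 to X33, CID, SID) and the choice of X15 as the endogenous regressor are assumptions about the current dataMultilevelIV dataset and may differ between package versions.

```r
library(REndo)

# dataMultilevelIV is an example dataset shipped with REndo.
# The column names used below are assumed and may vary by package version.
data("dataMultilevelIV")

# Three-level model: random intercepts for CID and SID, X15 treated as endogenous.
f_ml <- y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
  X31 + X32 + X33 + (1 | CID) + (1 | SID) | endo(X15)

fit_ml <- multilevelIV(f_ml, data = dataMultilevelIV)

# Coefficients of the estimated models, for comparison across estimators.
coef(fit_ml)
```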

## Installation Instructions

Install the stable version from CRAN:

install.packages("REndo")

Install the development version from GitHub:

devtools::install_github("mmeierer/REndo", ref = "development")

# CHANGES IN REndo 2.2.0

## SIGNIFICANT USER-VISIBLE CHANGES

• The augmented OLS method in copulaCorrection also bootstraps parameter estimates
• The summary output for results from copulaCorrection was adapted to reflect that standard errors are bootstrapped
• Removed support for the S3 method labels because of inconsistent behavior across methods

## NEW FEATURES

• Bootstrapping for copulaCorrection case 1 is now considerably faster
• New data was generated for dataMultilevelIV

## BUG FIXES

• The sigma matrix in latentIV is now constructed as in the paper by Ebbes, which improves results. Special thanks to Jordan Lawson for investigating and pointing this out!
• In the latentIV, the parameter for group membership (theta5) is transformed back and now reported correctly.
• The vcov matrix for latentIV is corrected for the transformation in theta5.
• The bootstrapping in copulaCorrection case 1 now creates samples of the same length as the original data
• The (percentile) confidence intervals and vcov for results from copulaCorrection are now derived with bootstrapping

# CHANGES IN REndo 2.1.0

## SIGNIFICANT USER-VISIBLE CHANGES

• The reworked method multilevelIV and accompanying data were added back to the package
• Vignette REndo-introduction was added to showcase package usage
• Users can supply a parameter optimx.args to tweak the LL optimization to their liking

## NEW FEATURES

• Method confint was added for methods latentIV and copulaCorrection
• Examples and documentation were improved for all methods
• New data was generated for dataHetIV
• The default number of iterations for all optimizations was increased to 100'000

## BUG FIXES

• To avoid infrequent warnings, the parameter sigma used in copulaCorrection was constrained to > 0
• Various spelling mistakes were fixed

# CHANGES IN REndo 2.0.0

## SIGNIFICANT USER-VISIBLE CHANGES

• Remodeled all methods' user-interface
• Added detailed input checks for every provided parameter
• Parameter verbose allows printing to be turned on or off
• Updated documentation to reflect all changes and added theoretical background

## NEW FEATURES

• Formulas support transformations
• Improve all code to be more reliable and stable
• Provide new example datasets and accompanying documentation for all methods
• Added extensive testing for all aspects of the package
• Increased numerical stability for log-likelihood optimization methods latentIV and copulaCorrection

• Many

## DEPRECATED AND DEFUNCT

• The multilevel function is temporarily removed from the package due to ongoing work on it

# Reference manual

2.4.3 by Raluca Gui, 5 months ago

https://github.com/mmeierer/REndo

Report a bug at https://github.com/mmeierer/REndo/issues

Browse source code at https://github.com/cran/REndo

Authors: Raluca Gui [cre, aut] , Markus Meierer [aut] , Rene Algesheimer [aut] , Patrik Schilter [aut]
