A User-Oriented Statistical Toolkit for Analytical Variance Estimation

Provides a toolkit for analytical variance estimation in survey sampling. Apart from the implementation of standard variance estimators, its main feature is to help the sampling expert produce easy-to-use variance estimation "wrappers", where systematic operations (linearization, domain estimation) are handled in a consistent and transparent way.


Gustave (Gustave: a User-oriented Statistical Toolkit for Analytical Variance Estimation) is an R package that provides a toolkit for analytical variance estimation in survey sampling.

Apart from the implementation of standard variance estimators (Sen-Yates-Grundy, Deville-Tillé), its main feature is to help the methodologist produce easy-to-use variance estimation wrappers, where systematic operations (statistic linearization, domain estimation) are handled in a consistent and transparent way.
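To make "statistic linearization" and "domain estimation" more concrete, here is a minimal base R sketch of how the variance of a domain mean can be approximated through a linearized variable. It does not use gustave and is not the package's implementation. The simulated data, the simple random sampling design and all object names (`d`, `z`, `var_mean_d`) are assumptions made for illustration.

```r
# Hypothetical simple random sample without replacement
set.seed(1)
N <- 1000                                  # population size (assumed known)
n <- 100                                   # sample size
y <- rgamma(n, shape = 2, scale = 500)     # variable of interest
employees <- rpois(n, lambda = 40)
d <- as.numeric(employees >= 40)           # domain dummy (1 inside the domain)
w <- rep(N / n, n)                         # sampling weights

# Point estimate: the domain mean is a ratio of two estimated totals
N_d_hat    <- sum(w * d)
mean_d_hat <- sum(w * y * d) / N_d_hat

# Linearized variable of the domain mean (zero outside the domain)
z <- d * (y - mean_d_hat) / N_d_hat

# The variance of the domain mean is approximated by the variance of the
# estimated total of z under simple random sampling without replacement
var_mean_d <- N^2 * (1 - n / N) * var(z) / n

c(estimate = mean_d_hat, std_error = sqrt(var_mean_d))
```

Once the statistic is linearized, any variance estimator written for totals can be reused as is, which is why this step can be handled systematically.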

The ready-to-use variance estimation wrapper qvar(), adapted for common cases (e.g. stratified simple random sampling, non-response correction through reweighting in homogeneous response groups, calibration), is also included. The core functions of the package (e.g. define_variance_wrapper()) are to be used for more complex cases.
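As an illustration of the simplest of these common cases, the following base R sketch computes the textbook variance estimator of a total under stratified simple random sampling. It is independent of gustave. The strata, the population sizes `N_h` and the simulated variable `y` are hypothetical.

```r
# Hypothetical stratified sample: 30, 20 and 10 units in three strata
set.seed(42)
strata <- rep(c("small", "medium", "large"), times = c(30, 20, 10))
N_h <- c(small = 300, medium = 100, large = 20)   # assumed population sizes
y <- rnorm(length(strata), mean = 100, sd = 15)

# Per-stratum sample size, mean and variance, ordered like N_h
n_h    <- tapply(y, strata, length)[names(N_h)]
ybar_h <- tapply(y, strata, mean)[names(N_h)]
s2_h   <- tapply(y, strata, var)[names(N_h)]

# Estimated total and its estimated variance:
# sum over strata of N_h^2 * (1 - n_h / N_h) * s2_h / n_h
total_hat <- sum(N_h * ybar_h)
var_total <- sum(N_h^2 * (1 - n_h / N_h) * s2_h / n_h)

c(estimate = total_hat, std_error = sqrt(var_total))
```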

gustave is available on CRAN and can therefore be installed with the install.packages() function:

install.packages("gustave")

However, if you wish to install the latest version of gustave, you can use devtools::install_github() to install it directly from the GitHub repository:

install.packages("devtools")
devtools::install_github("martinchevalier/gustave")

Example

In this example, we aim to estimate the variance of estimators computed using simulated data inspired by the Information and communication technology (ICT) survey. This survey has the following characteristics:

  • stratified one-stage sampling design;
  • non-response correction through reweighting in homogeneous response groups based on economic sub-sector and turnover;
  • calibration on margins (number of firms and turnover broken down by economic sub-sector).
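The non-response correction listed above can be sketched in a few lines of base R: within each homogeneous response group (HRG), the sampling weight of each respondent is divided by the weighted response rate of the group. The data frame below is hypothetical, and only its column names mirror those used in the ICT example. Estimating the response probability with a weighted rate is one common choice, assumed here for illustration.

```r
# Hypothetical sample: sampling weight, response dummy, response group
sample_df <- data.frame(
  w_sample = c(10, 10, 10, 10, 20, 20, 20, 20),
  resp     = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE),
  hrg      = c("A", "A", "A", "A", "B", "B", "B", "B")
)

# Weighted response rate within each group, aligned with the rows
resp_w  <- ave(sample_df$w_sample * sample_df$resp, sample_df$hrg, FUN = sum)
total_w <- ave(sample_df$w_sample, sample_df$hrg, FUN = sum)
response_rate <- resp_w / total_w

# Weight after non-response correction (missing for non-respondents)
sample_df$w_nrc <- ifelse(sample_df$resp,
                          sample_df$w_sample / response_rate,
                          NA)

# Within each group, the corrected weights of the respondents add up to
# the sampling weights of the whole group
tapply(sample_df$w_nrc, sample_df$hrg, sum, na.rm = TRUE)
tapply(sample_df$w_sample, sample_df$hrg, sum)
```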

The ICT simulated data files are shipped with the gustave package:

library(gustave)
data(package = "gustave")
? ict_survey

Methodological description of the survey

A variance estimation can be performed in a single call of qvar():

qvar(

  # Sample file
  data = ict_sample,
  
  # Dissemination and identification information
  dissemination_dummy = "dissemination",
  dissemination_weight = "w_calib",
  id = "firm_id",
  
  # Scope
  scope_dummy = "scope",
  
  # Sampling design
  sampling_weight = "w_sample", 
  strata = "strata",
  
  # Non-response correction
  nrc_weight = "w_nrc", 
  response_dummy = "resp", 
  hrg = "hrg",
  
  # Calibration
  calibration_weight = "w_calib",
  calibration_var = c(paste0("N_", 58:63), paste0("turnover_", 58:63)),
  
  # Statistic(s) and variable(s) of interest
  mean(employees)
 
)

Repeating the survey methodology description is, however, cumbersome when several variance estimations are to be conducted. As it does not change from one estimation to another, it can be defined once and for all and then reused for every variance estimation. qvar() allows for this by defining a so-called variance wrapper, that is, an easy-to-use function in which the variance estimation methodology for the given survey is implemented and all the technical data needed to do so are included.

# Definition of the variance estimation wrapper precision_ict
precision_ict <- qvar(

  # As before
  data = ict_sample,
  dissemination_dummy = "dissemination",
  dissemination_weight = "w_calib",
  id = "firm_id",
  scope_dummy = "scope",
  sampling_weight = "w_sample", 
  strata = "strata",
  nrc_weight = "w_nrc", 
  response_dummy = "resp", 
  hrg = "hrg",
  calibration_weight = "w_calib",
  calibration_var = c(paste0("N_", 58:63), paste0("turnover_", 58:63)),
  
  # Replacing the variables of interest by define = TRUE
  define = TRUE
  
)

# Use of the variance estimation wrapper
precision_ict(ict_sample, mean(employees))

# The variance estimation wrapper can also be used on the survey file
precision_ict(ict_survey, mean(speed_quanti))
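The idea of a function that carries both the methodology and the technical data of a survey can be illustrated with a plain R closure. The sketch below is not how gustave builds its wrappers. `make_precision()` and `precision_demo()` are hypothetical names, and the design is reduced to stratified simple random sampling with assumed population sizes.

```r
# Hypothetical factory, NOT gustave's implementation: it only shows how a
# function can keep survey-level technical data in its enclosing environment
make_precision <- function(strata, N_h) {
  force(strata)
  force(N_h)
  function(y, alpha = 0.05) {
    n_h    <- tapply(y, strata, length)[names(N_h)]
    ybar_h <- tapply(y, strata, mean)[names(N_h)]
    s2_h   <- tapply(y, strata, var)[names(N_h)]
    share  <- N_h / sum(N_h)
    # Estimated mean and its variance under stratified SRS
    est      <- sum(share * ybar_h)
    variance <- sum(share^2 * (1 - n_h / N_h) * s2_h / n_h)
    half     <- qnorm(1 - alpha / 2) * sqrt(variance)
    data.frame(est = est, variance = variance,
               lower = est - half, upper = est + half)
  }
}

# The design information is supplied once...
strata <- rep(c("a", "b"), each = 5)
precision_demo <- make_precision(strata, N_h = c(a = 50, b = 100))

# ...and the resulting function only needs the variable of interest
y <- c(1, 2, 3, 4, 5, 10, 12, 14, 16, 18)
res_95 <- precision_demo(y)

# Default values can be changed afterwards with formals<-
formals(precision_demo)$alpha <- 0.10
res_90 <- precision_demo(y)
```

Changing a default with `formals<-` keeps the enclosing environment of the function, so the stored design information remains available.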

Features of the variance estimation wrapper

The variance estimation wrapper is much easier to use than a standard variance estimation function:

  • several statistics in one call (with optional labels):

    precision_ict(ict_survey, 
      "Mean internet speed in Mbps" = mean(speed_quanti), 
      "Turnover per employee" = ratio(turnover, employees)
    )
    
  • domain estimation with where and by arguments

    precision_ict(ict_survey, 
      mean(speed_quanti), 
      where = employees >= 50
    )
    precision_ict(ict_survey, 
      mean(speed_quanti), 
      by = division
    )
    
    # Domain may differ from one estimator to another
    precision_ict(ict_survey, 
      "Mean turnover, firms with 50 employees or more" = mean(turnover, where = employees >= 50),
      "Mean turnover, firms with 100 employees or more" = mean(turnover, where = employees >= 100)
    )
    
  • handy variable evaluation

    # On-the-fly evaluation (e.g. discretization)
    precision_ict(ict_survey, mean(speed_quanti > 100))
    
    # Automatic discretization for qualitative (character or factor) variables
    precision_ict(ict_survey, mean(speed_quali))
    
    # Standard evaluation capabilities
    variables_of_interest <- c("speed_quanti", "speed_quali")
    precision_ict(ict_survey, mean(variables_of_interest))
    
  • Integration with %>% and dplyr

    library(dplyr)
    ict_survey %>% 
      precision_ict("Internet speed above 100 Mbps" = mean(speed_quanti > 100)) %>% 
      select(label, est, lower, upper)
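The discretization of qualitative variables mentioned in the list above amounts to estimating one proportion per level. A rough base R equivalent, which does not rely on gustave and uses hypothetical data and weights, could look like this:

```r
# Hypothetical survey extract: a qualitative variable and its weights
speed_quali <- factor(
  c("slow", "fast", "slow", "medium", "fast", "slow"),
  levels = c("slow", "medium", "fast")
)
w <- c(10, 10, 20, 20, 5, 5)

# One 0/1 dummy variable per level (one column per level)
dummies <- sapply(levels(speed_quali),
                  function(l) as.numeric(speed_quali == l))

# The weighted mean of each dummy is the estimated share of that level
shares <- colSums(w * dummies) / sum(w)
shares
```

Each dummy can then be treated like any quantitative variable, so the same variance machinery applies level by level.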
    

Colophon

This software is an R package developed with the RStudio IDE and the devtools, roxygen2 and testthat packages. Much help was found in R packages and Advanced R, both written by Hadley Wickham.

From the methodological point of view, this package is related to the Poulpe SAS macro (in French) developed at the French statistical institute. From the implementation point of view, some inspiration was found in the ggplot2 package. The idea of developing an R package on this specific topic was stimulated by the icarus package and its author.

News

0.4.0

  • Breaking: Heavy remanufacturing of define_variance_wrapper

    • New: technical_data argument offers a more consistent way to include technical data within the enclosing environment of the wrapper. objects_to_include is kept for non-data objects (such as additional statistic wrappers) or advanced customization.
    • New: technical_param argument offers a more convenient way to specify default values for parameters used by the variance function.
    • New: reference_weight replaces default$weight. This means that the reference weight used for point estimation and linearization is set while defining the variance wrapper and not at run-time.
    • Deprecated: stat, which was a remnant of an early implementation of linearization functions, is not a parameter of the variance wrappers anymore. Its purpose (to apply a given variance wrapper to several variables without having to type the name of the linearization wrapper) is now covered by the standard evaluation capabilities of statistic wrappers (see below).
    • Deprecated: default is replaced by default_id, as default$weight and default$stat are no longer needed. As for default$alpha, its value is set to 0.05 and cannot be changed anymore while defining the variance wrapper (as this can easily be done afterwards using formals<-).
    • Deprecated: objects_to_include_from
  • Breaking: Rebranding and heavy remanufacturing of define_statistic_wrapper (previously known as define_linearization_wrapper), added support for standard evaluation (see define_variance_wrapper examples).

  • New: the qvar function allows for a straightforward variance estimation in common cases (stratified simple random sampling with non-response through reweighting and calibration) and performs both technical and methodological checks.

  • Some normalization in function names: add0 becomes add_zero, sumby becomes sum_by, rescal becomes res_cal

  • Example data: calibration variables in ict_sample instead of ict_survey, new LFS example data

  • Significant increase of unit tests

0.3.1

  • Hotfix: Add calibrated weights to define_variance_wrapper example.

0.3.0

  • Simulated data added
  • Significant increase of unit tests
  • Documentation completed
  • Simplification of the structure of the main object processed by the variance wrapper
  • Removal of unnecessary arguments in linearization wrappers
  • Removal of the linearization wrappers for the Laeken indicators based on the vardpoor package (better integration in a future release)
  • Preparation for a first CRAN release

0.2.7

  • Linearizations with all data parameters set to NULL are now discarded from the estimation.

0.2.6

  • Bug fix: evaluation of variables can occur either in the data argument or in the evaluation environment (envir argument)

0.2.3-0.2.5

  • Several attempts to output more metadata from linearization functions.
  • In the end: ratio() gains two metadata slots, est_num and est_denom

0.2.2

  • Minor bug fixes

0.2.1

  • Beginning of the documentation
  • Renaming of numerous functions and arguments
  • Change the precalc structure in varDT
  • Normalize the treatment of weights
  • New linearization wrappers: gini() and arpr()

0.1.3-0.1.7

  • No more dependency on the pryr package
  • Add the generalized inverse in varDT
  • Other bug fixes


0.4.0 by Martin Chevalier, 5 months ago


https://github.com/martinchevalier/gustave


Report a bug at https://github.com/martinchevalier/gustave/issues


Browse source code at https://github.com/cran/gustave


Authors: Martin Chevalier [aut, cre, cph]




GPL-3 license


Imports methods, utils, stats, Matrix

Suggests testthat, sampling, magrittr, tibble, dplyr, data.table

