Use piping, verbs like 'group_by' and 'summarize', and other 'dplyr' inspired syntactic style when calculating summary statistics on survey data using functions from the 'survey' package.
srvyr brings parts of dplyr's syntax to survey analysis, using the survey package.
srvyr focuses on calculating summary statistics from survey data, such as the mean, total or quantile. It allows for the use of many dplyr verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, rlang's style of non-standard evaluation and more consistent return types than the survey package.
You can try it out:
    install.packages("srvyr")
    # devtools::install_github("gergness/srvyr")
First, describe the variables that define the survey's structure with the function as_survey(), passing the bare column names of the variables that you would use in functions from the survey package like survey::svydesign(), survey::svrepdesign() or survey::twophase().
    library(srvyr, warn.conflicts = FALSE)
    data(api, package = "survey")

    dstrata <- apistrat %>%
      as_survey_design(strata = stype, weights = pw)
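Other designs follow the same pattern. As a minimal sketch, assuming the one-stage cluster sample apiclus1 that ships in the same api data set, the call below mirrors survey::svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1), but with bare column names. The object name dclus1 is only illustrative.

```r
library(srvyr, warn.conflicts = FALSE)
data(api, package = "survey")

# Cluster sample: school districts (dnum) are the sampling units, pw holds
# the sampling weights and fpc the finite population correction.
dclus1 <- apiclus1 %>%
  as_survey_design(ids = dnum, weights = pw, fpc = fpc)

# The result is still a survey design object, with an extra srvyr class on top.
class(dclus1)
```

Because the object keeps the survey package's design class, it can be passed to functions from the survey package as well as to the srvyr verbs.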
Now many of the dplyr verbs are available.
mutate() adds or modifies a variable.

    dstrata <- dstrata %>%
      mutate(api_diff = api00 - api99)
summarise() calculates summary statistics such as mean, total, quantile or ratio.

    dstrata %>%
      summarise(api_diff = survey_mean(api_diff, vartype = "ci"))
    #> # A tibble: 1 x 3
    #>   api_diff api_diff_low api_diff_upp
    #>      <dbl>        <dbl>        <dbl>
    #> 1     32.9         28.8         37.0
group_by() and then summarise() creates summaries by groups.

    dstrata %>%
      group_by(stype) %>%
      summarise(api_diff = survey_mean(api_diff, vartype = "ci"))
    #> # A tibble: 3 x 4
    #>   stype api_diff api_diff_low api_diff_upp
    #>   <fct>    <dbl>        <dbl>        <dbl>
    #> 1 E        38.6         33.1          44.0
    #> 2 H         8.46         1.74         15.2
    #> 3 M        26.4         20.4          32.4
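A single summarise() call can hold several statistics at once. The sketch below is an illustration, not part of the original walkthrough: it assumes that survey_total() called with no variable returns the weighted count of each group, and the output names n and api00_mean are arbitrary.

```r
library(srvyr, warn.conflicts = FALSE)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

by_type <- dstrata %>%
  group_by(stype) %>%
  summarise(
    n = survey_total(),                             # weighted number of schools
    api00_mean = survey_mean(api00, vartype = "se") # mean score with its standard error
  )
by_type
```

Each statistic contributes its estimate plus one column per requested variance type, named with a suffix such as _se.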
    my_model <- survey::svyglm(api99 ~ stype, dstrata)
    summary(my_model)
    #>
    #> Call:
    #> svyglm(formula = api99 ~ stype, dstrata)
    #>
    #> Survey design:
    #> Called via srvyr
    #>
    #> Coefficients:
    #>             Estimate Std. Error t value Pr(>|t|)
    #> (Intercept)   635.87      13.34  47.669   <2e-16 ***
    #> stypeH        -18.51      20.68  -0.895    0.372
    #> stypeM        -25.67      21.42  -1.198    0.232
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #>
    #> (Dispersion parameter for gaussian family taken to be 16409.56)
    #>
    #> Number of Fisher Scoring iterations: 2
-- Kieran Healy, in Data Visualization: A practical introduction
- Yay!
-- Thomas Lumley, in the Biased and Inefficient blog
I do appreciate bug reports, suggestions and pull requests! I started this as a way to learn about R package development, and am still learning, so you'll have to bear with me. Please review the Contributor Code of Conduct, as all participants are required to abide by its terms.
If you're unfamiliar with contributing to an R package, I recommend the guides provided by RStudio's tidyverse team, such as Jim Hester's blog post or Hadley Wickham's R Packages book.
survey_mean/survey_total allow deff="replace" like their survey package forebears. (#46, thanks @mandes95)
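A minimal sketch of this option, assuming the value is passed through to survey::svymean(), where "replace" requests a design effect relative to simple random sampling with replacement:

```r
library(srvyr, warn.conflicts = FALSE)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

out <- dstrata %>%
  summarise(api00 = survey_mean(api00, deff = "replace"))

# The design effect is returned in an extra column with a _deff suffix.
out
```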
Fixes for new release of dplyr
Add warning to explain that design effects cannot be calculated on proportions. (#39, thanks @mlaviolet)
Remove dependency on stringr in tests and add DBI to suggests so that test dependencies are correctly specified (#40, thanks CRAN!)
When converting a db-backed survey from the survey package to a srvyr one, srvyr now tries to capture the updates you've already sent. If dbplyr can convert the function, then it will bring the update. If it can't, it will warn you (#35).
Small bug fixes, mostly having to do with CRAN checks, running on CI services, or for upstream rev dep checks.
srvyr now uses tidy evaluation from rlang. The "underscore" functions have been soft deprecated in favor of quosure splicing. See dplyr's vignette "programming" for more details. In almost all cases, the old syntax will still work, with one exception: the standard evaluation function as_survey_twophase_() had to be changed slightly so that the entire list is inside quotation.
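As a rough illustration of the quosure-based style, the sketch below wraps a grouped mean in a function. The helper survey_mean_by() is hypothetical and not part of srvyr; it captures its arguments with rlang::enquo() and unquotes them with !!.

```r
library(srvyr, warn.conflicts = FALSE)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Hypothetical helper: capture the bare column names as quosures and
# unquote them inside the srvyr verbs.
survey_mean_by <- function(design, group, var) {
  group <- rlang::enquo(group)
  var <- rlang::enquo(var)
  design %>%
    group_by(!!group) %>%
    summarise(mean = survey_mean(!!var))
}

out <- survey_mean_by(dstrata, stype, api00)
out
```

Callers pass bare column names, just as they would to group_by() or summarise() directly.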
Database support has been rewritten. It should be faster now and doesn't require a unique identifier. You also can now convert survey db-backed surveys to srvyr with as_survey.
srvyr now has a pkgdown site, check it out at http://gdfe.co/srvyr
Added support for dplyr mutate_at/_if/_all and summarize_at/_if/_all for srvyr surveys.
Fixed a few bugs introduced with dplyr 0.6. This version of srvyr will work with both old versions of dplyr and 0.6, but may be full of warnings if you update dplyr. Full support for the new dplyr is coming soon.
Fixed a problem with confidence levels not being passed into quantiles
Added deff parameter to survey_mean(), survey_total() and survey_median(), and a df parameter to those functions and survey_quantile() / survey_median().
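To see what the df parameter changes, the sketch below compares a confidence interval built with the design degrees of freedom against one built with df = Inf, which corresponds to normal quantiles. It assumes df feeds the t-distribution used for vartype = "ci"; the names design_df, t_ci and z_ci are illustrative only.

```r
library(srvyr, warn.conflicts = FALSE)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Degrees of freedom of the design, as computed by the survey package.
design_df <- survey::degf(dstrata)

out <- dstrata %>%
  summarise(
    t_ci = survey_mean(api00, vartype = "ci", df = design_df), # t-based interval
    z_ci = survey_mean(api00, vartype = "ci", df = Inf)        # normal-based interval
  )
out
```

With finite degrees of freedom the t-based interval is slightly wider than the normal-based one.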
summarize and mutate match dplyr's behavior when arguments aren't named (uses dplyr::auto_name())
New function cascade summarizes groups, and cascades to create summary statistics of groups of groups.
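A short sketch of cascade() on the stratified api sample, assuming it is called the same way as summarise() on a grouped design. With a single grouping variable it should return one row per group plus one overall row.

```r
library(srvyr, warn.conflicts = FALSE)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

out <- dstrata %>%
  group_by(stype) %>%
  cascade(api99_mean = survey_mean(api99))

# One row for each school type and one for all schools combined.
out
```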
Fixed a bug for confidence intervals for survey_total() on groups.
Fixed some issues with the upcoming version of dplyr.