Grouped Statistical Analyses in a Tidy Way

Collection of functions to run statistical tests across all levels of multiple grouping variables.


CRAN_Release_Badge CRANChecks packageversion Daily downloadsbadge Weekly downloadsbadge Monthly downloadsbadge Total downloadsbadge Travis BuildStatus AppVeyor BuildStatus CircleCI Licence DOI Project Status: Active - The project has reached a stable, usablestate and is being activelydeveloped. Last-changedate lifecycle minimal Rversion CoverageStatus

Overview

groupedstats package provides a collection of functions to run statistical operations on multiple variables across multiple grouping variables in a dataframe. This is a common situation, as illustrated by few example cases-

  1. If you have combined multiple studies in a single dataframe and want to run a common operation (e.g., linear regression) on each study. In this case, column corresponding to study will be the grouping variable.
  2. If you have multiple groups in your dataframe (e.g., clinical disorder groups and controls group) and you want to carry out the same operation for each group (e.g., planned t-test to check for differences in reaction time in condition 1 versus condition 2 for both groups). In this case, group will be the grouping variable.
  3. If you have multiple conditions in a given study (e.g., six types of videos participants saw) and want to run a common operation between different measures of interest for each condition (e.g., correlation between subjective rating of emotional intensity and reaction time).
  4. Combination of all of the above.

This package is still work in progress and it currently supports only the most basic statistical operations (from stats and lme4 package). The next releases will expand on the existing functionality (e.g., ordinal).

Installation

To get the latest, stable CRAN release (0.0.5):

utils::install.packages(pkgs = "groupedstats") 

You can get the development version of the package from GitHub (0.0.5.9000). To see what new changes (and bug fixes) have been made to the package since the last release on CRAN, you can check the detailed log of changes here: https://indrajeetpatil.github.io/groupedstats/news/index.html

If you are in hurry and want to reduce the time of installation, prefer-

# needed package to download from GitHub repo
utils::install.packages(pkgs = "devtools")                 
 
devtools::install_github(repo = "IndrajeetPatil/groupedstats",  # package path on GitHub
                         quick = TRUE)                          # skips docs, demos, and vignettes

If time is not a constraint-

devtools::install_github(repo = "IndrajeetPatil/groupedstats", # package path on GitHub
                         dependencies = TRUE,                  # installs packages which groupedstats depends on
                         upgrade_dependencies = TRUE           # updates any out of date dependencies
)

Help

There is a dedicated website to groupedstats, which is updated after every new commit: https://indrajeetpatil.github.io/groupedstats/.

In R, documentation for any function can be accessed with the standard help command-

?grouped_aov
?grouped_lm
?grouped_lmer
?grouped_glm
?grouped_glmer
?grouped_summary
?grouped_slr
?grouped_robustslr
?grouped_proptest
?grouped_ttest
?grouped_wilcox

Another handy tool to see arguments to any of the functions is args. For example-

args(name = groupedstats::grouped_ttest)
#> function (data, dep.vars, indep.vars, grouping.vars, paired = FALSE, 
#>     var.equal = FALSE) 
#> NULL

In case you want to look at the function body for any of the functions, just type the name of the function without the parentheses:

groupedstats::grouped_lm

If you are not familiar either with what the namespace :: does or how to use pipe operator %>%, something this package and its documentation relies a lot on, you can check out these links-

Usage

grouped_summary

Getting summary for multiple variables across multiple grouping variables. This function is a wrapper around skimr::skim_to_wide(). It is supposed to be a handy summarizing tool if you have multiple grouping variables and multiple variables for which summary statistics are to be computed-

library(datasets)
options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_summary(data = datasets::iris,
                              grouping.vars = Species,
                              measures = Sepal.Length:Petal.Width,
                              measures.type = "numeric")
#> # A tibble: 12 x 16
#>    Species    variable     type    missing complete     n  mean    sd   min
#>    <fct>      <chr>        <chr>     <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 setosa     Sepal.Length numeric       0       50    50 5.01  0.352   4.3
#>  2 setosa     Sepal.Width  numeric       0       50    50 3.43  0.379   2.3
#>  3 setosa     Petal.Length numeric       0       50    50 1.46  0.174   1  
#>  4 setosa     Petal.Width  numeric       0       50    50 0.246 0.105   0.1
#>  5 versicolor Sepal.Length numeric       0       50    50 5.94  0.516   4.9
#>  6 versicolor Sepal.Width  numeric       0       50    50 2.77  0.314   2  
#>  7 versicolor Petal.Length numeric       0       50    50 4.26  0.470   3  
#>  8 versicolor Petal.Width  numeric       0       50    50 1.33  0.198   1  
#>  9 virginica  Sepal.Length numeric       0       50    50 6.59  0.636   4.9
#> 10 virginica  Sepal.Width  numeric       0       50    50 2.97  0.322   2.2
#> 11 virginica  Petal.Length numeric       0       50    50 5.55  0.552   4.5
#> 12 virginica  Petal.Width  numeric       0       50    50 2.03  0.275   1.4
#>      p25 median   p75   max std.error mean.low.conf mean.high.conf
#>    <dbl>  <dbl> <dbl> <dbl>     <dbl>         <dbl>          <dbl>
#>  1  4.8    5     5.2    5.8    0.0498         4.91           5.11 
#>  2  3.2    3.4   3.68   4.4    0.0536         3.32           3.54 
#>  3  1.4    1.5   1.58   1.9    0.0246         1.41           1.51 
#>  4  0.2    0.2   0.3    0.6    0.0149         0.216          0.276
#>  5  5.6    5.9   6.3    7      0.0730         5.79           6.08 
#>  6  2.52   2.8   3      3.4    0.0444         2.68           2.86 
#>  7  4      4.35  4.6    5.1    0.0665         4.13           4.39 
#>  8  1.2    1.3   1.5    1.8    0.0280         1.27           1.38 
#>  9  6.22   6.5   6.9    7.9    0.0899         6.41           6.77 
#> 10  2.8    3     3.18   3.8    0.0456         2.88           3.07 
#> 11  5.1    5.55  5.88   6.9    0.0780         5.40           5.71 
#> 12  1.8    2     2.3    2.5    0.0388         1.95           2.10

This function can be used to get summary of either numeric or factor variables, but not both. This is by design. If no measures are specified, the function will compute summary for all variables of the specified type (numeric or factor).

If you want summary of variables of factor type-

library(ggplot2)
options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_summary(data = ggplot2::diamonds,
                              grouping.vars = c(cut, clarity),
                              measures = color,
                              measures.type = "factor")
#> # A tibble: 40 x 10
#>    cut       clarity variable type   missing complete     n ordered
#>    <ord>     <ord>   <chr>    <chr>    <int>    <int> <int> <lgl>  
#>  1 Ideal     SI2     color    factor       0     2598  2598 TRUE   
#>  2 Premium   SI1     color    factor       0     3575  3575 TRUE   
#>  3 Good      VS1     color    factor       0      648   648 TRUE   
#>  4 Premium   VS2     color    factor       0     3357  3357 TRUE   
#>  5 Good      SI2     color    factor       0     1081  1081 TRUE   
#>  6 Very Good VVS2    color    factor       0     1235  1235 TRUE   
#>  7 Very Good VVS1    color    factor       0      789   789 TRUE   
#>  8 Very Good SI1     color    factor       0     3240  3240 TRUE   
#>  9 Fair      VS2     color    factor       0      261   261 TRUE   
#> 10 Very Good VS1     color    factor       0     1775  1775 TRUE   
#>    n_unique top_counts                    
#>       <int> <chr>                         
#>  1        7 G: 486, E: 469, F: 453, H: 450
#>  2        7 H: 655, E: 614, F: 608, G: 566
#>  3        7 G: 152, F: 132, I: 103, E: 89 
#>  4        7 G: 721, E: 629, F: 619, H: 532
#>  5        7 D: 223, E: 202, F: 201, G: 163
#>  6        7 G: 302, E: 298, F: 249, H: 145
#>  7        7 G: 190, F: 174, E: 170, H: 115
#>  8        7 E: 626, F: 559, H: 547, D: 494
#>  9        7 F: 53, G: 45, E: 42, H: 41    
#> 10        7 G: 432, E: 293, F: 293, H: 257
#> # ... with 30 more rows

Note that there is a column corresponding to top_counts which is really useful in case you, let’s say, want to plot these counts. But this column is of character type and in wide format. The solution is to use an additional argument provided for this function:

library(ggplot2)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:rlang':
#> 
#>     set_names
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(ggstatsplot)
 
options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_summary(
  data = ggplot2::diamonds,
  grouping.vars = cut,                          # for simplicity, let's just use one grouping variable
  measures = color,
  measures.type = "factor",
  topcount.long = TRUE
) %>% 
  ggplot2::ggplot(
  data = .,                                     # placeholder for summary dataframe we just created 
  mapping = ggplot2::aes(
    x = forcats::fct_inorder(f = factor.level),
    y = count,
    fill = factor.level
  )
) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::labs(x = "color", y = "count") +
  ggplot2::facet_grid(facets = ~ cut) +         # for each level of the factor level
  ggstatsplot::theme_mprl() +
  ggplot2::theme(legend.position = "none")
#> Joining, by = "cut"

This produces a long format table with two new columns factor.level and its corresponding count, which can then be immediately fed into other pipelines, e.g., preparing a plot of mean and sd values in ggplot2).

options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_summary(
  data = ggplot2::diamonds,
  grouping.vars = c(cut, clarity)
)
#> # A tibble: 280 x 17
#>    cut     clarity variable type    missing complete     n     mean
#>    <ord>   <ord>   <chr>    <chr>     <dbl>    <dbl> <dbl>    <dbl>
#>  1 Ideal   SI2     carat    numeric       0     2598  2598    1.01 
#>  2 Ideal   SI2     depth    numeric       0     2598  2598   61.7  
#>  3 Ideal   SI2     table    numeric       0     2598  2598   56.1  
#>  4 Ideal   SI2     price    numeric       0     2598  2598 4756.   
#>  5 Ideal   SI2     x        numeric       0     2598  2598    6.26 
#>  6 Ideal   SI2     y        numeric       0     2598  2598    6.27 
#>  7 Ideal   SI2     z        numeric       0     2598  2598    3.87 
#>  8 Premium SI1     carat    numeric       0     3575  3575    0.909
#>  9 Premium SI1     depth    numeric       0     3575  3575   61.3  
#> 10 Premium SI1     table    numeric       0     3575  3575   58.8  
#>          sd    min     p25  median     p75      max std.error mean.low.conf
#>       <dbl>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>     <dbl>         <dbl>
#>  1    0.509   0.23    0.62    1       1.2      3.01   0.01000         0.988
#>  2    0.824  58.3    61.2    61.8    62.3     65.5    0.0162         61.7  
#>  3    1.30   52      55      56      57       62      0.0254         56.1  
#>  4 4252.    326    1443    4060.   5402.   18804     83.4          4592.   
#>  5    1.07    0       5.50    6.4     6.82     9.25   0.0210          6.22 
#>  6    1.05    3.98    5.53    6.4     6.82     9.2    0.0207          6.23 
#>  7    0.662   0       3.38    3.95    4.21     5.69   0.0130          3.84 
#>  8    0.481   0.21    0.5     0.9     1.15     2.57   0.00804         0.893
#>  9    1.17   58      60.5    61.5    62.3     63      0.0196         61.3  
#> 10    1.53   51      58      59      60       62      0.0255         58.7  
#>    mean.high.conf
#>             <dbl>
#>  1          1.03 
#>  2         61.7  
#>  3         56.2  
#>  4       4920.   
#>  5          6.30 
#>  6          6.31 
#>  7          3.89 
#>  8          0.924
#>  9         61.3  
#> 10         58.8  
#> # ... with 270 more rows

grouped_slr

This function can be used to run simple linear regression (slr) between different pairs of variables across multiple levels of grouping variable(s). For example, we can use the gapminder dataset to study two relationships of interest for each country across years:

  1. life expectancy and GDP (per capita)
  2. population GDP (per capita) Thus, in this case we have two regression models and one grouping variable with 142 levels (countries)
library(gapminder)
options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_slr(data = gapminder::gapminder,
                         dep.vars = c(lifeExp, pop),
                         indep.vars = c(gdpPercap, gdpPercap),
                         grouping.vars = country)
#> # A tibble: 284 x 9
#>    country     formula             t.value estimate conf.low conf.high
#>    <fct>       <chr>                 <dbl>    <dbl>    <dbl>     <dbl>
#>  1 Afghanistan lifeExp ~ gdpPercap  -0.151  -0.0475   -0.751     0.656
#>  2 Albania     lifeExp ~ gdpPercap   4.84    0.837     0.452     1.22 
#>  3 Algeria     lifeExp ~ gdpPercap   6.71    0.904     0.604     1.21 
#>  4 Angola      lifeExp ~ gdpPercap  -0.998  -0.301    -0.973     0.371
#>  5 Argentina   lifeExp ~ gdpPercap   4.74    0.832     0.440     1.22 
#>  6 Australia   lifeExp ~ gdpPercap  19.0     0.986     0.871     1.10 
#>  7 Austria     lifeExp ~ gdpPercap  26.5     0.993     0.910     1.08 
#>  8 Bahrain     lifeExp ~ gdpPercap   6.45    0.898     0.587     1.21 
#>  9 Bangladesh  lifeExp ~ gdpPercap   5.05    0.847     0.473     1.22 
#> 10 Belgium     lifeExp ~ gdpPercap  26.1     0.993     0.908     1.08 
#>    std.error  p.value significance
#>        <dbl>    <dbl> <chr>       
#>  1    0.316  8.83e- 1 ns          
#>  2    0.173  6.82e- 4 ***         
#>  3    0.135  5.33e- 5 ***         
#>  4    0.302  3.42e- 1 ns          
#>  5    0.176  7.97e- 4 ***         
#>  6    0.0519 3.52e- 9 ***         
#>  7    0.0374 1.34e-10 ***         
#>  8    0.139  7.38e- 5 ***         
#>  9    0.168  5.03e- 4 ***         
#> 10    0.0380 1.56e-10 ***         
#> # ... with 274 more rows

Notice the order in which the dependent and independent variables are entered; there are two separate regression models being run here: lifeExp ~ gdpPercap and pop ~ gdpPercap If this order is incorrect, the result will also be incorrect. So it is always a good idea to check the formula column to see if you have run the correct linear models. Also, note that the estimates are already standardized, i.e. estimates are standardized regression coefficients (betas, i.e.).

The prior example was with just one grouping variable. This can be done with multiple grouping variables as well. For example, with the diamonds dataset from ggplot2 library, let’s assess the relation between carat and price of a diamond for each type of clarity and cut-

library(ggplot2)
library(dplyr)
options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_slr(data = ggplot2::diamonds,
                         dep.vars = price,
                         indep.vars = carat,
                         grouping.vars = c(cut, clarity)) %>%
  dplyr::arrange(.data = ., cut)
#> # A tibble: 40 x 10
#>    cut   clarity formula       t.value estimate conf.low conf.high
#>    <ord> <ord>   <chr>           <dbl>    <dbl>    <dbl>     <dbl>
#>  1 Fair  VS2     price ~ carat   42.0     0.934    0.890     0.978
#>  2 Fair  SI2     price ~ carat   69.1     0.955    0.928     0.982
#>  3 Fair  SI1     price ~ carat   58.9     0.946    0.915     0.978
#>  4 Fair  I1      price ~ carat   58.7     0.971    0.939     1.00 
#>  5 Fair  VVS1    price ~ carat    8.58    0.911    0.685     1.14 
#>  6 Fair  VS1     price ~ carat   36.6     0.943    0.892     0.994
#>  7 Fair  IF      price ~ carat    8.22    0.952    0.678     1.23 
#>  8 Fair  VVS2    price ~ carat   13.6     0.857    0.732     0.983
#>  9 Good  VS1     price ~ carat   74.0     0.946    0.921     0.971
#> 10 Good  SI2     price ~ carat  103.      0.953    0.934     0.971
#>    std.error   p.value significance
#>        <dbl>     <dbl> <chr>       
#>  1   0.0222  1.55e-117 ***         
#>  2   0.0138  1.96e-246 ***         
#>  3   0.0161  4.58e-201 ***         
#>  4   0.0165  1.86e-131 ***         
#>  5   0.106   3.59e-  7 ***         
#>  6   0.0257  5.40e- 82 ***         
#>  7   0.116   7.67e-  5 ***         
#>  8   0.0629  5.54e- 21 ***         
#>  9   0.0128  9.62e-318 ***         
#> 10   0.00926 0.        ***         
#> # ... with 30 more rows

A more general version of this function (grouped_lm) will be implemented in future that will utilize the formula interface of stats::lm.

grouped_robustslr

There is also robust variant of simple linear regression (as implemented in robust::lmRob)-

library(gapminder)
library(dplyr)
options(tibble.width = Inf) # show me all columns
 
groupedstats::grouped_robustslr(
  data = gapminder::gapminder,
  dep.vars = c(lifeExp, pop),
  indep.vars = c(gdpPercap, gdpPercap),
  grouping.vars = c(continent, country)
) %>%
  dplyr::arrange(.data = ., continent, country)
#> # A tibble: 104 x 8
#>    continent country      formula             t.value estimate std.error
#>    <fct>     <fct>        <chr>                 <dbl>    <dbl>     <dbl>
#>  1 Africa    Algeria      lifeExp ~ gdpPercap   2.46     0.773    0.315 
#>  2 Africa    Algeria      pop ~ gdpPercap       7.18     0.929    0.129 
#>  3 Africa    Angola       lifeExp ~ gdpPercap   4.42     1.47     0.332 
#>  4 Africa    Angola       pop ~ gdpPercap      15.3      1.08     0.0706
#>  5 Africa    Benin        lifeExp ~ gdpPercap  -0.387   -0.225    0.581 
#>  6 Africa    Benin        pop ~ gdpPercap      -4.41    -0.824    0.187 
#>  7 Africa    Botswana     lifeExp ~ gdpPercap  -1.86    -0.493    0.265 
#>  8 Africa    Botswana     pop ~ gdpPercap      -0.652   -0.721    1.11  
#>  9 Africa    Burkina Faso lifeExp ~ gdpPercap  13.0      1.07     0.0822
#> 10 Africa    Burkina Faso pop ~ gdpPercap       8.58     1.04     0.121 
#>         p.value significance
#>           <dbl> <chr>       
#>  1 0.0338       *           
#>  2 0.0000299    ***         
#>  3 0.00130      **          
#>  4 0.0000000294 ***         
#>  5 0.707        ns          
#>  6 0.00132      **          
#>  7 0.0926       ns          
#>  8 0.529        ns          
#>  9 0.000000141  ***         
#> 10 0.00000637   ***         
#> # ... with 94 more rows

A more general version of this function (grouped_robustlm) will be implemented in future that will utilize the formula interface of robust::lmRob.

grouped_lm

A more general version of simple linear regression is stats::lm, implemented in grouped_lm:

library(groupedstats)
#> 
#> Attaching package: 'groupedstats'
#> The following objects are masked from 'package:ggstatsplot':
#> 
#>     movies_long, movies_wide
 
groupedstats::grouped_lm(
  data = mtcars,
  grouping.vars = cyl,        # grouping variable (just one in this case)
  formula = mpg ~ am*wt,      # note that this function takes a formula
  output = "tidy"             # tidy dataframe containing results
)
#> # A tibble: 12 x 9
#>      cyl term        estimate std.error t.value conf.low conf.high
#>    <dbl> <chr>          <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
#>  1     6 (Intercept)   63.6      14.1    4.51      18.7    109.   
#>  2     6 am           -41.4      19.0   -2.18    -102.      19.1  
#>  3     6 wt           -13.1       4.16  -3.16     -26.4      0.113
#>  4     6 am:wt         12.5       6.22   2.02      -7.26    32.4  
#>  5     4 (Intercept)   13.9      16.1    0.865    -24.1     51.9  
#>  6     4 am            30.3      17.2    1.77     -10.3     70.9  
#>  7     4 wt             3.07      5.44   0.564     -9.79    15.9  
#>  8     4 am:wt        -11.0       6.16  -1.78     -25.5      3.61 
#>  9     8 (Intercept)   25.1       3.51   7.14      17.2     32.9  
#> 10     8 am            -2.92     25.9   -0.113    -60.5     54.7  
#> 11     8 wt            -2.44      0.842 -2.90      -4.32    -0.563
#> 12     8 am:wt          0.439     7.63   0.0575   -16.6     17.4  
#>      p.value significance
#>        <dbl> <chr>       
#>  1 0.0204    *           
#>  2 0.117     ns          
#>  3 0.0511    ns          
#>  4 0.137     ns          
#>  5 0.416     ns          
#>  6 0.121     ns          
#>  7 0.590     ns          
#>  8 0.118     ns          
#>  9 0.0000315 ***         
#> 10 0.912     ns          
#> 11 0.0159    *           
#> 12 0.955     ns

The same function can also be used to get model summaries instead of a tidy dataframe containing results-

library(groupedstats)
 
groupedstats::grouped_lm(
  data = ggplot2::diamonds,
  grouping.vars = c(cut, color),            # grouping variables
  formula = price ~ carat * clarity,        # formula
  output = "glance"                         # dataframe with model summaries
)
#> # A tibble: 35 x 14
#>    cut       color r.squared adj.r.squared sigma statistic    df  logLik
#>    <ord>     <ord>     <dbl>         <dbl> <dbl>     <dbl> <int>   <dbl>
#>  1 Ideal     E         0.931         0.931  776.     3516.    16 -31501.
#>  2 Premium   E         0.927         0.927 1028.     1968.    16 -19516.
#>  3 Good      E         0.927         0.926  905.      781.    16  -7667.
#>  4 Premium   I         0.934         0.934 1303.     1338.    16 -12260.
#>  5 Good      J         0.946         0.943  885.      363.    15  -2511.
#>  6 Very Good J         0.957         0.957  862.      994.    16  -5537.
#>  7 Very Good I         0.946         0.945 1100.     1378.    16 -10132.
#>  8 Very Good H         0.940         0.939 1032.     1880.    16 -15237.
#>  9 Fair      E         0.917         0.912  883.      179.    14  -1830.
#> 10 Ideal     J         0.955         0.955  953.     1259.    16  -7409.
#>       AIC    BIC    deviance df.residual   p.value significance
#>     <dbl>  <dbl>       <dbl>       <int>     <dbl> <chr>       
#>  1 63036. 63142. 2340281648.        3887 0.        ***         
#>  2 39066. 39163. 2452401685.        2321 0.        ***         
#>  3 15369. 15451.  750592238.         917 0.        ***         
#>  4 24554. 24644. 2396212809.        1412 0.        ***         
#>  5  5054.  5114.  228805662.         292 2.70e-175 ***         
#>  6 11108. 11185.  492442592.         662 0.        ***         
#>  7 20297. 20384. 1436458762.        1188 0.        ***         
#>  8 30508. 30601. 1924549052.        1808 0.        ***         
#>  9  3690.  3741.  163677866.         210 8.70e-106 ***         
#> 10 14852. 14934.  798454816.         880 0.        ***         
#> # ... with 25 more rows

grouped_aov

A related function to stats::lm is stats::aov, which fits an analysis of variance model for each group. Contrast the output you get here with the previous output for the same model from grouped_lm function. The estimate in this case with be an effect size (either partial eta-squared or partial omega-squared).

library(groupedstats)
 
groupedstats::grouped_aov(
  data = mtcars,
  grouping.vars = cyl,                 # grouping variable (just one in this case)
  formula = mpg ~ am * wt,             # note that this function takes a formula
  output = "tidy"                      # tidy dataframe with results
)
#> # A tibble: 9 x 10
#>     cyl term  F.value   df1   df2 partial.etasq conf.low conf.high p.value
#>   <dbl> <chr>   <dbl> <dbl> <dbl>         <dbl>    <dbl>     <dbl>   <dbl>
#> 1     6 am    5.07        1     3      0.628      0         0.820   0.110 
#> 2     6 wt    5.91        1     3      0.663      0         0.835   0.0933
#> 3     6 am:wt 4.06        1     3      0.575      0         0.796   0.137 
#> 4     4 am    5.95        1     7      0.459      0         0.711   0.0448
#> 5     4 wt    4.59        1     7      0.396      0         0.676   0.0693
#> 6     4 am:wt 3.17        1     7      0.311      0         0.627   0.118 
#> 7     8 am    0.0456      1    10      0.00454    0         0.242   0.835 
#> 8     8 wt    8.45        1    10      0.458      0.0190    0.691   0.0156
#> 9     8 am:wt 0.00331     1    10      0.000330   0         0.0885  0.955 
#>   significance
#>   <chr>       
#> 1 ns          
#> 2 ns          
#> 3 ns          
#> 4 *           
#> 5 ns          
#> 6 ns          
#> 7 ns          
#> 8 *           
#> 9 ns

The same function can also be used to compute Tukey’s test of Honest Significant Differences (HSD). For example, we can check for differences in life expectancy between different continents for all years for which the gapminder survey was conducted:

library(groupedstats)
library(gapminder)
 
groupedstats::grouped_aov(
  data = gapminder::gapminder,
  grouping.vars = year,
  formula = lifeExp ~ continent,
  output = "tukey"
)
#> Note: The p-value is adjusted for the number of tests conducted at each level of the grouping variable.
#> 
#> # A tibble: 120 x 8
#>     year term      comparison       estimate conf.low conf.high adj.p.value
#>    <int> <chr>     <chr>               <dbl>    <dbl>     <dbl>       <dbl>
#>  1  1952 continent Americas-Africa     14.1      9.21     19.1     7.35e-12
#>  2  1952 continent Asia-Africa          7.18     2.66     11.7     2.11e- 4
#>  3  1952 continent Europe-Africa       25.3     20.6      29.9     1.01e-14
#>  4  1952 continent Oceania-Africa      30.1     15.5      44.7     7.12e- 7
#>  5  1952 continent Asia-Americas       -6.97   -12.3      -1.59    4.28e- 3
#>  6  1952 continent Europe-Americas     11.1      5.64     16.6     1.12e- 6
#>  7  1952 continent Oceania-Americas    16.0      1.07     30.9     2.91e- 2
#>  8  1952 continent Europe-Asia         18.1     13.0      23.2     5.87e-14
#>  9  1952 continent Oceania-Asia        22.9      8.17     37.7     3.16e- 4
#> 10  1952 continent Oceania-Europe       4.85    -9.97     19.7     8.95e- 1
#>    significance
#>    <chr>       
#>  1 ***         
#>  2 ***         
#>  3 ***         
#>  4 ***         
#>  5 **          
#>  6 ***         
#>  7 *           
#>  8 ***         
#>  9 ***         
#> 10 ns          
#> # ... with 110 more rows

Note that the p-value is adjusted adjusted for the number of tests conducted at each level of the grouping variable, and not across all tests conducted.

grouped_glm

The option to run generalized linear model (stats::glm) across different levels of the grouping variable is also implemented similarly-

groupedstats::grouped_glm(
 data = ggstatsplot::Titanic_full,
 formula = Survived ~ Sex,
 grouping.vars = Class,
 family = stats::binomial(link = "logit"),
 output = "tidy"
)
#> # A tibble: 8 x 9
#>   Class term        estimate std.error statistic conf.low conf.high
#>   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1 3rd   (Intercept)   -0.164     0.143     -1.14   -0.446     0.117
#> 2 3rd   SexMale       -1.40      0.185     -7.58   -1.77     -1.04 
#> 3 1st   (Intercept)    3.56      0.507      7.03    2.70      4.74 
#> 4 1st   SexMale       -4.21      0.531     -7.92   -5.42     -3.29 
#> 5 2nd   (Intercept)    1.97      0.296      6.65    1.43      2.60 
#> 6 2nd   SexMale       -3.79      0.366    -10.3    -4.54     -3.10 
#> 7 Crew  (Intercept)    1.90      0.619      3.06    0.827     3.34 
#> 8 Crew  SexMale       -3.15      0.625     -5.04   -4.60     -2.06 
#>    p.value significance
#>      <dbl> <chr>       
#> 1 2.54e- 1 ns          
#> 2 3.36e-14 ***         
#> 3 2.13e-12 ***         
#> 4 2.29e-15 ***         
#> 5 3.03e-11 ***         
#> 6 4.88e-25 ***         
#> 7 2.18e- 3 **          
#> 8 4.68e- 7 ***

Note that the statistic will either be a t (gaussian, e.g.) or a z (binomial, e.g.) value, based on the family of models used.

You can also get a model summary across all models with broom::glance methods-

groupedstats::grouped_glm(
 data = ggstatsplot::Titanic_full,
 formula = Survived ~ Sex,
 grouping.vars = Class,
 family = stats::binomial(link = "logit"),
 output = "glance"
)
#> # A tibble: 4 x 8
#>   Class null.deviance df.null logLik   AIC   BIC deviance df.residual
#>   <fct>         <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
#> 1 3rd            797.     705  -370.  744.  753.     740.         704
#> 2 1st            430.     324  -134.  272.  280.     268.         323
#> 3 2nd            387.     284  -112.  228.  235.     224.         283
#> 4 Crew           974.     884  -466.  936.  946.     932.         883

grouped_lmer

Linear mixed effects analyses (lme4::lmer) for all combinations of grouping variable levels can be carried out using grouped_lmer:

library(gapminder)
 
# getting tidy output of results
groupedstats::grouped_lmer(
  data = gapminder,
  formula = scale(lifeExp) ~ scale(gdpPercap) + (gdpPercap |
                                                   continent),
  grouping.vars = year,
  REML = FALSE,
  output = "tidy"
)
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> # A tibble: 24 x 10
#>     year effect term             estimate std.error t.value conf.low
#>    <int> <chr>  <chr>               <dbl>     <dbl>   <dbl>    <dbl>
#>  1  1952 fixed  (Intercept)         0.215     0.293   0.736   -0.358
#>  2  1952 fixed  scale(gdpPercap)    0.930     0.302   3.07     0.337
#>  3  1957 fixed  (Intercept)         0.254     0.342   0.743   -0.416
#>  4  1957 fixed  scale(gdpPercap)    0.815     0.282   2.89     0.262
#>  5  1962 fixed  (Intercept)         0.255     0.333   0.766   -0.397
#>  6  1962 fixed  scale(gdpPercap)    0.591     0.210   2.82     0.180
#>  7  1967 fixed  (Intercept)         0.249     0.361   0.689   -0.459
#>  8  1967 fixed  scale(gdpPercap)    0.387     0.120   3.24     0.153
#>  9  1972 fixed  (Intercept)         0.276     0.366   0.753   -0.442
#> 10  1972 fixed  scale(gdpPercap)    0.431     0.150   2.88     0.137
#>    conf.high p.value significance
#>        <dbl>   <dbl> <chr>       
#>  1     0.789 0.462   ns          
#>  2     1.52  0.00211 **          
#>  3     0.923 0.457   ns          
#>  4     1.37  0.00389 **          
#>  5     0.908 0.444   ns          
#>  6     1.00  0.00479 **          
#>  7     0.956 0.491   ns          
#>  8     0.622 0.00120 **          
#>  9     0.994 0.451   ns          
#> 10     0.724 0.00402 **          
#> # ... with 14 more rows
 
 
# getting tidy output of results
groupedstats::grouped_lmer(
  data = gapminder,
  formula = scale(lifeExp) ~ scale(gdpPercap) + (gdpPercap |
                                                   continent),
  grouping.vars = year,
  REML = FALSE,
  output = "glance"
)
#> # A tibble: 12 x 7
#>     year sigma logLik   AIC   BIC deviance df.residual
#>    <int> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int>
#>  1  1952 0.531  -124.  259.  277.     247.         136
#>  2  1957 0.540  -127.  266.  284.     254.         136
#>  3  1962 0.546  -128.  268.  285.     256.         136
#>  4  1967 0.552  -128.  269.  287.     257.         136
#>  5  1972 0.565  -132.  276.  294.     264.         136
#>  6  1977 0.582  -132.  277.  294.     265.         136
#>  7  1982 0.544  -121.  253.  271.     241.         136
#>  8  1987 0.498  -110.  231.  249.     219.         136
#>  9  1992 0.518  -117.  246.  264.     234.         136
#> 10  1997 0.498  -109.  231.  248.     219.         136
#> 11  2002 0.505  -113.  238.  255.     226.         136
#> 12  2007 0.525  -116.  244.  262.     232.         136

grouped_glmer

A more generalized version of lmer is implemented in lme4::glmer, which can also handle categorical/nominal data. For example, let’s say we want to see if sex of a person was predictive of whether they survived the Titanic tragedy.

# having a look at the data
dplyr::glimpse(x = groupedstats::Titanic_full)
#> Observations: 2,201
#> Variables: 5
#> $ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
#> $ Class    <fct> 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd...
#> $ Sex      <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,...
#> $ Age      <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
#> $ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
 
# running glmer model to get tidy output
groupedstats::grouped_glmer(
  formula = Survived ~ Age + (Age |
                                Class),
  data = groupedstats::Titanic_full,
  family = stats::binomial(link = "probit"),                  # choosing the appropriate GLM family
                          control = lme4::glmerControl(       # choosing appropriate control
                            optimizer = "Nelder_Mead",
                            boundary.tol = 1e-07,
                            calc.derivs = FALSE,
                            optCtrl = list(maxfun = 2e9)
                          ),
  grouping.vars = Sex,                                        # grouping variables (just one in this case)
  output = "tidy"
)
#> # A tibble: 4 x 10
#>   Sex    effect term        estimate std.error statistic conf.low conf.high
#>   <fct>  <chr>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1 Male   fixed  (Intercept)   -0.888     0.162     -5.47   -1.21     -0.570
#> 2 Male   fixed  AgeChild       4.90      4.03       1.22   -3.00     12.8  
#> 3 Female fixed  (Intercept)    1.01      0.379      2.66    0.264     1.75 
#> 4 Female fixed  AgeChild       1.47      0.595      2.48    0.307     2.64 
#>        p.value significance
#>          <dbl> <chr>       
#> 1 0.0000000453 ***         
#> 2 0.224        ns          
#> 3 0.00789      **          
#> 4 0.0133       *
 
# getting glmer model summaries (let's use the default family and control values)
groupedstats::grouped_glmer(
  formula = Survived ~ Age + (Age |
                                Class),
  grouping.vars = Sex,      
  data = groupedstats::Titanic_full,
  output = "glance"
)
#> # A tibble: 2 x 7
#>   Sex    sigma logLik   AIC   BIC deviance df.residual
#>   <fct>  <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int>
#> 1 Male       1  -860. 1730. 1757.    1698.        1726
#> 2 Female     1  -208.  426.  447.     400.         465

Note that the statistic will either be a t (gaussian, e.g.) or a z (binomial, e.g.) value, based on the family of models used.

grouped_proptest

This function helps carry out one-sample proportion tests (stats::chisq.test) with a unique variable for multiple grouping variables-

options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_proptest(data = datasets::mtcars,
                               grouping.vars = cyl,
                               measure = am)
#> # A tibble: 3 x 7
#>     cyl `0`    `1`    `Chi-squared`    df `p-value` significance
#>   <dbl> <chr>  <chr>          <dbl> <dbl>     <dbl> <chr>       
#> 1     6 57.14% 42.86%         0.143     1     0.705 ns          
#> 2     4 27.27% 72.73%         2.27      1     0.132 ns          
#> 3     8 85.71% 14.29%         7.14      1     0.008 **

grouped_ttest

This function can help you carry out t-tests, paired or independent, on multiple variables across multiple groups. Demonstrating how to use this function is going to first require getting the iris dataset into long format. Let’s say we want to investigate if Sepal part of the flower has greater measurement (length or width) than Petal part of the flower for each Iris species.

 
# converting the iris dataset to long format
iris_long <- datasets::iris %>%
  dplyr::mutate(.data = ., id = dplyr::row_number(x = Species)) %>%
  tidyr::gather(
    data = .,
    key = "condition",
    value = "value",
    Sepal.Length:Petal.Width,
    convert = TRUE,
    factor_key = TRUE
  ) %>%
  tidyr::separate(
    col = "condition",
    into = c("part", "measure"),
    sep = "\\.",
    convert = TRUE
  ) %>%
  tibble::as_data_frame(x = .)
 
# check the long format iris dataset
iris_long
#> # A tibble: 600 x 5
#>    Species    id part  measure value
#>    <fct>   <int> <chr> <chr>   <dbl>
#>  1 setosa      1 Sepal Length    5.1
#>  2 setosa      2 Sepal Length    4.9
#>  3 setosa      3 Sepal Length    4.7
#>  4 setosa      4 Sepal Length    4.6
#>  5 setosa      5 Sepal Length    5  
#>  6 setosa      6 Sepal Length    5.4
#>  7 setosa      7 Sepal Length    4.6
#>  8 setosa      8 Sepal Length    5  
#>  9 setosa      9 Sepal Length    4.4
#> 10 setosa     10 Sepal Length    4.9
#> # ... with 590 more rows
 
# checking if the Sepal part has different dimentions (value) than Petal part
# for each Species and for each type of measurement (Length and Width)
options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_ttest(
  data = iris_long,
  dep.vars = value,                    # dependent variable
  indep.vars = part,                   # independent variable
  grouping.vars = c(Species, measure), # for each Species and for each measurement
  paired = TRUE                        # paired t-test
)
#> # A tibble: 6 x 11
#>   Species    measure formula      method        t.test estimate conf.low
#>   <fct>      <chr>   <chr>        <chr>          <dbl>    <dbl>    <dbl>
#> 1 setosa     Length  value ~ part Paired t-test  -71.8   -3.54     -3.64
#> 2 versicolor Length  value ~ part Paired t-test  -34.0   -1.68     -1.78
#> 3 virginica  Length  value ~ part Paired t-test  -22.9   -1.04     -1.13
#> 4 setosa     Width   value ~ part Paired t-test  -61.0   -3.18     -3.29
#> 5 versicolor Width   value ~ part Paired t-test  -43.5   -1.44     -1.51
#> 6 virginica  Width   value ~ part Paired t-test  -23.1   -0.948    -1.03
#>   conf.high parameter  p.value significance
#>       <dbl>     <dbl>    <dbl> <chr>       
#> 1    -3.44         49 2.54e-51 ***         
#> 2    -1.58         49 9.67e-36 ***         
#> 3    -0.945        49 7.99e-28 ***         
#> 4    -3.08         49 7.21e-48 ***         
#> 5    -1.38         49 8.42e-41 ***         
#> 6    -0.866        49 5.34e-28 ***

grouped_wilcox

This function is just a non-parametric variant of the grouped_ttest:

# checking if the Sepal part has different dimentions (value) than Petal part
# for each Species and for each type of measurement (Length and Width)
options(tibble.width = Inf)            # show me all columns
 
groupedstats::grouped_wilcox(
  data = iris_long,
  dep.vars = value,                    # dependent variable
  indep.vars = part,                   # independent variable
  grouping.vars = c(Species, measure), # for each Species and for each measurement
  paired = TRUE                        # paired Wilcoxon signed rank test with continuity correction
)
#> # A tibble: 6 x 10
#>   Species    measure formula     
#>   <fct>      <chr>   <chr>       
#> 1 setosa     Length  value ~ part
#> 2 versicolor Length  value ~ part
#> 3 virginica  Length  value ~ part
#> 4 setosa     Width   value ~ part
#> 5 versicolor Width   value ~ part
#> 6 virginica  Width   value ~ part
#>   method                                               statistic estimate
#>   <chr>                                                    <dbl>    <dbl>
#> 1 Wilcoxon signed rank test with continuity correction         0   -3.55 
#> 2 Wilcoxon signed rank test with continuity correction         0   -1.70 
#> 3 Wilcoxon signed rank test with continuity correction         0   -1.05 
#> 4 Wilcoxon signed rank test with continuity correction         0   -3.15 
#> 5 Wilcoxon signed rank test with continuity correction         0   -1.45 
#> 6 Wilcoxon signed rank test with continuity correction         0   -0.950
#>   conf.low conf.high  p.value significance
#>      <dbl>     <dbl>    <dbl> <chr>       
#> 1    -3.65    -3.45  7.62e-10 ***         
#> 2    -1.80    -1.60  7.45e-10 ***         
#> 3    -1.15    -0.950 7.48e-10 ***         
#> 4    -3.25    -3.10  7.40e-10 ***         
#> 5    -1.55    -1.40  7.58e-10 ***         
#> 6    -1.00    -0.850 7.67e-10 ***

While we are at it, let’s also check out examples for t-test and Wilcox test in case of between-subjects designs.

We will use diamonds dataset from ggplot2 and will see if the price and depth of a diamond is different for two of our favorite colors (say E and J) for each type of clarity.

 
# subset the dataframe with two colors of interest to us
diamonds_short <-
  dplyr::filter(.data = ggplot2::diamonds, color == "E" |
                  color == "J")
 
options(tibble.width = Inf, tibble.print_max = Inf)             # show me all rows and columns
 
# t-test
groupedstats::grouped_ttest(
  data = diamonds_short,
  dep.vars = c(carat, price, depth),             # note that there three dependent variables 
  indep.vars = color,                            # and just one independent variable 
  grouping.vars = clarity,                       # one grouping variable
  paired = FALSE,
  var.equal = FALSE
)
#> # A tibble: 24 x 10
#>    clarity formula       method                   t.test   estimate
#>    <ord>   <chr>         <chr>                     <dbl>      <dbl>
#>  1 SI2     carat ~ color Welch Two Sample t-test -18.0      -0.503 
#>  2 SI1     carat ~ color Welch Two Sample t-test -22.1      -0.462 
#>  3 VS1     carat ~ color Welch Two Sample t-test -16.8      -0.444 
#>  4 VVS2    carat ~ color Welch Two Sample t-test -12.0      -0.553 
#>  5 VS2     carat ~ color Welch Two Sample t-test -24.6      -0.542 
#>  6 I1      carat ~ color Welch Two Sample t-test  -4.48     -0.644 
#>  7 VVS1    carat ~ color Welch Two Sample t-test  -6.53     -0.417 
#>  8 IF      carat ~ color Welch Two Sample t-test  -2.38     -0.198 
#>  9 SI2     price ~ color Welch Two Sample t-test -10.6   -2347.    
#> 10 SI1     price ~ color Welch Two Sample t-test -12.6   -2024.    
#> 11 VS1     price ~ color Welch Two Sample t-test  -9.26  -2028.    
#> 12 VVS2    price ~ color Welch Two Sample t-test  -6.83  -2643.    
#> 13 VS2     price ~ color Welch Two Sample t-test -14.3   -2560.    
#> 14 I1      price ~ color Welch Two Sample t-test  -2.76  -1766.    
#> 15 VVS1    price ~ color Welch Two Sample t-test  -3.35  -1814.    
#> 16 IF      price ~ color Welch Two Sample t-test   0.378   305.    
#> 17 SI2     depth ~ color Welch Two Sample t-test  -2.64     -0.231 
#> 18 SI1     depth ~ color Welch Two Sample t-test   0.333     0.0208
#> 19 VS1     depth ~ color Welch Two Sample t-test  -6.61     -0.417 
#> 20 VVS2    depth ~ color Welch Two Sample t-test  -2.20     -0.277 
#> 21 VS2     depth ~ color Welch Two Sample t-test  -2.28     -0.145 
#> 22 I1      depth ~ color Welch Two Sample t-test  -3.96     -2.10  
#> 23 VVS1    depth ~ color Welch Two Sample t-test  -2.53     -0.370 
#> 24 IF      depth ~ color Welch Two Sample t-test  -2.16     -0.386 
#>     conf.low  conf.high parameter   p.value significance
#>        <dbl>      <dbl>     <dbl>     <dbl> <chr>       
#>  1    -0.558    -0.448      634.  9.24e- 59 ***         
#>  2    -0.503    -0.420      948.  2.47e- 87 ***         
#>  3    -0.496    -0.392      663.  2.91e- 53 ***         
#>  4    -0.644    -0.461      140.  3.81e- 23 ***         
#>  5    -0.586    -0.499      856.  1.51e-101 ***         
#>  6    -0.932    -0.357       60.6 3.34e-  5 ***         
#>  7    -0.545    -0.290       76.1 6.68e-  9 ***         
#>  8    -0.364    -0.0318      60.1 2.03e-  2 *           
#>  9 -2782.    -1913.         676.  2.02e- 24 ***         
#> 10 -2339.    -1709.        1031.  5.75e- 34 ***         
#> 11 -2458.    -1598.         794.  1.86e- 19 ***         
#> 12 -3407.    -1878.         151.  1.94e- 10 ***         
#> 13 -2911.    -2209.         943.  3.07e- 42 ***         
#> 14 -3044.     -487.          62.0 7.58e-  3 **          
#> 15 -2893.     -735.          81.2 1.24e-  3 **          
#> 16 -1299.     1909.          81.0 7.07e-  1 ns          
#> 17    -0.402    -0.0588     772.  8.58e-  3 **          
#> 18    -0.102     0.143     1245.  7.39e-  1 ns          
#> 19    -0.541    -0.293     1040.  6.08e- 11 ***         
#> 20    -0.526    -0.0279     158.  2.95e-  2 *           
#> 21    -0.271    -0.0203    1084.  2.28e-  2 *           
#> 22    -3.16     -1.05        81.6 1.61e-  4 ***         
#> 23    -0.661    -0.0792      88.1 1.32e-  2 *           
#> 24    -0.739    -0.0321     111.  3.28e-  2 *
 
# wilcox test (aka Mann-Whitney U-test)
groupedstats::grouped_wilcox(
  data = diamonds_short,
  dep.vars = depth:price,                        # note that you can select variables in range with `:`
  indep.vars = color,                            # again, just one independent, multiple dependent variables case
  grouping.vars = clarity,                       # one grouping variable
  paired = FALSE
)
#> # A tibble: 16 x 9
#>    clarity formula       method                                           
#>    <ord>   <chr>         <chr>                                            
#>  1 SI2     depth ~ color Wilcoxon rank sum test with continuity correction
#>  2 SI1     depth ~ color Wilcoxon rank sum test with continuity correction
#>  3 VS1     depth ~ color Wilcoxon rank sum test with continuity correction
#>  4 VVS2    depth ~ color Wilcoxon rank sum test with continuity correction
#>  5 VS2     depth ~ color Wilcoxon rank sum test with continuity correction
#>  6 I1      depth ~ color Wilcoxon rank sum test with continuity correction
#>  7 VVS1    depth ~ color Wilcoxon rank sum test with continuity correction
#>  8 IF      depth ~ color Wilcoxon rank sum test with continuity correction
#>  9 SI2     price ~ color Wilcoxon rank sum test with continuity correction
#> 10 SI1     price ~ color Wilcoxon rank sum test with continuity correction
#> 11 VS1     price ~ color Wilcoxon rank sum test with continuity correction
#> 12 VVS2    price ~ color Wilcoxon rank sum test with continuity correction
#> 13 VS2     price ~ color Wilcoxon rank sum test with continuity correction
#> 14 I1      price ~ color Wilcoxon rank sum test with continuity correction
#> 15 VVS1    price ~ color Wilcoxon rank sum test with continuity correction
#> 16 IF      price ~ color Wilcoxon rank sum test with continuity correction
#>    statistic      estimate   conf.low     conf.high  p.value significance
#>        <dbl>         <dbl>      <dbl>         <dbl>    <dbl> <chr>       
#>  1   380600     -0.200        -0.300     -0.0000364 1.54e- 2 *           
#>  2   918892      0.0000393    -0.1000     0.100     6.77e- 1 ns          
#>  3   276743     -0.400        -0.500     -0.300     7.03e-12 ***         
#>  4    52618     -0.400        -0.600     -0.200     4.18e- 4 ***         
#>  5   816784.    -0.200        -0.300     -0.1000    8.86e- 5 ***         
#>  6     1570.    -2.00         -3.00      -1.000     1.21e- 4 ***         
#>  7    19455     -0.300        -0.500     -0.1000    5.07e- 3 **          
#>  8     3286     -0.300        -0.700     -0.0000251 4.79e- 2 *           
#>  9   269016  -1844.        -2182.     -1505.        8.82e-31 ***         
#> 10   608724. -1510.        -1789.     -1312.        8.20e-43 ***         
#> 11   277389  -1095.        -1414.      -745.        1.11e-11 ***         
#> 12    42795  -1764.        -2920.     -1205.        2.23e-10 ***         
#> 13   560930  -1888.        -2228.     -1539.        1.08e-54 ***         
#> 14     2017  -1150.        -1832.       -79.0       3.68e- 2 *           
#> 15    19086   -362.         -859.      -107.        2.57e- 3 **          
#> 16     5085    296.          125.       449.        4.94e- 3 **

We can further focus just on two levels of clarity to further elucidate another aspect of entering the arguments-

# subset the dataframe even further to just select two levels of clarity
diamonds_short2 <-
  dplyr::filter(.data = diamonds_short, clarity == "SI2" |
                  clarity == "SI1")
 
# wilcox test (aka Mann-Whitney U-test)
groupedstats::grouped_wilcox(
 data = diamonds_short2,
  dep.vars = c(carat, price),                    # two dependent variables
  indep.vars = c(color, clarity),                # two independent variables
  grouping.vars = cut,                           # one grouping variable
  paired = FALSE
)
#> # A tibble: 10 x 9
#>    cut       formula        
#>    <ord>     <chr>          
#>  1 Ideal     carat ~ color  
#>  2 Premium   carat ~ color  
#>  3 Good      carat ~ color  
#>  4 Very Good carat ~ color  
#>  5 Fair      carat ~ color  
#>  6 Ideal     price ~ clarity
#>  7 Premium   price ~ clarity
#>  8 Good      price ~ clarity
#>  9 Very Good price ~ clarity
#> 10 Fair      price ~ clarity
#>    method                                            statistic estimate
#>    <chr>                                                 <dbl>    <dbl>
#>  1 Wilcoxon rank sum test with continuity correction   103150.   -0.440
#>  2 Wilcoxon rank sum test with continuity correction    91748    -0.540
#>  3 Wilcoxon rank sum test with continuity correction    20752    -0.400
#>  4 Wilcoxon rank sum test with continuity correction    87590    -0.390
#>  5 Wilcoxon rank sum test with continuity correction     2356.   -0.230
#>  6 Wilcoxon rank sum test with continuity correction   344274.  774.   
#>  7 Wilcoxon rank sum test with continuity correction   329276.  876.   
#>  8 Wilcoxon rank sum test with continuity correction    64082.  516.   
#>  9 Wilcoxon rank sum test with continuity correction   272064   752.   
#> 10 Wilcoxon rank sum test with continuity correction     5113   170.   
#>    conf.low conf.high  p.value significance
#>       <dbl>     <dbl>    <dbl> <chr>       
#>  1   -0.500    -0.380 1.28e-51 ***         
#>  2   -0.600    -0.490 1.82e-59 ***         
#>  3   -0.500    -0.310 4.49e-18 ***         
#>  4   -0.460    -0.320 7.06e-37 ***         
#>  5   -0.370    -0.110 1.21e- 5 ***         
#>  6  536.     1075.    3.00e- 9 ***         
#>  7  538.     1278.    3.52e- 9 ***         
#>  8  141.      921.    3.05e- 3 **          
#>  9  486.     1098.    2.76e- 8 ***         
#> 10 -411.      802.    5.68e- 1 ns

In these examples, two things are worth noting that generalize to all functions in this package and stem from how tidy evaluation (https://adv-r.hadley.nz/evaluation.html) works:

  • If just one independent variable is provided for multiple dependent variables, it will be used as a common variable.
  • If you want to use a selection of variables, you need not use c().

Extending with purrr

groupedstats functions can be further extended with purrr package. For example, let’s say we want to run the same linear regression across multiple grouping variables but want to use different formulas-

set.seed(123)
library(groupedstats)
 
results_df <- purrr::pmap_dfr(
  .l = list(
    data = list(groupedstats::movies_long),
    grouping.vars = alist(c(mpaa, genre)), # note it's `alist` and not `list`
    formula = list(
      rating ~ budget,           # model 1
      rating ~ log(budget),      # model 2
      log(rating) ~ budget,      # model 3
      log(rating) ~ log(budget)  # model 4
    ),
    output = list("glance")      # return model diagnostics
  ),
  .f = groupedstats::grouped_lm, # regression model
  .id = "model"
) %>%  # for each combination of mpaa rating and movie genre
  dplyr::group_by(.data = ., mpaa, genre) %>% # arrange by best to worst fits
  dplyr::arrange(.data = ., dplyr::desc(adj.r.squared))
 
head(results_df)
#> # A tibble: 6 x 15
#> # Groups:   mpaa, genre [18]
#>   model mpaa  genre       r.squared adj.r.squared  sigma statistic    df
#>   <chr> <fct> <fct>           <dbl>         <dbl>  <dbl>     <dbl> <int>
#> 1 2     PG-13 Animation       0.474         0.369 0.824       4.51     2
#> 2 4     PG-13 Animation       0.447         0.337 0.138       4.05     2
#> 3 3     PG    Documentary     0.468         0.202 0.0532      1.76     2
#> 4 1     PG    Documentary     0.449         0.174 0.386       1.63     2
#> 5 4     PG-13 Romance         0.142         0.138 0.254      34.6      2
#> 6 2     PG-13 Romance         0.129         0.125 1.31       30.9      2
#>     logLik     AIC     BIC  deviance df.residual      p.value significance
#>      <dbl>   <dbl>   <dbl>     <dbl>       <int>        <dbl> <chr>       
#> 1   -7.40    20.8    20.6    3.39              5 0.0870       ns          
#> 2    5.09    -4.18   -4.35   0.0957            5 0.100        ns          
#> 3    7.45    -8.90  -10.7    0.00565           2 0.316        ns          
#> 4   -0.479    6.96    5.12   0.298             2 0.330        ns          
#> 5   -9.39    24.8    34.8   13.5             209 0.0000000162 ***         
#> 6 -356.     718.    728.   361.              209 0.0000000825 ***

Current code coverage

As the code stands right now, here is the code coverage for all primary functions involved: https://codecov.io/gh/IndrajeetPatil/groupedstats/tree/master/R

Contributing

I’m happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I personally prefer using the Github issues system over trying to reach out to me in other ways (personal e-mail, Twitter, etc.). Pull requests for contributions are encouraged.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Suggestions

If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/groupedstats/issues

News

groupedstats 0.0.3

MAJOR CHANGES

  • New functions exported: lm_effsize_ci(), signif_column(), specify_decimal_p().

groupedstats 0.0.2

MAJOR CHANGES

  • New functions added: grouped_lm(), grouped_aov(), grouped_lmer(), grouped_glmer().
  • To be consistent with the rest of the regression model functions, grouped_glm now takes a formula argument rather than dep.vars and indep.vars arguments.
  • The package no longer exports tidy eval functions; it's not part of the tidyverse.

MINOR CHANGES

  • For numeric variables, grouped_summary now also outputs standard error of mean and 95% confidence intervals for the mean.
  • Added datasets (Titanic_full, movies_long).

BUG FUXES

  • Some functions were not working if the grouping variable was called "group". Fixed that.

groupedstats 0.0.1

  • First version of the package.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.