Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.
R package: Subsetting methods for balanced cross-validation, time series windowing, and general grouping and splitting of data.
By Ludvig R. Olsen, Cognitive Science, Aarhus University. Started in Oct. 2016
Contact at: [email protected]
groupdata2 contains a number of vignettes with relevant use cases and descriptions.
vignette(package='groupdata2') # for an overview
vignette("introduction_to_groupdata2") # begin here
group_factor() - Returns a factor with group numbers, e.g. (1,1,1,2,2,2,3,3,3).
This can be used to subset, aggregate, group_by, etc.
Create equally sized groups by setting force_equal = TRUE
Randomize grouping factor by setting randomize = TRUE
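To make the idea of a grouping factor concrete, here is a minimal base R sketch. The factor is built by hand with rep() rather than with the package, and the data values are made up for illustration.

```r
# Hypothetical data: 9 values to be divided into 3 groups of 3
values <- c(4, 8, 6, 1, 3, 5, 10, 20, 30)

# A grouping factor of the form (1,1,1,2,2,2,3,3,3), built by hand here
groups <- factor(rep(1:3, each = 3))

# Aggregate: mean of the values within each group
group_means <- tapply(values, groups, mean)

# Subset: only the values belonging to group 2
group_2 <- values[groups == 2]

# Split: a list with one element per group
splits <- split(values, groups)

print(group_means)
```

The same factor can be added as a column to a dataframe and passed to dplyr::group_by().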
group() - Returns the given data as a dataframe with an added grouping factor made with group_factor(). The dataframe is grouped by the grouping factor for easy use with dplyr pipelines.
splt() - Creates the specified groups with group_factor() and splits the given data by the grouping factor with base::split. Returns the splits in a list.
partition() - Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on one categorical variable and/or make sure that all datapoints sharing an ID are in the same partition.
fold() - Creates (optionally) balanced folds for use in cross-validation. Balance folds on one categorical variable and/or make sure that all datapoints sharing an ID are in the same fold.
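As an illustration of what balancing on a category while keeping IDs together means, here is a rough base R sketch. It is not the package's implementation. The data, the column names and the helper variables are assumptions made for the example.

```r
set.seed(1)

# Hypothetical data: 6 participants with 3 rows each and 2 diagnoses
df <- data.frame(
  participant = rep(1:6, each = 3),
  diagnosis = rep(c("a", "b", "a", "b", "b", "a"), each = 3)
)
k <- 3

# One row per participant, so folds are assigned to IDs and not to rows
ids <- unique(df[, c("participant", "diagnosis")])
ids$fold <- NA_integer_

# Within each diagnosis, spread the participants over the k folds in random order
for (d in unique(ids$diagnosis)) {
  rows <- which(ids$diagnosis == d)
  fold_numbers <- rep_len(seq_len(k), length(rows))
  ids$fold[rows] <- fold_numbers[sample.int(length(rows))]
}

# Copy the fold of each participant back onto all of their rows
df_folded <- merge(df, ids, by = c("participant", "diagnosis"))

print(table(df_folded$diagnosis, df_folded$fold))
```

Because folds are assigned per participant, no participant is spread over several folds, and each fold receives one participant from each diagnosis in this small example.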
There are currently 9 methods available. They can be divided into 5 categories.
Examples of group sizes are based on a vector with 57 elements.
'greedy' - Divides up the data greedily given a specified group size.
E.g. group sizes: 10, 10, 10, 10, 10, 7
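The greedy rule can be sketched in base R as follows. The function name greedy_sizes is hypothetical and the code is only meant to reproduce the group sizes listed above, not to mirror the package internals.

```r
# Fill each group up to group_size before starting the next one;
# the last group gets whatever is left
greedy_sizes <- function(n_elements, group_size) {
  groups <- ceiling(seq_len(n_elements) / group_size)
  tabulate(groups)
}

print(greedy_sizes(57, 10))
# 10 10 10 10 10 7
```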
'n_dist' - Divides the data into a specified number of groups and distributes excess data points across groups.
E.g. group sizes: 11, 11, 12, 11, 12
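One formula that reproduces this distribution is sketched below in base R. It is an assumption that this matches how the package computes it. The function name dist_sizes is made up for the example.

```r
# Map element i to group ceiling(i * n_groups / n_elements),
# which spreads the excess elements across the groups
dist_sizes <- function(n_elements, n_groups) {
  groups <- ceiling(seq_len(n_elements) * n_groups / n_elements)
  tabulate(groups, nbins = n_groups)
}

print(dist_sizes(57, 5))
# 11 11 12 11 12
```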
'n_fill' - Divides the data into a specified number of groups and fills up groups with excess data points from the beginning.
E.g. group sizes: 12, 12, 11, 11, 11
'n_last' - Divides the data into a specified number of groups. The algorithm finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size.
E.g. group sizes: 11, 11, 11, 11, 13
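The two preceding rules (excess filled in from the beginning, and excess placed in the last group) can be sketched in base R like this. Both function names are hypothetical and the code only aims to reproduce the example sizes.

```r
# Excess elements go to the first groups, one each
fill_from_start_sizes <- function(n_elements, n_groups) {
  base_size <- n_elements %/% n_groups
  excess <- n_elements %% n_groups
  base_size + as.integer(seq_len(n_groups) <= excess)
}

# All groups get the base size; the last group also takes the excess
last_group_sizes <- function(n_elements, n_groups) {
  base_size <- n_elements %/% n_groups
  c(rep(base_size, n_groups - 1), n_elements - base_size * (n_groups - 1))
}

print(fill_from_start_sizes(57, 5))
# 12 12 11 11 11
print(last_group_sizes(57, 5))
# 11 11 11 11 13
```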
'n_rand' - Divides the data into a specified number of groups. Excess data points are placed randomly in groups (at most 1 per group).
E.g. group sizes: 12, 11, 11, 11, 12
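A base R sketch of the random placement could look like the following. The function name random_excess_sizes is hypothetical, and which groups receive the excess depends on the seed.

```r
set.seed(1)

# Every group gets the base size; the excess elements are handed out
# to randomly chosen groups, at most one per group
random_excess_sizes <- function(n_elements, n_groups) {
  base_size <- n_elements %/% n_groups
  excess <- n_elements %% n_groups
  sizes <- rep(base_size, n_groups)
  lucky <- sample.int(n_groups, excess)
  sizes[lucky] <- sizes[lucky] + 1
  sizes
}

sizes <- random_excess_sizes(57, 5)
print(sizes)
```

With 57 elements and 5 groups, two of the groups end up with 12 elements and the other three with 11.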
'l_sizes' - Uses a list / vector of group sizes to divide up the data. Excess data points are placed in an extra group.
E.g. n = c(11, 11) returns group sizes: 11, 11, 35
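The handling of the excess group can be sketched in base R as follows. The function name sizes_with_remainder is hypothetical, and the sketch assumes the given sizes do not exceed the number of elements.

```r
# Append an extra group holding whatever the given sizes do not cover
sizes_with_remainder <- function(n_elements, sizes) {
  remainder <- n_elements - sum(sizes)
  if (remainder > 0) c(sizes, remainder) else sizes
}

all_sizes <- sizes_with_remainder(57, c(11, 11))
print(all_sizes)
# 11 11 35

# Turn the sizes into a grouping factor with one group number per element
groups <- factor(rep(seq_along(all_sizes), times = all_sizes))
```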
'l_starts' - Uses a list of starting positions to divide up the data. Starting positions are values in a vector (e.g. a column in a dataframe). Skip to a specific nth appearance of a value by using c(value, skip_to).
E.g. n = c(11, 15, 27, 43) returns group sizes: 10, 4, 12, 16, 15
Identical to n = list(11, 15, c(27, 1), 43) where 1 specifies that we want the first appearance of 27 after the previous value 15.
If passing n = 'auto', starting positions are automatically found with find_starts().
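Grouping by start values can be sketched in base R as shown below. This is a simplification and not the package's code: it uses the first appearance of each start value in the whole vector and does not implement skip_to. Both helper names are hypothetical. The second helper mimics automatic detection by treating every value that differs from the previous one as a start.

```r
# Assign group numbers given the indices at which new groups start
groups_from_start_indices <- function(n_elements, start_idx) {
  is_start <- seq_len(n_elements) %in% start_idx
  groups <- cumsum(is_start)
  # Elements before the first start form a group of their own
  if (!is_start[1]) groups <- groups + 1
  groups
}

# Indices of values that differ from the previous value
find_start_indices <- function(v) {
  which(c(TRUE, v[-1] != v[-length(v)]))
}

# Start values given explicitly
v <- 1:57
start_idx <- match(c(11, 15, 27, 43), v)
sizes <- tabulate(groups_from_start_indices(length(v), start_idx))
print(sizes)
# 10 4 12 16 15

# Start positions found automatically
species <- c("cat", "cat", "pig", "pig", "pig", "cat")
auto_idx <- find_start_indices(species)
auto_groups <- groups_from_start_indices(length(species), auto_idx)
print(auto_groups)
# 1 1 2 2 2 3
```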
'staircase' - Uses step_size to divide up the data. The group size increases by one step_size for every group, until there is no more data.
E.g. group sizes: 5, 10, 15, 20, 7
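A small base R sketch of the staircase rule, using a step size of 5 as in the example sizes. The function name staircase_sizes is hypothetical.

```r
# Group k has size k * step_size; the last group takes the remainder
staircase_sizes <- function(n_elements, step_size) {
  sizes <- c()
  remaining <- n_elements
  k <- 1
  while (remaining > 0) {
    size <- min(k * step_size, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    k <- k + 1
  }
  sizes
}

print(staircase_sizes(57, 5))
# 5 10 15 20 7
```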
'primes' - Creates groups with sizes corresponding to prime numbers. Starts at n (a prime number) and increases to the next prime number until there is no more data.
E.g. group sizes: 5, 7, 11, 13, 17, 4
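The prime-sized groups can be sketched in base R along the same lines. Both helper names are hypothetical, and the primality check is a simple trial division that is only meant for small numbers.

```r
# Trial division primality check, fine for small group sizes
is_prime <- function(x) {
  if (x < 2) return(FALSE)
  if (x < 4) return(TRUE)
  all(x %% 2:floor(sqrt(x)) != 0)
}

# Group sizes follow consecutive primes from `start`;
# the last group takes the remainder
prime_sizes <- function(n_elements, start) {
  stopifnot(is_prime(start))
  sizes <- c()
  remaining <- n_elements
  p <- start
  while (remaining > 0) {
    size <- min(p, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    # Move on to the next prime number
    p <- p + 1
    while (!is_prime(p)) p <- p + 1
  }
  sizes
}

print(prime_sizes(57, 5))
# 5 7 11 13 17 4
```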
# Attach packages
library(groupdata2)
library(dplyr)
library(knitr)

# Create dataframe
df <- data.frame(
  "x" = c(1:12),
  "species" = rep(c('cat', 'pig', 'human'), 4),
  "age" = sample(c(1:100), 12)
)

# Using group()
group(df, n = 5, method = 'n_dist') %>%
  kable()

# Using group() with dplyr pipeline to get mean age
df %>%
  group(n = 5, method = 'n_dist') %>%
  dplyr::summarise(mean_age = mean(age)) %>%
  kable()

# Using group() with 'l_starts' method
# Starts group at the first 'cat',
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df %>%
  group(n = list("cat", c("pig", 2), "cat"),
        method = 'l_starts',
        starts_col = "species") %>%
  kable()
# Create dataframe
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(20, 23, 27, 21, 32, 31), 3),
  "diagnosis" = rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3),
  "score" = c(10, 24, 15, 35, 24, 14, 24, 40, 30, 50, 54, 25, 45, 67, 40, 78, 62, 30)
)
df <- df[order(df$participant), ]
df$session <- rep(c('1', '2', '3'), 6)

# Using fold()
# First set seed to ensure reproducibility
set.seed(1)

# Use fold() with cat_col and id_col
df_folded <- fold(df, k = 3, cat_col = 'diagnosis',
                  id_col = 'participant', method = 'n_dist')

# Show df_folded ordered by folds
df_folded[order(df_folded$.folds), ] %>%
  kable()

# Show distribution of diagnoses and participants
df_folded %>%
  group_by(.folds) %>%
  count(diagnosis, participant) %>%
  kable()
Notice that we now have the opportunity to include the session variable and/or use participant as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.
We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.
New main function: partition() - used for creating balanced partitions by partition sizes
New method category: l_ methods - n is passed as a list
New method: 'l_sizes' - Uses a list of group sizes to create the grouping factor. Can be used for partitioning (e.g. n = c(0.2, 0.3) returns 3 groups with 0.2 (20%), 0.3 (30%) and the remaining 0.5 (50%) of the data points)
New method: 'l_starts' - Uses list of start positions to create groups. Define which values from a vector to start a new group at. Skip to later appearances of a value. Use n = 'auto' to automatically find starts using find_starts()
New helper tool: 'find_starts' - Finds start positions in a vector. I.e. values that differ from the previous value. Get the values or indices of the values. Output can be used as n in 'l_starts' method.
New helper tool: 'find_missing_starts' - Returns the start positions that would be recursively removed when using the 'l_starts' method with remove_missing_starts set to TRUE.
Added argument 'remove_missing_starts' to grouping functions. Recursively remove the starting positions not found with 'l_starts' method.
New method: 'primes' - similar to 'staircase' but with primes as steps (e.g. group sizes 2,3,5,7..)
New remainder tool: '%primes%' - similar to %staircase% but for the new primes method
Submitted package to CRAN
Main functions and tools of this version are group_factor(), group(), splt(), fold(), and %staircase%