Creating Groups from Data

Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling.


R package: Subsetting methods for balanced cross-validation, time series windowing, and general grouping and splitting of data.

By Ludvig R. Olsen, Cognitive Science, Aarhus University. Started in Oct. 2016

Contact at: [email protected]

Main functions:

  • group_factor
  • group
  • splt
  • partition
  • fold

Other tools:

  • find_starts
  • %staircase%
  • %primes%

Installation

CRAN version:

Development version:

install.packages("devtools") devtools::install_github("LudvigOlsen/groupdata2")

Vignettes

groupdata2 contains a number of vignettes with relevant use cases and descriptions.

vignette(package='groupdata2') # for an overview vignette("introduction_to_groupdata2") # begin here

Functions

Returns a factor with group numbers, e.g. (1,1,1,2,2,2,3,3,3).

This can be used to subset, aggregate, group_by, etc.

Create equally sized groups by setting force_equal = TRUE

Randomize grouping factor by setting randomize = TRUE

group()

Returns the given data as a dataframe with added grouping factor made with group_factor(). The dataframe is grouped by the grouping factor for easy use with dplyr pipelines.

splt()

Creates the specified groups with group_factor() and splits the given data by the grouping factor with base::split. Returns the splits in a list.

partition()

Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on one categorical variable and/or make sure that all datapoints sharing an ID is in the same partition.

fold()

Creates (optionally) balanced folds for use in cross-validation. Balance folds on one categorical variable and/or make sure that all datapoints sharing an ID is in the same fold.

Methods

There are currently 9 methods available. They can be divided into 5 categories.

Examples of group sizes are based on a vector with 57 elements.

Specify group size

Method: greedy

Divides up the data greedily given a specified group size.

E.g. group sizes: 10, 10, 10, 10, 10, 7

Specify number of groups

Method: n_dist (Default)

Divides the data into a specified number of groups and distributes excess data points across groups.

E.g. group sizes: 11, 11, 12, 11, 12

Method: n_fill

Divides the data into a specified number of groups and fills up groups with excess data points from the beginning.

E.g. group sizes: 12, 12, 11, 11, 11

Method: n_last

Divides the data into a specified number of groups. The algorithm finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size.

E.g. group sizes: 11, 11, 11, 11, 13

Method: n_rand

Divides the data into a specified number of groups. Excess data points are placed randomly in groups (only 1 per group).

E.g. group sizes: 12, 11, 11, 11, 12

Specify list

Method: l_sizes

Uses a list / vector of group sizes to divide up the data. Excess data points are placed in an extra group.

E.g. n = c(11, 11) returns group sizes: 11, 11, 35

Method: l_starts

Uses a list of starting positions to divide up the data. Starting positions are values in a vector (e.g. column in dataframe). Skip to a specific nth appearance of a value by using c(value, skip_to).

E.g. n = c(11, 15, 27, 43) returns group sizes: 10, 4, 12, 16, 15

Identical to n = list(11, 15, c(27, 1), 43) where 1 specifies that we want the first appearance of 27 after the previous value 15.

If passing n = 'auto' starting posititions are automatically found with find_starts().

Specify step size

Method: staircase

Uses step_size to divide up the data. Group size increases with 1 step for every group, until there is no more data.

E.g. group sizes: 5, 10, 15, 20, 7

Specify start at

Method: primes

Creates groups with sizes corresponding to prime numbers. Starts at n (prime number). Increases to the the next prime number until there is no more data.

E.g. group sizes: 5, 7, 11, 13, 17, 4

Examples

# Attach packages
library(groupdata2)
library(dplyr)
library(knitr)
# Create dataframe
df <- data.frame("x"=c(1:12),
  "species" = rep(c('cat','pig', 'human'), 4),
  "age" = sample(c(1:100), 12))

group()

# Using group()
group(df, n = 5, method = 'n_dist') %>%
  kable()
x species age .groups
1 cat 81 1
2 pig 64 1
3 human 48 2
4 cat 24 2
5 pig 60 3
6 human 1 3
7 cat 37 3
8 pig 74 4
9 human 76 4
10 cat 47 5
11 pig 83 5
12 human 68 5
 
# Using group() with dplyr pipeline to get mean age
df %>%
  group(n = 5, method = 'n_dist') %>%
  dplyr::summarise(mean_age = mean(age)) %>%
  kable()
.groups mean_age
1 72.50000
2 36.00000
3 32.66667
4 75.00000
5 66.00000
 
# Using group() with 'l_starts' method
# Starts group at the first 'cat', 
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df %>%
  group(n = list("cat", c("pig",2), "cat"), 
        method = 'l_starts',
        starts_col = "species") %>%
  kable()
x species age .groups
1 cat 81 1
2 pig 64 1
3 human 48 1
4 cat 24 1
5 pig 60 2
6 human 1 2
7 cat 37 3
8 pig 74 3
9 human 76 3
10 cat 47 3
11 pig 83 3
12 human 68 3

fold()

# Create dataframe
df <- data.frame(
  "participant" = factor(rep(c('1','2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(20,23,27,21,32,31), 3),
  "diagnosis" = rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3),
  "score" = c(10,24,15,35,24,14,24,40,30,50,54,25,45,67,40,78,62,30))
df <- df[order(df$participant),]
df$session <- rep(c('1','2', '3'), 6)
# Using fold() 
 
# First set seed to ensure reproducibility
set.seed(1)
 
# Use fold() with cat_col and id_col
df_folded <- fold(df, k = 3, cat_col = 'diagnosis',
                  id_col = 'participant', method = 'n_dist')
 
# Show df_folded ordered by folds
df_folded[order(df_folded$.folds),] %>%
  kable()
participant age diagnosis score session .folds
1 20 a 10 1 1
1 20 a 24 2 1
1 20 a 45 3 1
4 21 b 35 1 1
4 21 b 50 2 1
4 21 b 78 3 1
6 31 a 14 1 2
6 31 a 25 2 2
6 31 a 30 3 2
5 32 b 24 1 2
5 32 b 54 2 2
5 32 b 62 3 2
3 27 a 15 1 3
3 27 a 30 2 3
3 27 a 40 3 3
2 23 b 24 1 3
2 23 b 40 2 3
2 23 b 67 3 3
 
# Show distribution of diagnoses and participants
df_folded %>% 
  group_by(.folds) %>% 
  count(diagnosis, participant) %>% 
  kable()
.folds diagnosis participant n
1 a 1 3
1 b 4 3
2 a 6 3
2 b 5 3
3 a 3 3
3 b 2 3

Notice that the we now have the opportunity to include the session variable and/or use participant as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.

We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.

News

groupdata2 1.0.0

  • New main function: partition() - used for creating balanced partitions by partition sizes

  • New method category: l_ methods - n is passed as a list

  • New method: 'l_sizes' - Uses list of group sizes to create grouping factor. Can be used for partitioning (e.g. n = c(0.2, 0.3) returns 3 groups with 0.2 (20%), 0.3 (30%) and the exceeding 0.5 (50%) of the data points)

  • New method: 'l_starts' - Uses list of start positions to create groups. Define which values from a vector to start a new group at. Skip to later appearances of a value. Use n = 'auto' to automatically find starts using find_starts()

  • New helper tool: 'find_starts' - Finds start positions in a vector. I.e. values that differ from the previous value. Get the values or indices of the values. Output can be used as n in 'l_starts' method.

  • New helper tool: 'find_missing_starts' - Returns the start posititions that would be recursively removed when using the 'l_starts' with remove_missing_starts set to TRUE.

  • Added argument 'remove_missing_starts' to grouping functions. Recursively remove the starting positions not found with 'l_starts' method.

  • New method: 'primes' - similar to 'staircase' but with primes as steps (e.g. group sizes 2,3,5,7..)

  • New remainder tool: '%primes%' - similar to %staircase% but for the new primes method

groupdata2 0.1.0

  • Submitted package to CRAN

  • Main functions and tools of this version is group_factor(), group(), splt(), fold(), and %staircase%

groupdata2 0.0.0.9000

  • Created package :)

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("groupdata2")

1.3.0 by Ludvig Renbo Olsen, 3 months ago


https://github.com/ludvigolsen/groupdata2


Report a bug at https://github.com/ludvigolsen/groupdata2/issues


Browse source code at https://github.com/cran/groupdata2


Authors: Ludvig Renbo Olsen [aut, cre]


Documentation:   PDF Manual  


MIT + file LICENSE license


Imports checkmate, dplyr, numbers, lifecycle, plyr, purrr, rlang, stats, tibble, utils

Suggests broom, covr, ggplot2, hydroGOF, knitr, lmerTest, rmarkdown, testthat, tidyr, xpectr


Suggested by cvms.


See at CRAN