Fluid Data Transformations

Supplies higher-order coordinatized data specification and fluid transform operators that include pivot and anti-pivot as special cases. The methodology is described in Zumel, 2018, "Fluid data reshaping with 'cdata'", <http://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html>, doi:10.5281/zenodo.1173299. This package introduces the idea of control table specification of data transforms (later also adapted from 'cdata' by 'tidyr'). Works on in-memory data or on remote data using 'rquery' and 'SQL' database interfaces.


cdata is a general data re-shaper that has the great virtue of adhering to Raymond's "Rule of Representation", and using Codd's "Guaranteed Access Rule".

Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.

The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

Rule 2: The guaranteed access rule.

Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

Edgar F. Codd

The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.
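
As a small illustration of the guaranteed access rule in plain R, the sketch below looks up a single datum from a table, a key value, and a column name. It uses only base R; the helper get_datum() and the variable d are made up for this illustration and are not part of cdata.

```r
# give each iris row an explicit primary key
d <- data.frame(iris)
d$iris_id <- seq_len(nrow(d))

# hypothetical helper (not part of cdata):
# (table, primary key value, column name) -> one datum
get_datum <- function(table, key_value, column_name) {
  table[table$iris_id == key_value, column_name]
}

get_datum(d, 3, "Petal.Length")
#  [1] 1.3
```

Every value in the table is reachable this way, which is what lets a control table name values by key and column instead of by position.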

Briefly: cdata supplies data transform operators that:

  • Work on local data or with any DBI data source.
  • Are powerful generalizations of the operations commonly called pivot and un-pivot.
  • Allow for example-driven graphical specification of data transforms or data layout control.
  • Work in-memory or with SQL databases.

A quick example: plot iris petal and sepal dimensions in a faceted graph.

iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))
head(iris)
 #    Sepal.Length Sepal.Width Petal.Length Petal.Width Species iris_id
 #  1          5.1         3.5          1.4         0.2  setosa       1
 #  2          4.9         3.0          1.4         0.2  setosa       2
 #  3          4.7         3.2          1.3         0.2  setosa       3
 #  4          4.6         3.1          1.5         0.2  setosa       4
 #  5          5.0         3.6          1.4         0.2  setosa       5
 #  6          5.4         3.9          1.7         0.4  setosa       6
 
library("ggplot2")
library("cdata")
 
#
# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
  "flower_part", "Length"     , "Width"     |
    "Petal"    , Petal.Length , Petal.Width |
    "Sepal"    , Sepal.Length , Sepal.Width )
 
transform <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"))
 
# do the unpivot to convert the row records to block records
iris_aug <- iris %.>% transform
 
# show the transformed data
head(iris_aug)
 #    iris_id Species flower_part Length Width
 #  1       1  setosa       Petal    1.4   0.2
 #  2       1  setosa       Sepal    5.1   3.5
 #  3       2  setosa       Petal    1.4   0.2
 #  4       2  setosa       Sepal    4.9   3.0
 #  5       3  setosa       Petal    1.3   0.2
 #  6       3  setosa       Sepal    4.7   3.2
 
# plot the graph
ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) + 
  facet_wrap(~flower_part, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +  scale_color_brewer(palette = "Dark2")

 
# show the transform
print(transform)
 #  {
 #   row_record <- wrapr::qchar_frame(
 #     "iris_id"  , "Species", "Petal.Length", "Sepal.Length", "Petal.Width", "Sepal.Width" |
 #       .        , .        , Petal.Length  , Sepal.Length  , Petal.Width  , Sepal.Width   )
 #   row_keys <- c('iris_id', 'Species')
 #  
 #   # becomes
 #  
 #   block_record <- wrapr::qchar_frame(
 #     "iris_id"  , "Species", "flower_part", "Length"    , "Width"     |
 #       .        , .        , "Petal"      , Petal.Length, Petal.Width |
 #       .        , .        , "Sepal"      , Sepal.Length, Sepal.Width )
 #   block_keys <- c('iris_id', 'Species', 'flower_part')
 #  
 #   # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
 #  }
 
# show the representation of the transform
unclass(transform)
 #  $controlTable
 #    flower_part       Length       Width
 #  1       Petal Petal.Length Petal.Width
 #  2       Sepal Sepal.Length Sepal.Width
 #  
 #  $recordKeys
 #  [1] "iris_id" "Species"
 #  
 #  $controlTableKeys
 #  [1] "flower_part"
 #  
 #  $checkNames
 #  [1] TRUE
 #  
 #  $checkKeys
 #  [1] FALSE
 #  
 #  $strict
 #  [1] FALSE
 #  
 #  $allow_rqdatatable
 #  [1] TRUE

More details on the above example can be found here. A tutorial on how to design a controlTable can be found here. And some discussion of the nature of records in cdata can be found here.
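
Because the transform is specified as data, the same control table can also drive the reverse mapping. The following is a minimal sketch, assuming blocks_to_rowrecs_spec() accepts the same controlTable and recordKeys arguments as rowrecs_to_blocks_spec(); the names iris_keyed, to_blocks, to_rowrecs, iris_blocks, and iris_back are chosen only for this illustration.

```r
library("cdata")

iris_keyed <- data.frame(iris)
iris_keyed$iris_id <- seq_len(nrow(iris_keyed))

controlTable <- wrapr::qchar_frame(
  "flower_part", "Length"     , "Width"     |
    "Petal"    , Petal.Length , Petal.Width |
    "Sepal"    , Sepal.Length , Sepal.Width )

# row records -> block records
to_blocks <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"))

# the same control table, read in the other direction:
# block records -> row records
to_rowrecs <- blocks_to_rowrecs_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"))

iris_blocks <- iris_keyed %.>% to_blocks
iris_back <- iris_blocks %.>% to_rowrecs
```

The round trip should recover the original measurements, though row and column order may differ from the input.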


We can also exhibit a larger example of using cdata to create a scatter-plot matrix, or pair plot:

 
iris <- data.frame(iris)
iris$iris_id <- seq_len(nrow(iris))
 
library("ggplot2")
library("cdata")
 
# declare our columns of interest
meas_vars <- qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"
 
# build a control table with all pairs of variables as value columns
# and pair_key as the key column
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
                                       stringsAsFactors = FALSE))
# the first copy of the columns gives coordinate names; the second copy gives values
controlTable <- cbind(controlTable, controlTable)
# name the value columns value1 and value2
colnames(controlTable) <- qc(v1, v2, value1, value2)
transform <- rowrecs_to_blocks_spec(
  controlTable,
  recordKeys = c("iris_id", "Species"),
  controlTableKeys = qc(v1, v2),
  checkKeys = FALSE)
 
# do the unpivot to convert the row records to multiple block records
iris_aug <- iris %.>% transform
# alternate notation: layout_by(transform, iris)
 
 
ggplot(iris_aug, aes(x=value1, y=value2)) +
  geom_point(aes_string(color=category_variable, shape=category_variable)) + 
  facet_grid(v2~v1, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) + 
  xlab(NULL)

 
# show transform
print(transform)
 #  {
 #   row_record <- wrapr::qchar_frame(
 #     "iris_id"  , "Species", "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width" |
 #       .        , .        , Sepal.Length  , Sepal.Width  , Petal.Length  , Petal.Width   )
 #   row_keys <- c('iris_id', 'Species')
 #  
 #   # becomes
 #  
 #   block_record <- wrapr::qchar_frame(
 #     "iris_id"  , "Species", "v1"          , "v2"          , "value1"    , "value2"     |
 #       .        , .        , "Sepal.Length", "Sepal.Length", Sepal.Length, Sepal.Length |
 #       .        , .        , "Sepal.Width" , "Sepal.Length", Sepal.Width , Sepal.Length |
 #       .        , .        , "Petal.Length", "Sepal.Length", Petal.Length, Sepal.Length |
 #       .        , .        , "Petal.Width" , "Sepal.Length", Petal.Width , Sepal.Length |
 #       .        , .        , "Sepal.Length", "Sepal.Width" , Sepal.Length, Sepal.Width  |
 #       .        , .        , "Sepal.Width" , "Sepal.Width" , Sepal.Width , Sepal.Width  |
 #       .        , .        , "Petal.Length", "Sepal.Width" , Petal.Length, Sepal.Width  |
 #       .        , .        , "Petal.Width" , "Sepal.Width" , Petal.Width , Sepal.Width  |
 #       .        , .        , "Sepal.Length", "Petal.Length", Sepal.Length, Petal.Length |
 #       .        , .        , "Sepal.Width" , "Petal.Length", Sepal.Width , Petal.Length |
 #       .        , .        , "Petal.Length", "Petal.Length", Petal.Length, Petal.Length |
 #       .        , .        , "Petal.Width" , "Petal.Length", Petal.Width , Petal.Length |
 #       .        , .        , "Sepal.Length", "Petal.Width" , Sepal.Length, Petal.Width  |
 #       .        , .        , "Sepal.Width" , "Petal.Width" , Sepal.Width , Petal.Width  |
 #       .        , .        , "Petal.Length", "Petal.Width" , Petal.Length, Petal.Width  |
 #       .        , .        , "Petal.Width" , "Petal.Width" , Petal.Width , Petal.Width  )
 #   block_keys <- c('iris_id', 'Species', 'v1', 'v2')
 #  
 #   # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
 #  }

The above is now wrapped into a one-line command in WVPlots.
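
A sketch of that one-line form, assuming the command meant is WVPlots::PairPlot() and that it takes the data, the measurement columns, a title, and a group_var argument in that order (check the WVPlots documentation for the exact signature in your installed version):

```r
library("WVPlots")

# assumption: PairPlot(d, meas_vars, title, ..., group_var) as in the WVPlots docs
p <- WVPlots::PairPlot(
  iris,
  c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  "Iris dimensions",
  group_var = "Species")
print(p)
```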


The cdata package develops the idea of the "coordinatized data" theory and includes an implementation of the "fluid data" methodology.

The main cdata interfaces are given by the following set of methods:

Some convenience functions include:

  • pivot_to_rowrecs(), for moving data from multi-row block records with one value per row (a single column of values) to single-row records [spread or dcast].
  • pivot_to_blocks()/unpivot_to_blocks(), for moving data from single-row records to possibly multi-row block records with one row per value (a single column of values) [gather or melt].
  • wrapr::qchar_frame(), a helper function for specifying record control table layouts.
  • wrapr::build_frame(), a helper function for specifying data frames.
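
The convenience functions above can be sketched on a tiny table. This is an illustration under stated assumptions, not a definitive usage guide: the data values and the column names id, AUC, R2, measure, and value are invented for the example, and the argument names follow the cdata documentation as we understand it.

```r
library("cdata")

# a small made-up data frame, typed in with wrapr::build_frame()
d <- wrapr::build_frame(
  "id", "AUC", "R2" |
  1   , 0.7  , 0.4  |
  2   , 0.8  , 0.5  )

# single-row records -> blocks with one value per row (gather / melt style)
d_long <- unpivot_to_blocks(
  d,
  nameForNewKeyColumn = "measure",
  nameForNewValueColumn = "value",
  columnsToTakeFrom = c("AUC", "R2"))

# blocks -> single-row records (spread / dcast style)
d_wide <- pivot_to_rowrecs(
  d_long,
  columnToTakeKeysFrom = "measure",
  columnToTakeValuesFrom = "value",
  rowKeyColumns = "id")

print(d_long)
print(d_wide)
```

These helpers cover the common single-value-column case; the control table specifications shown earlier handle the general multi-column layouts.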

The package vignettes can be found in the "Articles" tab of the cdata documentation site.

The (older) recommended tutorial is: Fluid data reshaping with cdata. We also have an (older) short free cdata screencast (and another example can be found here). These concepts were later adapted from cdata by the tidyr package.


Install via CRAN:

install.packages("cdata")

Note: cdata is targeted at data with "tame column names" (column names that are valid both in databases, and as R unquoted variable names) and basic types (column values that are simple R types such as character, numeric, logical, and so on).
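
One rough way to pre-check a data frame against this note is sketched below in base R. The helper looks_tame() is hypothetical and not part of cdata. It only tests the R half of the condition (names that make.names() leaves unchanged, and atomic columns); whether a name is also acceptable to a particular database is not checked here and depends on that database.

```r
# hypothetical helper (not part of cdata): rough "tameness" check
looks_tame <- function(d) {
  nms <- colnames(d)
  # names that are already valid, unique, unquoted R names
  names_ok <- all(nms == make.names(nms, unique = TRUE))
  # columns holding simple atomic values (no list columns)
  types_ok <- all(vapply(d, is.atomic, logical(1)))
  names_ok && types_ok
}

looks_tame(iris)
looks_tame(data.frame(`a b` = 1, check.names = FALSE))
```

The first call is expected to pass and the second to fail, since a column name containing a space needs quoting in R.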

News

cdata 1.1.0 2019/04/27

  • Switch to rqdatatable implementation.
  • General transform specification.
  • More care with factors.
  • Update vignettes.
  • More tests with factors and dates/times.

cdata 1.0.9 2019/04/20

  • "layout" commands.
  • Deal better with duplicate entries in db-version of blocks to rows.
  • Move to wrapr draw_framec().
  • Fix typo in general transform example code.

cdata 1.0.8 2019/03/30

  • More column collision checks.
  • Operator notation.

cdata 1.0.7 2019/03/23

  • Move to wrapr tests.
  • Better error messages.
  • Better handling of NA in row-dup check.

cdata 1.0.6 2019/02/14

  • More generality in control table keys.
  • Move to RUnit.
  • Less direct data.table.

cdata 1.0.5 2019/01/20

  • Unify S3 method signatures to allow generic programming over them.
  • Generic record to record transform.
  • Move more functions from DBI to rquery.

cdata 1.0.4 2019/01/04

  • More vignettes.
  • Improve doc cross-linking.
  • Switch to new f_df signature.

cdata 1.0.3 2018/10/20

  • Fix ragged gather bug.
  • More argument checking.

cdata 1.0.2 2018/10/08

  • Change defaults.
  • Some bug fixes.

cdata 1.0.1 2018/09/22

  • Clean up suggests.

cdata 1.0.0 2018/09/08

  • Neaten up uniqueness checking.

cdata 0.7.4 2018/08/16

  • rquery extension (moving methods to S3).
  • Documentation fixes.

cdata 0.7.3 2018/07/20

  • Documentation fixes.

cdata 0.7.2 2018/07/07

  • Switch local ops to data.table implementation.
  • Re-export more of wrapr.
  • Move db fns to rquery.

cdata 0.7.1 2018/06/16

  • Documentation fixes.
  • Don't export cols().
  • Reduce wrapr re-export.
  • More rows in qlook().

cdata 0.7.0 2018/04/09

  • Narrow dependencies.
  • Switch to dbExecute() (sparklyr seems to have that now).
  • Non-DB implementations for local data case.
  • Remove deprecated fns.

cdata 0.6.0 2018/03/12

  • Add cols() method.
  • Add doi link in DESCRIPTION (CRAN request).
  • Use build_frame(), draw_frame(), and qchar_frame (quoted frame) from wrapr 1.3.0.

cdata 0.5.2 2018/01/20

  • Remove append based row binding (seems to have some issues on Spark).
  • Deprecate old methods.

cdata 0.5.1 2018/01/03

  • New naming convention.
  • Doc fixes.
  • Better table lifetime controls.
  • Move to wrapr 1.0.2.
  • Move grepdf out of package.
  • Add row binder.
  • Add map_fields.
  • Add winvector_temp_db_handle support.

cdata 0.5.0 2017/11/13

  • Query-based re-implementation.
  • Fluid data workflow.
  • Remove dplyr and tidyr dependence.

cdata 0.1.7 2017/10/31

  • Better error msgs.

cdata 0.1.6 2017/10/12

  • Work around empty keyset issues.
  • Add column control.

cdata 0.1.5 2017/07/04

  • Allow NA in key columns.
  • Add optional class annotation when moving values to rows.

cdata 0.1.1 2017/05/05

  • Ungroup before calculating distinct.

cdata 0.1.0 2017/03/28

  • First release.


1.1.0 by John Mount, 2 months ago


https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/


Report a bug at https://github.com/WinVector/cdata/issues


Browse source code at https://github.com/cran/cdata


Authors: John Mount [aut, cre] , Nina Zumel [aut] , Win-Vector LLC [cph]




GPL-3 license


Imports wrapr, rquery, methods, stats

Suggests rqdatatable, DBI, RSQLite, knitr, RUnit


Imported by WVPlots.

