Fluid Data Transformations

Supplies higher-order coordinatized data specification and fluid transform operators that include pivot and anti-pivot as special cases. The methodology is described in 'Zumel', 2018, "Fluid data reshaping with 'cdata'", <http://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html>, doi:10.5281/zenodo.1173299. This package introduces the idea of control table specification of data transforms (later also adapted from 'cdata' by 'tidyr'). Works on in-memory data or on remote data using 'rquery' and 'SQL' database interfaces.


cdata is a general data re-shaper that has the great virtue of adhering to the so-called "Rule of Representation":

"Fold knowledge into data, so program logic can be stupid and robust."
The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

Briefly: cdata supplies data transform operators that:

  • Work on local data or with any DBI data source.
  • Are powerful generalizations of the operations commonly called pivot and un-pivot.
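
To make the "special cases" concrete, here is a minimal sketch of a plain un-pivot followed by a plain pivot on local data, using the convenience wrappers unpivot_to_blocks() and pivot_to_rowrecs(). The data frame d_wide and its values are made up for illustration.

```r
library("cdata")

# a small wide table: one row per record (illustrative data only)
d_wide <- data.frame(
  id = c(1, 2),
  AUC = c(0.6, 0.7),
  R2 = c(0.2, 0.3),
  stringsAsFactors = FALSE)

# un-pivot: move the AUC and R2 columns into (meas, val) key/value rows;
# columns not listed in columnsToTakeFrom (here id) are copied along
d_tall <- unpivot_to_blocks(
  d_wide,
  nameForNewKeyColumn = "meas",
  nameForNewValueColumn = "val",
  columnsToTakeFrom = c("AUC", "R2"))

# pivot: rebuild one row per id from the key/value rows
d_back <- pivot_to_rowrecs(
  d_tall,
  columnToTakeKeysFrom = "meas",
  columnToTakeValuesFrom = "val",
  rowKeyColumns = "id")
```

Both calls are thin conveniences: the more general rowrecs_to_blocks() and blocks_to_rowrecs() take an explicit control table instead of column lists.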

A quick example: plot iris petal and sepal dimensions in a faceted graph.

iris <- data.frame(iris)
 
library("ggplot2")
library("cdata")
# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
   flower_part, Length      , Width       |
   Petal    , Petal.Length, Petal.Width |
   Sepal    , Sepal.Length, Sepal.Width )
 
# do the unpivot to convert the row records to block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = c("Species"))
 
 
ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) + 
  facet_wrap(~flower_part, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +  scale_color_brewer(palette = "Dark2")

More details on the above example can be found here. A tutorial on how to design a controlTable can be found here. And some discussion of the nature of records in cdata can be found here.
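
Because the control table is ordinary data, it can be built with data.frame() and reused to drive the inverse transform. The following is a sketch under assumptions: the two-row data frame d and its id column are hypothetical stand-ins for iris with an explicit record key, which the inverse step needs in order to know which block rows belong together.

```r
library("cdata")

# a tiny stand-in for iris with an explicit row id (hypothetical data)
d <- data.frame(
  id = c(1, 2),
  Petal.Length = c(1.4, 4.7),
  Petal.Width = c(0.2, 1.4),
  Sepal.Length = c(5.1, 7.0),
  Sepal.Width = c(3.5, 3.2),
  stringsAsFactors = FALSE)

# the control table is just a data frame: the first column holds key values,
# the remaining cells name the source columns of the row record
controlTable <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length = c("Petal.Length", "Sepal.Length"),
  Width = c("Petal.Width", "Sepal.Width"),
  stringsAsFactors = FALSE)

# row records -> block records: each id becomes a two-row block
blocks <- rowrecs_to_blocks(
  d,
  controlTable,
  columnsToCopy = "id")

# block records -> row records: the same control table drives the inverse
rows <- blocks_to_rowrecs(
  blocks,
  keyColumns = "id",
  controlTable = controlTable)
```

The point of the design is that the shape of the transform lives in controlTable, so changing the reshaping means editing data rather than code.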


We can also exhibit a larger example of using cdata to create a scatter-plot matrix, or pair plot:

 
iris <- data.frame(iris)
 
library("ggplot2")
library("cdata")
 
# declare our columns of interest
meas_vars <- wrapr::qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"
 
# build a control with all pairs of variables as value columns
# and pair_key as the key column
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
                                       stringsAsFactors = FALSE))
# name the value columns value1 and value2
colnames(controlTable) <- wrapr::qc(value1, value2)
# insert first, or key column
controlTable <- cbind(
  data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
             stringsAsFactors = FALSE),
  controlTable)
 
 
# do the unpivot to convert the row records to multiple block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = category_variable)
 
# unpack the key column into two variable keys for the facet_grid
splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$v1 <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$v2 <- vapply(splt, function(si) si[[2]], character(1))
 
 
ggplot(iris_aug, aes(x=value1, y=value2)) +
  geom_point(aes(color = .data[[category_variable]],
                 shape = .data[[category_variable]])) +
  facet_grid(v2~v1, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) + 
  xlab(NULL)

The above is now wrapped into a one-line command in WVPlots.

And a quick database example:

library("cdata")
library("rquery")
 
use_spark <- FALSE
 
if(use_spark) {
  my_db <- sparklyr::spark_connect(version='2.2.0', 
                                   master = "local")
} else {
  my_db <- DBI::dbConnect(RSQLite::SQLite(),
                          ":memory:")
}
 
 
 
# pivot example
d <- wrapr::build_frame(
   "meas", "val" |
   "AUC" , 0.6   |
   "R2"  , 0.2   )
DBI::dbWriteTable(my_db,
                  'd',
                  d,
                  temporary = TRUE)
rstr(my_db, 'd')
 #  table `d` SQLiteConnection 
 #   nrow: 2 
 #  'data.frame':   2 obs. of  2 variables:
 #   $ meas: chr  "AUC" "R2"
 #   $ val : num  0.6 0.2
td <- db_td(my_db, "d")
td
 #  [1] "table(`d`; meas, val)"
 
cT <- td %.>%
  build_pivot_control(.,
                      columnToTakeKeysFrom= 'meas',
                      columnToTakeValuesFrom= 'val') %.>%
  execute(my_db, .)
print(cT)
 #    meas val
 #  1  AUC AUC
 #  2   R2  R2
 
tab <- td %.>%
  blocks_to_rowrecs(.,
                    keyColumns = NULL,
                    controlTable = cT,
                    temporary = FALSE) %.>%
  materialize(my_db, .)
 
print(tab)
 #  [1] "table(`rquery_mat_84169225764052913511_0000000000`; AUC, R2)"
  
rstr(my_db, tab)
 #  table `rquery_mat_84169225764052913511_0000000000` SQLiteConnection 
 #   nrow: 1 
 #  'data.frame':   1 obs. of  2 variables:
 #   $ AUC: num 0.6
 #   $ R2 : num 0.2
 
if(use_spark) {
  sparklyr::spark_disconnect(my_db)
} else {
  DBI::dbDisconnect(my_db)
}
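
For comparison, the same pivot can be sketched on an in-memory data frame with no database connection. This assumes build_pivot_control() and blocks_to_rowrecs() accept a local data.frame in place of the table description, consistent with the claim above that the operators work on local data.

```r
library("cdata")

# the same two-row table as in the database example, held in memory
d <- data.frame(
  meas = c("AUC", "R2"),
  val = c(0.6, 0.2),
  stringsAsFactors = FALSE)

# derive the control table from the key values found in the data
cT <- build_pivot_control(
  d,
  columnToTakeKeysFrom = "meas",
  columnToTakeValuesFrom = "val")

# pivot: with no key columns, the whole table is treated as one record
tab <- blocks_to_rowrecs(
  d,
  keyColumns = NULL,
  controlTable = cT)
```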

The cdata package is a demonstration of the "coordinatized data" theory and includes an implementation of the "fluid data" methodology. The recommended tutorial is: Fluid data reshaping with cdata. We also have a short free cdata screencast (and another example can be found here).


Install via CRAN:

install.packages("cdata")

Note: cdata is targeted at data with "tame column names" (column names that are valid both in databases, and as R unquoted variable names) and basic types (column values that are simple R types such as character, numeric, logical, and so on).
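
One way to check for tame column names before reshaping is a small base R helper. The function has_tame_names() below is hypothetical (it is not part of cdata), and it assumes a conservative reading of "valid in databases": names that start with a letter and contain only letters, digits, and underscores.

```r
# hypothetical helper, not part of cdata: TRUE when every column name is a
# valid unquoted R name and also matches a conservative database-safe pattern
has_tame_names <- function(d) {
  nms <- colnames(d)
  # make.names() alters anything that is not a syntactic R name
  # (including reserved words) and de-duplicates repeated names
  r_ok <- all(nms == make.names(nms, unique = TRUE))
  # assumed database-safe pattern: leading letter, then letters/digits/underscores
  db_ok <- all(grepl("^[A-Za-z][A-Za-z0-9_]*$", nms))
  r_ok && db_ok
}

has_tame_names(data.frame(meas = "AUC", val = 0.6))
has_tame_names(iris)
```

Under this strict pattern the dotted iris names (such as Sepal.Length) are reported as not tame, even though they work in the local-data examples above; renaming them first is the cautious choice when a database is the target.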

News

cdata 1.0.6 2019/02/14

  • More generality in control table keys.
  • Move to RUnit.
  • Less direct data.table.

cdata 1.0.5 2019/01/20

  • Unify S3 method signatures to allow generic programming over them.
  • Generic record to record transform.
  • Move more functions from DBI to rquery.

cdata 1.0.4 2019/01/04

  • More vignettes.
  • Improve doc cross-linking.
  • Switch to new f_df signature.

cdata 1.0.3 2018/10/20

  • Fix ragged gather bug.
  • More argument checking.

cdata 1.0.2 2018/10/08

  • Change defaults.
  • Some bug fixes.

cdata 1.0.1 2018/09/22

  • Clean up suggests.

cdata 1.0.0 2018/09/08

  • Neaten up uniqueness checking.

cdata 0.7.4 2018/08/16

  • rquery extension (moving methods to S3).
  • Documentation fixes.

cdata 0.7.3 2018/07/20

  • Documentation fixes.

cdata 0.7.2 2018/07/07

  • Switch local ops to data.table implementation.
  • Re-export more of wrapr.
  • Move db fns to rquery.

cdata 0.7.1 2018/06/16

  • Documentation fixes.
  • Don't export cols().
  • Reduce wrapr re-export.
  • More rows in qlook().

cdata 0.7.0 2018/04/09

  • Narrow dependencies.
  • Switch to dbExecute() (sparklyr seems to have that now).
  • Non-DB implementations for local data case.
  • Remove deprecated fns.

cdata 0.6.0 2018/03/12

  • Add cols() method.
  • Add doi link in DESCRIPTION (CRAN request).
  • Use build_frame(), draw_frame(), and qchar_frame (quoted frame) from wrapr 1.3.0.

cdata 0.5.2 2018/01/20

  • Remove append based row binding (seems to have some issues on Spark).
  • Deprecate old methods.

cdata 0.5.1 2018/01/03

  • New naming convention.
  • Doc fixes.
  • Better table lifetime controls.
  • Move to wrapr 1.0.2.
  • Move grepdf out of package.
  • Add row binder.
  • Add map_fields.
  • Add winvector_temp_db_handle support.

cdata 0.5.0 2017/11/13

  • Query-based re-implementation.
  • Fluid data workflow.
  • Remove dplyr and tidyr dependence.

cdata 0.1.7 2017/10/31

  • Better error msgs.

cdata 0.1.6 2017/10/12

  • Work around empty keyset issues.
  • Add column control.

cdata 0.1.5 2017/07/04

  • Allow NA in key columns.
  • Add optional class annotation when moving values to rows.

cdata 0.1.1 2017/05/05

  • Ungroup before calculating distinct.

cdata 0.1.0 2017/03/28

  • First release.

1.0.9 by John Mount, 2 days ago


https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/


Report a bug at https://github.com/WinVector/cdata/issues


Browse source code at https://github.com/cran/cdata


Authors: John Mount [aut, cre], Nina Zumel [aut], Win-Vector LLC [cph]



GPL-3 license


Imports wrapr, rquery, methods, stats

Suggests DBI, RSQLite, knitr, rqdatatable, RUnit


Imported by WVPlots.

