Supplies higher-order fluid data transform operators that include pivot and anti-pivot as special cases. The methodology is described in 'Zumel', 2018, "Fluid data reshaping with 'cdata'", <http://winvector.github.io/FluidData/FluidDataReshapingWithCdata.html>, doi:10.5281/zenodo.1173299. Works on in-memory data or on remote data using 'rquery' and the 'DBI' database interface.
cdata is a general data re-shaper that has the great virtue of adhering to the so-called "Rule of Representation":
"Fold knowledge into data, so program logic can be stupid and robust."
The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003
The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.
Briefly: cdata supplies data transform operators that work on in-memory data or on any DBI data source, and that include the operations commonly called pivot and un-pivot as special cases.
A quick example: plot iris petal and sepal dimensions in a faceted graph.
iris <- data.frame(iris)

library("ggplot2")
library("cdata")

# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
  flower_part, Length      , Width       |
  Petal      , Petal.Length, Petal.Width |
  Sepal      , Sepal.Length, Sepal.Width )

# do the unpivot to convert the row records to block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = c("Species"))

ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) +
  facet_wrap(~flower_part, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2")
More details on the above example can be found here. A tutorial on how to design a controlTable can be found here. And some discussion of the nature of records in cdata can be found here.
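As a rough sketch of how the same kind of control table can drive the inverse transform, the example below moves a small data frame to block records and back. The `id` column and the data values are made up for illustration: iris has no row key, and `blocks_to_rowrecs()` needs key columns to know which block rows belong to one record. The control table here is written with plain `data.frame()` rather than `qchar_frame()`.

```r
library("cdata")

# hypothetical data: one row record per flower, keyed by a made-up id column
d <- data.frame(
  id = c(1, 2),
  Petal.Length = c(1.4, 4.7),
  Petal.Width = c(0.2, 1.4),
  Sepal.Length = c(5.1, 7.0),
  Sepal.Width = c(3.5, 3.2),
  stringsAsFactors = FALSE)

# key column flower_part, value columns Length and Width
controlTable <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length = c("Petal.Length", "Sepal.Length"),
  Width = c("Petal.Width", "Sepal.Width"),
  stringsAsFactors = FALSE)

# row records -> block records (two block rows per id)
blocks <- rowrecs_to_blocks(d, controlTable, columnsToCopy = "id")

# block records -> row records, using id to group block rows into records
back <- blocks_to_rowrecs(blocks, keyColumns = "id", controlTable = controlTable)
back <- back[order(back$id), , drop = FALSE]
```

The only thing that changes between the two directions is which function reads the control table; the table itself is unchanged.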
We can also exhibit a larger example of using cdata to create a scatter-plot matrix, or pair plot:
iris <- data.frame(iris)

library("ggplot2")
library("cdata")

# declare our columns of interest
meas_vars <- qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"

# build a control with all pairs of variables as value columns
# and pair_key as the key column
controlTable <- data.frame(expand.grid(meas_vars, meas_vars,
                                       stringsAsFactors = FALSE))
# name the value columns value1 and value2
colnames(controlTable) <- qc(value1, value2)
# insert first, or key column
controlTable <- cbind(
  data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
             stringsAsFactors = FALSE),
  controlTable)

# do the unpivot to convert the row records to multiple block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = category_variable)

# unpack the key column into two variable keys for the facet_grid
splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$v1 <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$v2 <- vapply(splt, function(si) si[[2]], character(1))

ggplot(iris_aug, aes(x=value1, y=value2)) +
  geom_point(aes_string(color=category_variable, shape=category_variable)) +
  facet_grid(v2~v1, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) +
  xlab(NULL)
The above is now wrapped into a one-line command in WVPlots.
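For reference, a hedged sketch of what that one-line call might look like. It assumes the wrapper in question is `WVPlots::PairPlot()`, taking the data frame, the measurement columns, a title, and a `group_var` argument. The text above does not name the function, so check the WVPlots documentation before relying on this.

```r
library("WVPlots")

# assumed wrapper and argument order: PairPlot(d, meas_vars, title, ..., group_var)
plt <- PairPlot(
  iris,
  c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  "Iris dimensions",
  group_var = "Species")

print(plt)
```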
And a quick database example:
library("cdata")
library("rquery")

use_spark <- FALSE

if(use_spark) {
  my_db <- sparklyr::spark_connect(version='2.2.0',
                                   master = "local")
} else {
  my_db <- DBI::dbConnect(RSQLite::SQLite(),
                          ":memory:")
}

# pivot example
d <- wrapr::build_frame(
  "meas", "val" |
  "AUC" , 0.6   |
  "R2"  , 0.2   )
DBI::dbWriteTable(my_db,
                  'd',
                  d,
                  temporary = TRUE)
rstr(my_db, 'd')
# table `d` SQLiteConnection
#  nrow: 2
# 'data.frame': 2 obs. of 2 variables:
#  $ meas: chr "AUC" "R2"
#  $ val : num 0.6 0.2

td <- db_td(my_db, "d")
td
# [1] "table(`d`; meas, val)"

cT <- td %.>%
  build_pivot_control(.,
                      columnToTakeKeysFrom= 'meas',
                      columnToTakeValuesFrom= 'val') %.>%
  execute(my_db, .)
print(cT)
#   meas val
# 1  AUC AUC
# 2   R2  R2

tab <- td %.>%
  blocks_to_rowrecs(.,
                    keyColumns = NULL,
                    controlTable = cT,
                    temporary = FALSE) %.>%
  materialize(my_db, .)

print(tab)
# [1] "table(`rquery_mat_84169225764052913511_0000000000`; AUC, R2)"

rstr(my_db, tab)
# table `rquery_mat_84169225764052913511_0000000000` SQLiteConnection
#  nrow: 1
# 'data.frame': 1 obs. of 2 variables:
#  $ AUC: num 0.6
#  $ R2 : num 0.2

if(use_spark) {
  sparklyr::spark_disconnect(my_db)
} else {
  DBI::dbDisconnect(my_db)
}
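The same pivot can be run on an in-memory data frame with no database connection. This is a minimal sketch that reuses the function names from the example above, on the assumption that `build_pivot_control()` and `blocks_to_rowrecs()` accept a local `data.frame` in place of the table description.

```r
library("cdata")

d <- data.frame(
  meas = c("AUC", "R2"),
  val = c(0.6, 0.2),
  stringsAsFactors = FALSE)

# control table derived from the data itself
cT <- build_pivot_control(
  d,
  columnToTakeKeysFrom = "meas",
  columnToTakeValuesFrom = "val")

# no key columns: all block rows collapse into a single row record
tab <- blocks_to_rowrecs(
  d,
  keyColumns = NULL,
  controlTable = cT)

print(tab)
```

Under that assumption, `tab` should hold one row with columns `AUC` and `R2`, mirroring the materialized database table above.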
The cdata package is a demonstration of the "coordinatized data" theory and includes an implementation of the "fluid data" methodology. The recommended tutorial is: Fluid data reshaping with cdata. We also have a short free cdata screencast (and another example can be found here).
Install via CRAN:
install.packages("cdata")
Note: cdata is targeted at data with "tame column names" (column names that are valid both in databases and as unquoted R variable names) and basic types (column values that are simple R types such as character, numeric, logical, and so on).
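One way to screen a data frame against this note before reshaping is sketched below in base R. Both helpers, `is_tame_name()` and `is_basic_column()`, are hypothetical and not part of cdata. The name rule (a leading letter followed by letters, digits, or underscores) is a conservative assumption about what databases accept unquoted; it does not check SQL reserved words.

```r
# Hypothetical helper: TRUE for names that start with a letter, contain only
# letters, digits, and underscores, and are already syntactic R names.
# Assumed rule for illustration; this is not cdata's own check.
is_tame_name <- function(nms) {
  grepl("^[A-Za-z][A-Za-z0-9_]*$", nms) & (make.names(nms) == nms)
}

# Hypothetical helper: TRUE for columns of a simple base type.
is_basic_column <- function(col) {
  is.character(col) || is.numeric(col) || is.logical(col)
}

d <- data.frame(
  meas = c("AUC", "R2"),
  val = c(0.6, 0.2),
  stringsAsFactors = FALSE)

name_ok <- is_tame_name(colnames(d))
type_ok <- vapply(d, is_basic_column, logical(1))

# a name with a space and an R reserved word are both rejected
examples <- is_tame_name(c("val", "my col", "if", "x1"))
```

This check is deliberately strict: a name such as `Sepal.Length` is a valid R name but fails here, because a dot is generally not valid in an unquoted SQL identifier.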