Leverages 'dplyr' to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels: outputs a 'ggplot', outputs the calculations, outputs the formula needed to calculate bins.
Leverages dplyr
to process the calculations of a plot inside a
database. This package provides helper functions that abstract the work
at three levels:
ggplot2
objectdata.frame
object with the calculations# You can install the released version from CRANinstall.packages("dbplot")# Or the the development version from GitHub:install.packages("devtools")devtools::install_github("edgararuiz/dbplot")
For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com
To use Spark, please visit the sparklyr
official website:
http://spark.rstudio.com
In addition to database connections, the functions work with sparklyr
.
A Spark DataFrame will be used for the examples in this README.
library(sparklyr)sc <- spark_connect(master = "local", version = "2.1.0")spark_flights <- copy_to(sc, nycflights13::flights, "flights")
ggplot
By default dbplot_histogram()
creates a 30 bin histogram
library(ggplot2)spark_flights %>%dbplot_histogram(sched_dep_time)
Use binwidth
to fix the bin size
spark_flights %>%dbplot_histogram(sched_dep_time, binwidth = 200)
Because it outputs a ggplot2
object, more customization can be done
spark_flights %>%dbplot_histogram(sched_dep_time, binwidth = 300) +labs(title = "Flights - Scheduled Departure Time") +theme_bw()
To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.
A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.
fill
argument is passed, the default calculation will be
count, n()
spark_flights %>%filter(!is.na(arr_delay)) %>%dbplot_raster(arr_delay, dep_delay)
spark_flights %>%filter(!is.na(arr_delay)) %>%dbplot_raster(arr_delay, dep_delay, mean(distance, na.rm = TRUE))
resolution
argument controls that, it defaults to 100spark_flights %>%filter(!is.na(arr_delay)) %>%dbplot_raster(arr_delay, dep_delay, mean(distance, na.rm = TRUE), resolution = 500)
dbplot_bar()
defaults to a tally() of each value in a discrete
variablespark_flights %>%dbplot_bar(origin)
spark_flights %>%dbplot_bar(origin, mean(dep_delay))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
dbplot_line()
defaults to a tally() of each value in a discrete
variablespark_flights %>%dbplot_line(month)
spark_flights %>%dbplot_line(month, mean(dep_delay))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
spark_flights %>%dbplot_boxplot(origin, dep_delay)
If a more customized plot is needed, the data the underpins the plots can also be accessed:
db_compute_bins()
- Returns a data frame with the bins and count
per bindb_compute_count()
- Returns a data frame with the count per
discrete valuedb_compute_raster()
- Returns a data frame with the results per
x/y intersectiondb_compute_boxplot()
- Returns a data frame with boxplot
calculationsspark_flights %>%db_compute_bins(arr_delay)
## # A tibble: 28 x 2
## arr_delay count
## <dbl> <dbl>
## 1 4.53 79784.
## 2 -40.7 207999.
## 3 95.1 7890.
## 4 49.8 19063.
## 5 819. 8.
## 6 140. 3746.
## 7 321. 232.
## 8 231. 921.
## 9 -86.0 5325.
## 10 186. 1742.
## # ... with 18 more rows
The data can be piped to a plot
spark_flights %>%filter(arr_delay < 100 , arr_delay > -50) %>%db_compute_bins(arr_delay) %>%ggplot() +geom_col(aes(arr_delay, count, fill = count))
db_bin()
Uses ‘rlang’ to build the formula needed to create the bins of a numeric variable in an un-evaluated fashion. This way, the formula can be then passed inside a dplyr verb.
db_bin(var)
## (((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30) * ifelse(as.integer(floor((var -
## min(var, na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var,
## na.rm = TRUE))/30))) == 30, as.integer(floor((var - min(var,
## na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30))) -
## 1, as.integer(floor((var - min(var, na.rm = TRUE))/((max(var,
## na.rm = TRUE) - min(var, na.rm = TRUE))/30))))) + min(var,
## na.rm = TRUE)
spark_flights %>%group_by(x = !! db_bin(arr_delay)) %>%tally()
## # Source: lazy query [?? x 2]
## # Database: spark_connection
## x n
## <dbl> <dbl>
## 1 4.53 79784.
## 2 -40.7 207999.
## 3 95.1 7890.
## 4 49.8 19063.
## 5 819. 8.
## 6 140. 3746.
## 7 321. 232.
## 8 231. 921.
## 9 -86.0 5325.
## 10 186. 1742.
## # ... with more rows
spark_flights %>%filter(!is.na(arr_delay)) %>%group_by(x = !! db_bin(arr_delay)) %>%tally()%>%collect %>%ggplot() +geom_col(aes(x, n))
Supports multiple aggregations for bar and line charts
Supports naming aggregations for bar and line charts
Adds compatability with rlang 0.2.0 upgrade
Improves dependency management
Adds compatability with dbplyr 1.2.0 upgrade
Adds complete
argument to db_compute_raster()
and dbplot_raster()
which fills in empty bins (#5)
Coerce aggregate results using as.numeric()
to handle integer64
results (#6)
compute
functions now return an ungrouped data.frame
unused argument (na.rm = TRUE)
message when used with the CRAN version of dbplyr
(#3)