Enables the creation of object pools, which make it less computationally expensive to fetch a new object. Currently the only supported pooled objects are 'DBI' connections.
The goal of the `pool` package is to abstract away the logic of connection management and the performance cost of fetching a new connection from a remote database. These concerns are especially prominent in interactive contexts, like Shiny apps (which connect to a remote database) or even at the R console. So, while this package is of most practical value to Shiny developers, there is no harm in using it in other contexts. Since `pool` integrates with both `DBI` and `dplyr`, very little will be new to you if you already use either of those packages. Essentially, you shouldn't feel the difference, with the exception of creating and closing a `Pool` object (as opposed to connecting and disconnecting a `DBIConnection` object).
Here’s a simple example of using a pool within a Shiny app (feel free to try it yourself):
```r
library(shiny)
library(dplyr)
library(pool)

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)
onStop(function() {
  poolClose(pool)
})

ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    pool %>% tbl("City") %>% filter(ID == input$ID) %>% collect()
  })
  output$popPlot <- renderPlot({
    df <- pool %>% tbl("City") %>% head(input$nrows) %>% collect()
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}

shinyApp(ui, server)
```
The `pool` package adds a new level of abstraction when connecting to a database: instead of directly fetching a connection from the database, you create an object (called a "pool") with a reference to that database. The pool holds a number of connections to the database. Some of these may be currently in use, and some may be idle, waiting for a query to request them. Each time you make a query, you are querying the pool, rather than the database. Under the hood, the pool will either give you an idle connection that it previously fetched from the database or, if it has no free connections, fetch one and give it to you. You never have to create or close connections directly: the pool knows when it should grow, shrink, or hold steady. You only need to close the pool when you're done.
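The same lifecycle applies outside of Shiny, e.g. at the R console. Here is a minimal sketch using a local in-memory SQLite database (the `RSQLite` driver is just an illustrative choice; any DBI backend works the same way):

```r
library(pool)
library(DBI)

# Create the pool: this takes the place of a direct dbConnect() call;
# the connection details are passed through to the driver
pool <- dbPool(RSQLite::SQLite(), dbname = ":memory:")

# Query the pool as if it were a regular connection; under the hood, an
# idle connection is checked out, used, and returned automatically
dbGetQuery(pool, "SELECT 1 AS x")

# Close the pool itself (not the individual connections) when you're done
poolClose(pool)
```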
When you're connecting to a database, it is important to manage your connections: when to open them (taking into account that this is a potentially long process for remote databases), how to keep track of them, and when to close them. This is always true, but it becomes especially relevant for Shiny apps, where not following best practices can lead to slowdowns (from inadvertently opening too many connections) and/or leaked connections (i.e. forgetting to close connections once you no longer need them). Over time, leaked connections accumulate, and can substantially slow down your app or even overwhelm the database itself.
Oversimplifying a bit, we can think of connection management in Shiny as a spectrum, from the extreme of having just one connection per app (potentially serving several sessions of the app) to the extreme of opening (and closing) one connection for each query you make. Neither of these approaches is great. The source code for each extreme is listed below, but reading it is not essential to understanding the problems described afterwards.
oneConnectionPerApp.R
```r
library(shiny)
library(dplyr)
library(DBI)

conn <- dbConnect(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)
onStop(function() {
  dbDisconnect(conn)
})

ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    conn %>% tbl("City") %>% filter(ID == input$ID) %>% collect()
  })
  output$popPlot <- renderPlot({
    df <- conn %>% tbl("City") %>% head(input$nrows) %>% collect()
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}

shinyApp(ui, server)
```
oneConnectionPerQuery.R
```r
library(shiny)
library(dplyr)
library(DBI)

args <- list(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    conn <- do.call(dbConnect, args)
    on.exit(dbDisconnect(conn))
    conn %>% tbl("City") %>% filter(ID == input$ID) %>% collect()
  })
  output$popPlot <- renderPlot({
    conn <- do.call(dbConnect, args)
    on.exit(dbDisconnect(conn))
    df <- conn %>% tbl("City") %>% head(input$nrows) %>% collect()
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}

shinyApp(ui, server)
```
Opening only one connection per app makes it fast (in the whole app, you only fetch one connection) and keeps your code as simple as possible. However, it makes the app fragile and slow under load: if that single connection breaks at some point (for example, the database server goes down), the app cannot recover without being restarted; and since a connection can only process one query at a time, simultaneous sessions of the app end up waiting on each other's queries.

The other extreme of opening (and closing) one connection for each query resolves both of these problems, but it is terribly slow (each time we need to access the database, we first have to fetch a connection), and it requires a lot more boilerplate code to connect and disconnect within each reactive/function.
The `pool` package was created so you don't have to worry about this at all. Since `pool` abstracts away the logic of connection management, for the vast majority of cases you never have to deal with connections directly. Since the pool "knows" when it should have more connections and how to manage them, you get all the advantages of the second approach (one connection per query) without its disadvantages. You are still using one connection per query, but that connection is checked out of and returned to the pool, rather than fetched from the database directly, which is a whole lot faster and more efficient. Finally, the code is kept just as simple as in the first approach (only one connection for the entire app). In fact, if you look back at the `pool` Shiny app example above, you will notice that the code structure is essentially the same as you'd use to open a connection at the start of an app and close it at the end.
- Allow `dbExistsTable()`, `dbRemoveTable()`, `dbWriteTable()`, `dbGetQuery()`, `dbExecute()`, `dbListFields()` and `dbReadTable()` to dispatch over the first two arguments, as per the default definition in DBI (#57). A short sketch of what this enables is shown after this list.
- Use `requireNamespace("pkg", quietly = TRUE)` for `RMySQL` and `RSQLite` in the examples and tests, since these are "Suggests" packages (i.e. not "Depends"). (commit #4205feb)
- Included more examples in the documentation. (#50)
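As an illustration of that dispatch, the usual DBI generics can be called on the pool itself (a minimal sketch, assuming a local SQLite database via `RSQLite`):

```r
library(pool)
library(DBI)

pool <- dbPool(RSQLite::SQLite(), dbname = ":memory:")

# Each of these generics dispatches on the Pool object directly
dbWriteTable(pool, "flowers", iris)
dbExistsTable(pool, "flowers")                   # TRUE
dbListFields(pool, "flowers")
dbGetQuery(pool, "SELECT count(*) FROM flowers")
dbRemoveTable(pool, "flowers")

poolClose(pool)
```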
Fixed the "test-create-destroy.R" test. Previously, this test had to be run manually because it uses later
and its async nature could not be captured by testthat
. However, using later::run_now()
immediately after the relevant code snippet (i.e. still inside the first argument of expect_*
) solves the issue. (#50)
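    The pattern looks roughly like this (a schematic sketch of the idea, not the actual test code):

    ```r
    library(testthat)

    test_that("scheduled task has run", {
      expect_identical({
        x <- 0
        later::later(function() x <<- 1, delay = 0)  # schedule the side effect
        later::run_now()  # flush later's queue before the expectation is checked
        x
      }, 1)
    })
    ```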
- Use `difftime(t1, t0, units = "secs")` when calculating a time interval. Unlike the simpler `t1 - t0`, this guarantees that the result is always a number of seconds. However, there's no change when calculating a new time (not a time interval) using `t2 <- t1 - interval`, since we want `t2` to be a time, rather than a time interval (which is what `difftime` always returns). (#50 and #48, thank you @caewok!)
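    To see why this matters: R picks the units of `t1 - t0` automatically, so the same expression can yield seconds in one run and minutes in another (a small illustration, not code from the package):

    ```r
    t0 <- Sys.time()
    t1 <- t0 + 120                    # two minutes later

    t1 - t0                           # "Time difference of 2 mins" (units auto-chosen)
    difftime(t1, t0, units = "secs")  # "Time difference of 120 secs" (always seconds)

    # Subtracting an interval from a time still yields a time, as desired:
    t2 <- t1 - as.difftime(30, units = "secs")
    ```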
- Fixed all `dbplyr` wrapper functions that weren't passing in additional arguments, because the call to the original `dbplyr` function included `... = ...` instead of `...`. Also, pass `temporary = temporary` in `copy_to.Pool`, so that we don't defeat the whole purpose of that wrapper. (#50)
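    Schematically, the corrected pattern looks like this (a hypothetical wrapper, not `pool`'s actual code): the dots are forwarded bare, and `temporary` is handed through explicitly instead of being silently dropped:

    ```r
    # Hypothetical wrapper illustrating the fix: forward `...` as-is and
    # pass `temporary` through explicitly to the wrapped function
    copy_to_wrapper <- function(dest, df, name, temporary = FALSE, ...) {
      dplyr::copy_to(dest, df, name = name, temporary = temporary, ...)
    }
    ```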
- Changed the place where the check for the maximum number of objects is made. Previously, this chunk of code was misplaced, resulting in buggy behavior: once the maximum number of objects was reached, no more objects could be checked out, even if you returned some or all of them to the pool. The only reason this wasn't spotted earlier is that the default `maxSize` is `Inf` (and there's usually no good reason to change it). (#50)
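    With the fix, returning a connection frees up a slot again (a small sketch; `maxSize` is a real `dbPool` argument, the rest is illustrative):

    ```r
    pool <- dbPool(RSQLite::SQLite(), dbname = ":memory:", maxSize = 2)

    con1 <- poolCheckout(pool)
    con2 <- poolCheckout(pool)   # the pool is now at maxSize
    # poolCheckout(pool)         # a third checkout here would error

    poolReturn(con1)             # frees a slot...
    con3 <- poolCheckout(pool)   # ...so this checkout succeeds again

    poolReturn(con2)
    poolReturn(con3)
    poolClose(pool)
    ```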
- Fix #39: moved `dplyr` support in `pool` from `dplyr` 0.5.0 to `dplyr` 0.7.0, which included many breaking changes, among them the addition of a brand-new package called `dbplyr`. (#42)

    For `pool` users, the main change is that all the `src_*` functions are now gone (from `dplyr` and `pool`). Therefore, if you had something like:

    ```r
    data <- src_pool(pool) %>% tbl("test")
    ```

    you can just change it to the simpler:

    ```r
    data <- pool %>% tbl("test")
    ```

    If you're still on an old version of `dplyr` and want to use `pool` as well, please install the package using the tag created for that purpose:

    ```r
    devtools::install_github("rstudio/[email protected]")
    ```
- Changed all time arguments to accept a number of seconds, instead of milliseconds. This is what the `later` package uses, and there was no reason for `pool` to be different, except backward compatibility. Since both time arguments to `dbPool` (`idleTimeout` and `validationInterval`) have default values, we're hoping this change won't even be noticed by most users. If you were setting either of those directly, however, you will need to update your app when you update the `pool` package. (#44)
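    For example, an idle timeout of one minute is now written as follows (the values here are arbitrary):

    ```r
    pool <- dbPool(
      RSQLite::SQLite(), dbname = ":memory:",
      idleTimeout = 60,          # seconds (previously this would have been 60000)
      validationInterval = 600   # seconds
    )
    ```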
- Dropped the `Pool` methods around `dbConnect` and `dbDisconnect`, because they made it easy to lose track of whether you're operating on a `Pool` object or directly on a database connection. From now on, these are the only two functions for getting a connection from the pool and returning it, respectively: (#44)

    ```r
    con <- poolCheckout(pool)
    poolReturn(con)
    ```
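    When you do need a raw connection (e.g. for code that must run on a single connection), a safe pattern is to pair the checkout with `on.exit()`, so the connection is returned even on error (`run_query()` is a hypothetical helper, just to show the pattern):

    ```r
    # Hypothetical helper: check out a connection and always return it
    run_query <- function(pool, sql) {
      con <- poolCheckout(pool)
      on.exit(poolReturn(con))  # runs even if the query errors
      dbGetQuery(con, sql)
    }
    ```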
- Use the `later` package for scheduling tasks. (#44) This also has the side effect of fixing #40 and #43, since `later` allows us to get rid of the `naiveScheduler` completely.