Object Pooling

Enables the creation of object pools, which make it less computationally expensive to fetch a new object. Currently the only supported pooled objects are 'DBI' connections.


Travis-CI Build Status AppVeyor Build Status CRAN_Status_Badge

The goal of the pool package is to abstract away the logic of connection management and the performance cost of fetching a new connection from a remote database. These concerns are especially prominent in interactive contexts, like Shiny apps (which connect to a remote database) or even at the R console. So, while this package is of most practical value to Shiny developers, there is no harm if it is used in other contexts. Since pool integrates with both DBI and dplyr, there are very few things that will be new to you, if you're already using either of those packages. Essentially, you shouldn't feel the difference, with the exception of creating and closing a Pool object (as opposed to connecting and disconnecting a DBIConnection object).

Usage

Here’s a simple example of using a pool within a Shiny app (feel free to try it yourself):

library(shiny)
library(dplyr)
library(pool)
 
pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)
onStop(function() {
  poolClose(pool)
})
 
ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)
 
server <- function(input, output, session) {
  output$tbl <- renderTable({
    pool %>% tbl("City") %>% filter(ID == input$ID) %>% collect()
  })
  output$popPlot <- renderPlot({
    df <- pool %>% tbl("City") %>% head(input$nrows) %>% collect()
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}
 
shinyApp(ui, server)

Concept

The pool package adds a new level of abstraction when connecting to a database: instead of directly fetching a connection from the database, you will create an object (called a pool) with a reference to that database. The pool holds a number of connections to the database. Some of these may be currently in-use and some of these may be idle, waiting for a query to request them. Each time you make a query, you are querying the pool, rather than the database. Under the hood, the pool will either give you an idle connection that it previously fetched from the database or, if it has no free connections, fetch one and give it to you. You never have to create or close connections directly: the pool knows when it should grow, shrink or keep steady. You only need to close the pool when you’re done.

Context and motivation

When you’re connecting to a database, it is important to manage your connections: when to open them (taking into account that this is a potentially long process for remote databases), how to keep track of them, and when to close them. This is always true, but it becomes especially relevant for Shiny apps, where not following best practices can lead to many slowdowns (from inadvertently opening too many connections) and/or many leaked connections (i.e. forgetting to close connections once you no longer need them). Over time, leaked connections could accumulate and substantially slow down your app, as well as overwhelming the database itself.

Oversimplifying a bit, we can think of connection management in Shiny as a spectrum from the extreme of just having one connection per app (potentially serving several sessions of the app) to the extreme of opening (and closing) one connection for each query you make. Neither of these approaches is great. You can expand either of the arrows below to see the source code for each extreme, but that is not essential to understanding the problems described below.

oneConnectionPerApp.R
library(shiny)
library(dplyr)
library(DBI)
 
conn <- dbConnect(
    drv = RMySQL::MySQL(),
    dbname = "shinydemo",
    host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
    username = "guest",
    password = "guest"
  )
onStop(function() {
  dbDisconnect(conn)
})
 
ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)
 
server <- function(input, output, session) {
  output$tbl <- renderTable({
    conn %>% tbl("City") %>% filter(ID == input$ID) %>% collect()
  })
  output$popPlot <- renderPlot({
    df <- conn %>% tbl("City") %>% head(input$nrows) %>% collect()
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}
 
shinyApp(ui, server)
oneConnectionPerQuery.R
library(shiny)
library(dplyr)
library(DBI)
 
args <- list(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)
 
ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)
 
server <- function(input, output, session) {
  output$tbl <- renderTable({
    conn <- do.call(dbConnect, args)
    on.exit(dbDisconnect(conn))
 
    conn %>% tbl("City") %>% filter(ID == input$ID) %>% collect()
  })
  output$popPlot <- renderPlot({
    conn <- do.call(dbConnect, args)
    on.exit(dbDisconnect(conn))
 
    df <- conn %>% tbl("City") %>% head(input$nrows) %>% collect()
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}
 
shinyApp(ui, server)

Opening only one connection per app makes it fast (because, in the whole app, you only fetch one connection) and your code is kept as simple as possible. However:

  • it cannot handle simultaneous requests (e.g. two sessions open, both querying the database at the same time);
  • if the connection breaks at some point (maybe the database server crashed), you won’t get a new connection (you have to exit the app and re-run it);
  • finally, if you are not quite at this extreme, and you use more than one connection per app (but fewer than one connection per query), it can be difficult to keep track of all your connections, since you’ll be opening and closing them in potentially very different places.

While the other extreme of opening (and closing) one connection for each query you make resolves all of these points, it is terribly slow (each time we need to access the database, we first have to fetch a connection), and you need a lot more (boilerplate) code to connect and disconnect the connection within each reactive/function.

The pool package was created so you don't have to worry about this at all. Since pool abstracts away the logic of connection management, for the vast majority of cases, you never have to deal with connections directly. Since the pool “knows” when it should have more connections and how to manage them, you have all the advantages of the second approach (one connection per query), without the disadvantages. You are still using one connection per query, but that connection is always fetched and returned to the pool, rather than getting it from the database directly. This is a whole lot faster and more efficient. Finally, the code is kept just as simple as the code in the first approach (only one connection for the entire app). In fact, if you look back at the pool Shiny app example above, you will notice that the code structure is essentially the same that you'd use to open a connection at the start of an app and close it at the end.

More resources

  • db.rstudio.com for a lot of best practices, articles and demos on databases in R and in RStudio.
  • shiny.rstudio.com to learn more about Shiny (there are also more articles, including on data and databases in /articles).

News

pool 0.1.4.1

  • Change package maintainer

pool 0.1.4

  • Changed the methods dbExistsTable(), dbRemoveTable(), dbWriteTable(), dbGetQuery(), dbExecute(), dbListFields() and dbReadTable() to dispatch over the first two arguments, as per the default definition in DBI. (#57)

pool 0.1.3

  • Use requireNamespace("pkg", quietly = TRUE) for RMySQL and RSQLite in the examples and tests since these are "Suggests" packages (i.e. not "Depends"). (commit #4205feb)

pool 0.1.2

Minor new features and improvements

  • Included more examples in the documentation. (#50)

  • Fixed the "test-create-destroy.R" test. Previously, this test had to be run manually because it uses later and its async nature could not be captured by testthat. However, using later::run_now() immediately after the relevant code snippet (i.e. still inside the first argument of expect_*) solves the issue. (#50)

  • Use difftime(t1, t0, units = "secs") when calculating a time interval. Unlike the simpler t1 - t0 method, this guarantees that the result will always be consistently the number of seconds. However, there's no change when calculating a new time (not a time interval) using t2 <- t1 - interval, since we want t2 to be a time, rather than a time interval (which is always what is returned by difftime). (#50 and #48, thank you @caewok!)

Bug fixes

  • Fix all dbplyr wrapper functions that weren't passing in additional arguments because the call to the original dbplyr function included ... = ... instead of .... Also, pass temporary = temporary in copy_to.Pool, so that we don't defeat the whole purpose of that wrapper. (#50)

  • Change the place where the check for the maximum number of objects is made. Previously, this chunk of code was misplaced and it would result in buggy behavior: namely, once the maximum number of objects was reached, no more objects could be checked out (even if you returned any/all objects back to the pool). The only reason this wasn't spotted earlier is because the default maxSize is Inf (and there's usually not a good reason to change it). (#50)

pool 0.1.1

Breaking changes

  • Fix #39: Moved dplyr support in pool from dplyr 0.5.0 to dplyr 0.7.0, which includes a lot of breaking changes including the addition of a brand new package called dbplyr. (#42)

    For pool users, the main change is that all the src_* functions are now gone (from dplyr and pool). Therefore, if you had something like:

    data <- src_pool(pool) %>% tbl("test")

    You can just change it to the simpler:

    data <- pool %>% tbl("test")

    If you're still on an old version of dplyr and want to use pool as well, please install the package using the tag created for that purpose:

    devtools::install_github("rstudio/[email protected]")
  • Changed all time arguments to accept number of seconds, instead of milliseconds. This is because this is what the later package uses and there was no reason for pool to be different, except backward compatibility. Since both time arguments to dbPool (idleTimeout and validationInterval) have default values, we're hoping this change won't even be noticed by most users. If you were setting either of those directly, however, you will need to update your app if you update the pool package. (#44)

  • Dropped the Pool methods around dbConnect and dbDisconnect, because these made it easier to lose track of whether you're operating on a Pool object or on a database connection directly. From now on, only these allow you to get a connection from the pool and return it back, respectively: (#44)

    con <- poolCheckout(pool)
    poolReturn(con)

New features

  • Use later package for scheduling tasks (#44). This also has the side effect of fixing #40 and #43 since later allows us to get rid of the naiveScheduler completely.

Library updates

  • Roxygen 5.0.1 to 6.0.1. (commit #9952000)

pool 0.1.0

  • Initial release!

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("pool")

0.1.4.2 by Joe Cheng, 9 days ago


https://github.com/rstudio/pool


Report a bug at https://github.com/rstudio/pool/issues


Browse source code at https://github.com/cran/pool


Authors: Joe Cheng [aut, cre] , Barbara Borges [aut] , RStudio [cph]


Documentation:   PDF Manual  


Task views: Databases with R


GPL-3 license


Imports DBI, R6, dplyr, dbplyr, later

Depends on methods

Suggests testthat, tibble, RSQLite, RMySQL


Suggested by sf.


See at CRAN