Fits Models Inside the Database

Uses 'dplyr' and 'tidyeval' to fit statistical models inside the database. It currently supports KMeans and linear regression models.


Build Status Coverage Status

modeldb

Fit models inside the database. modeldb works with several databases back-ends because it leverages dplyr and dbplyr for the final SQL translation of the algorithm. It currently supports:

  • K-means clustering

  • Linear regression

Installation

Install the development version using devtools as follows:

devtools::install_github("edgararuiz/modeldb")

Linear regression

An easy way to try out the package is by creating a temporary SQLite database, and loading mtcars to it

con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
RSQLite::initExtension(con)
dplyr::copy_to(con, mtcars)
library(dplyr)
 
tbl(con, "mtcars") %>%
  select(wt, mpg, qsec) %>%
  linear_regression(wt)
##      mpg  qsec Intercept
##    <dbl> <dbl>     <dbl>
## 1 -0.156 0.125      4.12

The model output can be parsed by tidypredict to run the predictions in the database. Please see the Linear Regression article to learn more about how to use linear_regression()

K Means clustering

To use the simple_kmeans() function, simply pipe the database back end table to the function. This returns a list object that contains two items:

  • A sql query table with the final center assignment
  • A local table with the information about the centers
km <- tbl(con, "mtcars") %>%
  simple_kmeans(mpg, wt)
 
km$centers
## # A tibble: 3 x 2
##     mpg    wt
##   <dbl> <dbl>
## 1  20.6  3.07
## 2  14.5  4.06
## 3  30.1  1.87

To preview the record level center assignment use the tbl item

head(km$tbl)
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.22.0 []
##     mpg    wt center  
##   <dbl> <dbl> <chr>   
## 1  21.0  2.62 center_1
## 2  21.0  2.88 center_1
## 3  22.8  2.32 center_1
## 4  21.4  3.22 center_1
## 5  18.7  3.44 center_1
## 6  18.1  3.46 center_1

The SQL statement from tbl can be extracted using dbplyr's remote_query()

dbplyr::remote_query(km$tbl)
## <SQL> SELECT `mpg`, `wt`, `center`
## FROM (SELECT `mpg`, `wt`, `center_1`, `center_2`, `center_3`, CASE
## WHEN (`center_1` >= `center_1` AND `center_1` < `center_2` AND `center_1` < `center_3`) THEN ('center_1')
## WHEN (`center_2` < `center_1` AND `center_2` >= `center_2` AND `center_2` < `center_3`) THEN ('center_2')
## WHEN (`center_3` < `center_1` AND `center_3` < `center_2` AND `center_3` >= `center_3`) THEN ('center_3')
## END AS `center`
## FROM (SELECT `mpg`, `wt`, SQRT(((20.6428571428571 - `mpg`) * (20.6428571428571 - `mpg`)) + ((3.07214285714286 - `wt`) * (3.07214285714286 - `wt`))) AS `center_1`, SQRT(((14.4583333333333 - `mpg`) * (14.4583333333333 - `mpg`)) + ((4.05866666666667 - `wt`) * (4.05866666666667 - `wt`))) AS `center_2`, SQRT(((30.0666666666667 - `mpg`) * (30.0666666666667 - `mpg`)) + ((1.873 - `wt`) * (1.873 - `wt`))) AS `center_3`
## FROM (SELECT *
## FROM (SELECT `mpg`, `wt`
## FROM `mtcars`)
## WHERE (NOT(((`mpg`) IS NULL)) AND NOT(((`wt`) IS NULL))))))
## WHERE (NOT(((`center`) IS NULL)))

More information can be found in the KMeans Clustering article.

Supported models

The following R models are currently supported. For more info please review the corresponding vignette:

News

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("modeldb")

0.1.1 by Edgar Ruiz, 14 days ago


https://github.com/edgararuiz/modeldb


Report a bug at https://github.com/edgararuiz/modeldb/issues


Browse source code at https://github.com/cran/modeldb


Authors: Edgar Ruiz [aut, cre]


Documentation:   PDF Manual  


GPL-3 license


Imports dplyr, rlang, purrr, tibble, tidyr, progress, ggplot2, readr

Suggests dbplyr, testthat, knitr, rmarkdown, nycflights13, RSQLite, methods, DBI, dbplot, covr


See at CRAN