Relational Query Generator for Data Manipulation at Scale

A piped query generator based on Edgar F. Codd's relational algebra, and on production experience using 'SQL' and 'dplyr' at big data scale. The design represents an attempt to make 'SQL' more teachable by denoting composition by a sequential pipeline notation instead of nested queries or functions. The implementation delivers reliable high performance data processing on large data systems such as 'Spark', databases, and 'data.table'. Package features include: data processing trees or pipelines as observable objects (able to report both columns produced and columns used), optimized 'SQL' generation as an explicit user visible table modeling step, plus explicit query reasoning and checking.

rquery is a piped query generator based on Codd's relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production).

rquery is currently recommended for use with data.table (via rqdatatable), PostgreSQL, sparklyr, SparkR, MonetDBLite, and (and with non-window functionality with RSQLite). It can target various databases through its adapter layer.

To install: devtools::install_github("WinVector/rquery") or install.packages("rquery").

Note: rquery is a "database first" design. This means choices are made that favor database implementation. These include: capturing the entire calculation prior to doing any work (and using recursive methods to inspect this object, which can limit the calculation depth to under 1000 steps at a time), preferring "tame column names" (which isn't a bad idea in R anyway as columns and variables are often seen as cousins), and not preserving row or column order (or supporting numeric column indexing). Also, rquery does have a fast in-memory implementation: rqdatatable (thanks to the data.table package), so one can in fact use rquery without a database.


rquery can be an excellent advanced SQL training tool (it shows how to build some very deep SQL by composing rquery operators). Currently rquery is biased towards the Spark and PostgeSQL SQL dialects.

There are many prior relational algebra inspired specialized query languages. Just a few include:

rquery is realized as a thin translation to an underlying SQL provider. We are trying to put the Codd relational operators front and center (using the original naming, and back-porting SQL progress such as window functions to the appropriate relational operator).

The primary relational operators include:

  • extend(). Extend adds derived columns to a relation table. With a sufficiently powerful SQL provider this includes ordered and partitioned window functions. This operator also includes built-in seplyr-style assignment partitioning. extend() can also alter existing columns, though we note this is not always a relational operation (it can lose row uniqueness).
  • project(). Project is usually portrayed as the equivalent to column selection, though the original definition includes aggregation. In our opinion the original relational nature of the operator is best captured by moving SQL's "GROUP BY" aggregation functionality.
  • natural_join(). This a specialized relational join operator, using all common columns as an equi-join condition.
  • theta_join(). This is the relational join operator allowing an arbitrary matching predicate.
  • select_rows(). This is Codd's relational row selection. Obviously select alone is an over-used and now ambiguous term (for example: it is already used as the "doit" verb in SQL and the column selector in dplyr).
  • rename_columns(). This operator renames sets of columns.
  • set_indicator(). This operator produces a new column indicating set membership of a named column.

(Note rquery prior to version 1.2.1 used a _nse() suffix yielding commands such as extend_nse() instead of the newer extend() shown here).

The primary non-relational (traditional SQL) operators are:

  • select_columns(). This allows choice of columns (central to SQL), but is not a relational operator as it can damage row-uniqueness.
  • orderby(). Row order is not a concept in the relational algebra (and also not maintained in most SQL implementations). This operator is only useful when used with its limit= option, or as the last step as data comes out of the relation store and is moved to R (where row-order is usually maintained).
  • map_column_values() re-map values in columns (very useful for re-coding data, currently implemented as a sql_node()).
  • unionall() concatenate tables.

And rquery supports higher-order (written in terms of other operators, both package supplied and user supplied):

  • pick_top_k(). Pick top k rows per group given a row ordering.
  • assign_slice(). Conditionally assign sets of rows and columns a scalar value.
  • if_else_op(). Simulate simultaneous if/else assignments.

rquery also has implementation helpers for building both SQL-nodes (nodes that are just SQL expressions) and non-SQL-nodes (nodes that are general functions of their input data values).

The primary missing relational operators are:

  • Union.
  • Direct set difference, anti-join.
  • Division.

One of the principles of rquery is to prefer expressive nodes, and not depend on complicated in-node expressions.

A great benefit of Codd's relational algebra is it gives one concepts to decompose complex data transformations into sequences of simpler transformations.

Some reasons SQL seems complicated include:

  • SQL's realization of sequencing as nested function composition.
  • SQL uses some relational concepts as steps, others as modifiers and predicates.

A lot of the grace of the Codd theory can be recovered through the usual trick changing function composition notation from g(f(x)) to x . f() . g(). This experiment is asking (and not for the first time): "what if SQL were piped (expressed composition as a left to right flow, instead of a right to left nesting)?"

Let's work a non-trivial example: the dplyr pipeline from Let’s Have Some Sympathy For The Part-time R User.

use_spark <- FALSE
if(use_spark) {
  raw_connection <- sparklyr::spark_connect(version='2.2.0', 
                                   master = "local")
  cname <- rq_connection_name(raw_connection)
                      "USING PARQUET OPTIONS ('compression'='snappy')")
} else {
  driver <- RPostgreSQL::PostgreSQL()
  raw_connection <- DBI::dbConnect(driver,
                          host = 'localhost',
                          port = 5432,
                          user = 'johnmount',
                          password = '')
dbopts <- rq_connection_tests(raw_connection)
##   RS-DBI driver: (could not Retrieve the result : ERROR:  syntax error at or near "INT"
## LINE 1: ...ARY VIEW "rq_test_23470105350733459902_0000000000" ( x INT )
##                                                                   ^
## )
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "rq_test_23470105350733459902_0000000000" does not exist
## LINE 1: SELECT * FROM "rq_test_23470105350733459902_0000000000" LIMI...
##                       ^
## )
db <- rquery_db_info(connection = raw_connection,
                     is_dbi = TRUE,
                     connection_options = dbopts)
# copy data in so we have an example
d_local <- build_frame(
   "subjectID", "surveyCategory"     , "assessmentTotal", "irrelevantCol1", "irrelevantCol2" |
   1L         , "withdrawal behavior", 5                , "irrel1"        , "irrel2"         |
   1L         , "positive re-framing", 2                , "irrel1"        , "irrel2"         |
   2L         , "withdrawal behavior", 3                , "irrel1"        , "irrel2"         |
   2L         , "positive re-framing", 4                , "irrel1"        , "irrel2"         )
rq_copy_to(db, 'd',
            temporary = TRUE, 
            overwrite = TRUE)
## [1] "table(\"d\"; subjectID, surveyCategory, assessmentTotal, irrelevantCol1, irrelevantCol2)"
# produce a hande to existing table
d <- db_td(db, "d")

Note: in examples we use rq_copy_to() to create data. This is only for the purpose of having easy portable examples. With big data the data is usually already in the remote database or Spark system. The task is almost always to connect and work with this pre-existing remote data and the method to do this is db_td(), which builds a reference to a remote table given the table name. The suggested pattern for working with remote tables is to get inputs via db_td() and land remote results with materialze(). To work with local data one can copy data from memory to the database with rq_copy_to() and bring back results with execute() (though be aware operation on remote non-memory data is rquery's primary intent).

First we show the Spark/database version of the original example data:

## [1] "rquery_db_info"
## [1] "rquery_db_info(PostgreSQLConnection, is_dbi=TRUE, note=\"\")"
## [1] "relop_table_source" "relop"
## [1] "table(\"d\"; subjectID, surveyCategory, assessmentTotal, irrelevantCol1, irrelevantCol2)"
# remote structure inspection
rstr(db, d$table_name)
## table "d" rquery_db_info 
##  nrow: 4 
## 'data.frame':    4 obs. of  5 variables:
##  $ subjectID      : int  1 1 2 2
##  $ surveyCategory : chr  "withdrawal behavior" "positive re-framing" "withdrawal behavior" "positive re-framing"
##  $ assessmentTotal: num  5 2 3 4
##  $ irrelevantCol1 : chr  "irrel1" "irrel1" "irrel1" "irrel1"
##  $ irrelevantCol2 : chr  "irrel2" "irrel2" "irrel2" "irrel2"
# or execute the table representation to bring back data
d %.>%
  execute(db, .) %.>%
subjectID surveyCategory assessmentTotal irrelevantCol1 irrelevantCol2
1 withdrawal behavior 5 irrel1 irrel2
1 positive re-framing 2 irrel1 irrel2
2 withdrawal behavior 3 irrel1 irrel2
2 positive re-framing 4 irrel1 irrel2

Now we re-write the original calculation in terms of the rquery SQL generating operators.

scale <- 0.237
dq <- d %.>%
         probability :=
           exp(assessmentTotal * scale))  %.>% 
                 partitionby = 'subjectID') %.>%
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability')) %.>% 
  rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
  select_columns(., c('subjectID', 
                      'probability')) %.>%
  orderby(., cols = 'subjectID')

(Note one can also use the named map builder alias %:=% if there is concern of aliasing with data.table's definition of :=.)

We then generate our result:

result <- materialize(db, dq)
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "rquery_mat_38954678098867775608_0000000000" does not exist
## LINE 1: SELECT * FROM "rquery_mat_38954678098867775608_0000000000" L...
##                       ^
## )
## [1] "relop_table_source" "relop"
## [1] "table(\"rquery_mat_38954678098867775608_0000000000\"; subjectID, diagnosis, probability)"
DBI::dbReadTable(db$connection, result$table_name) %.>%
subjectID diagnosis probability
1 withdrawal behavior 0.6706221
2 positive re-framing 0.5589742

We see we have quickly reproduced the original result using the new database operators. This means such a calculation could easily be performed at a "big data" scale (using a database or Spark; in this case we would not take the results back, but instead use CREATE TABLE tname AS to build a remote materialized view of the results).

A bonus is, thanks to data.table and the rqdatatable packages we can run the exact same operator pipeline on local data.

d_local %.>% 
  dq %.>%
subjectID diagnosis probability
1 withdrawal behavior 0.6706221
2 positive re-framing 0.5589742

Notice we applied the pipeline by piping data into it. This ability is a feature of the dot arrow pipe we are using here.

The actual SQL query that produces the database result is, in fact, quite involved:

cat(to_sql(dq, db, source_limit = 1000))
   "subjectID" AS "subjectID",
   "surveyCategory" AS "diagnosis",
   "probability" AS "probability"
  FROM (
     row_number ( ) OVER (  PARTITION BY "subjectID" ORDER BY "probability" DESC, "surveyCategory" ) AS "row_number"
    FROM (
      "probability" / sum ( "probability" ) OVER (  PARTITION BY "subjectID" ) AS "probability"
     FROM (
       exp ( "assessmentTotal" * 0.237 )  AS "probability"
      FROM (
        "d" LIMIT 1000
       ) tsql_97627007110742070403_0000000000
      ) tsql_97627007110742070403_0000000001
     ) tsql_97627007110742070403_0000000002
   ) tsql_97627007110742070403_0000000003
   WHERE "row_number" <= 1
  ) tsql_97627007110742070403_0000000004
 ) tsql_97627007110742070403_0000000005
) tsql_97627007110742070403_0000000006 ORDER BY "subjectID"

The query is large, but due to its regular structure it should be very amenable to query optimization.

A feature to notice is: the query was automatically restricted to just columns actually needed from the source table to complete the calculation. This has the possibility of decreasing data volume and greatly speeding up query performance. Our initial experiments show rquery narrowed queries to be twice as fast as un-narrowed dplyr on a synthetic problem simulating large disk-based queries. We think if we connected directly to Spark's relational operators (avoiding the SQL layer) we may be able to achieve even faster performance.

The above optimization is possible because the rquery representation is an intelligible tree of nodes, so we can interrogate the tree for facts about the query. For example:

## [1] "subjectID"   "diagnosis"   "probability"
## [1] "d"
## $d
## [1] "subjectID"       "surveyCategory"  "assessmentTotal"

The additional record-keeping in the operator nodes allows checking and optimization (such as query narrowing). The flow itself is represented as follows:

  irrelevantCol2) %.>%
  probability := exp(assessmentTotal * 0.237)) %.>%
  probability := probability / sum(probability),
  p= subjectID) %.>%
  row_number := row_number(),
  p= subjectID,
  o= "probability" DESC, "surveyCategory") %.>%
   row_number <= 1) %.>%
  c('diagnosis' = 'surveyCategory')) %.>%
   subjectID, diagnosis, probability) %.>%
 orderby(., subjectID)
dq %.>%
  op_diagram(.) %.>% 

rquery also includes a number of useful utilities (both as nodes and as functions).

quantile_cols(db, "d")
##   quantile_probability subjectID      surveyCategory assessmentTotal
## 1                 0.00         1 positive re-framing               2
## 2                 0.25         1 positive re-framing               2
## 3                 0.50         1 positive re-framing               3
## 4                 0.75         2 withdrawal behavior               4
## 5                 1.00         2 withdrawal behavior               5
##   irrelevantCol1 irrelevantCol2
## 1         irrel1         irrel2
## 2         irrel1         irrel2
## 3         irrel1         irrel2
## 4         irrel1         irrel2
## 5         irrel1         irrel2
rsummary(db, "d")
##            column index     class nrows nna nunique min max mean        sd
## 1       subjectID     1   integer     4   0      NA   1   2  1.5 0.5773503
## 2  surveyCategory     2 character     4   0       2  NA  NA   NA        NA
## 3 assessmentTotal     3   numeric     4   0      NA   2   5  3.5 1.2909944
## 4  irrelevantCol1     4 character     4   0       1  NA  NA   NA        NA
## 5  irrelevantCol2     5 character     4   0       1  NA  NA   NA        NA
##                lexmin              lexmax
## 1                <NA>                <NA>
## 2 positive re-framing withdrawal behavior
## 3                <NA>                <NA>
## 4              irrel1              irrel1
## 5              irrel2              irrel2
dq %.>% 
  quantile_node(.) %.>%
  execute(db, .)
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "rquery_ex_13074711364751711343_0000000000" does not exist
## LINE 1: SELECT * FROM "rquery_ex_13074711364751711343_0000000000" LI...
##                       ^
## )
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "qn_60505155281560537591_0000000000" does not exist
## LINE 1: SELECT * FROM "qn_60505155281560537591_0000000000" LIMIT 1
##                       ^
## )
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "qn_60505155281560537591_0000000001" does not exist
## LINE 1: SELECT * FROM "qn_60505155281560537591_0000000001" LIMIT 1
##                       ^
## )
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "qn_60505155281560537591_0000000001" does not exist
## LINE 1: SELECT * FROM "qn_60505155281560537591_0000000001" LIMIT 1
##                       ^
## )

##   quantile_probability subjectID           diagnosis probability
## 1                 0.00         1 positive re-framing   0.5589742
## 2                 0.25         1 positive re-framing   0.5589742
## 3                 0.50         1 positive re-framing   0.5589742
## 4                 0.75         2 withdrawal behavior   0.6706221
## 5                 1.00         2 withdrawal behavior   0.6706221
dq %.>% 
  rsummary_node(.) %.>%
  execute(db, .)
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "rquery_ex_37964716047819115422_0000000000" does not exist
## LINE 1: SELECT * FROM "rquery_ex_37964716047819115422_0000000000" LI...
##                       ^
## )
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "sn_96672569214126175096_0000000000" does not exist
## LINE 1: SELECT * FROM "sn_96672569214126175096_0000000000" LIMIT 1
##                       ^
## )
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "sn_96672569214126175096_0000000001" does not exist
## LINE 1: SELECT * FROM "sn_96672569214126175096_0000000001" LIMIT 1
##                       ^
## )
## Error in postgresqlExecStatement(conn, statement, ...) : 
##   RS-DBI driver: (could not Retrieve the result : ERROR:  relation "sn_96672569214126175096_0000000001" does not exist
## LINE 1: SELECT * FROM "sn_96672569214126175096_0000000001" LIMIT 1
##                       ^
## )

##        column index     class nrows nna nunique       min       max
## 1   subjectID     1   integer     2   0      NA 1.0000000 2.0000000
## 2   diagnosis     2 character     2   0       2        NA        NA
## 3 probability     3   numeric     2   0      NA 0.5589742 0.6706221
##        mean         sd              lexmin              lexmax
## 1 1.5000000 0.70710678                <NA>                <NA>
## 2        NA         NA positive re-framing withdrawal behavior
## 3 0.6147982 0.07894697                <NA>                <NA>

We have found most big-data projects either require joining very many tables (something rquery join planners help with, please see here and here) or they require working with wide data-marts (where rquery query narrowing helps, please see here).

We can also stand rquery up on non-DBI sources such as SparkR and also data.table. The data.table adapter is being developed in the rqdatatable package, and can be quite fast. Notice the examples in this mode all essentially use the same query pipeline, the user can choose where to apply it: in memory (data.table), in a DBI database (PostgreSQL, Sparklyr), and with even non-DBI systems (SparkR).

See also

For deeper dives into specific topics, please see also:


To install rquery please try install.packages("rquery").


rquery 1.3.2 2019/03/10

  • Don't override names() (messes up str()).
  • Test checks that are correct when rqdatatable is attached.
  • Add rquery substitution vingette.
  • Remove non-ascii chars from source file to fix CRAN warning

rquery 1.3.1 2019/02/14

  • Fix dimnames().
  • Add more argument checks.
  • Fix temp-naming in joins.
  • More tests.

rquery 1.3.0 2019/01/29

  • User controlled SQL-tree re-writing.
  • More per-DB SQL rendering control.
  • Move to tree-based parse representation (instead of flat).
  • relop_list collector interface.
  • Specialize materialze_node for lineage breaking.
  • %%/MOD().
  • Na row interfaces (un-publish tokenize_for_SQL()).
  • Add mutable scratch area on nodes.

rquery 1.2.1 2018/12/16

  • versions without _nse() suffix.
  • parse ^ power.
  • reflexive arguments to non-sql nodes.
  • Add .[]-variable deceleration notation.
  • Narrow suggests.
  • update docs.
  • add order_expr.
  • declare : as an inline operator.
  • allow concatenating compatible pipelines.
  • check for table name consistency.
  • NULL/NA parsing.
  • Force order limit to be >= 0.

rquery 1.2.0 2018/11/06

  • fix name lookup issue.
  • add bquote() abstraction to extend_nse(), project_nse(), and select_rows_nse().
  • fix column check on select_rows*
  • confirm DBI on tests.

rquery 1.1.1 2018/10/26

  • Allow a bit more flexibility on ordering in extend.
  • Start breaking up some recursive calls.

rquery 1.1.0 2018/09/20

  • Alternate data.table implementation path.
  • lookup_by_column().
  • Force parent.frame().

rquery 1.0.0 2018/09/10

  • Fix key_inspector_postgresql quoting issue.
  • More tests.
  • Export a default database description.
  • Check more on project and extend interfaces.

rquery 0.6.2 2018/08/14

  • fix in-memory order_by!
  • aggregate aliases.
  • try to clean up immediate mode hooks a bit.

rquery 0.6.1 2018/08/01

  • Change to immediate execution.
  • Substitute in values in presentation layer (rqdatatable depends on this).
  • Fix name disambiguation.
  • Better column dependency calculation.
  • Re-map function names.
  • Improve extend narrowing.
  • Start on schema qualification.
  • Get rid of %>>% and old dbi fn-names.
  • Try to improve use of regexps.
  • Add qlook().
  • Add affine_transform().
  • Documentation fixes.

rquery 0.5.0 2018/06/18

  • Make DBI suggested.
  • Rename dbi_ prefixes to rq_ (for now have aliases from old to new).
  • Per-connection options.
  • Add expand_grid()/complete_design().
  • Get rid of embedded data cases and print-execution.
  • Move to new wrapr ops.
  • Rationalize names of table description methods.
  • Add rquery_executor hook.
  • Don't default to setting up a temporary RSQLite database.
  • Limit wrapr re-exports.
  • Add ability to use %:=% in parsing.
  • Move to executor options.
  • Separate execute() and materialize() roles.
  • Fix natural_join() semantics and make "by" a required argument.

rquery 0.4.4 2018/05/14

  • Fix if_else NA treatment.
  • Fix orderby() limit bug.
  • Better detection of column types: dbi_coltypes().
  • Improve execute() performance in simple case.
  • materialize_node().
  • Minor documentation fixes.

rquery 0.4.3 2018/05/07

  • Add assign_slice(), if_else_op(), map_column_values(), and set_indicator().
  • Bug fixes (esp on data.frame path).
  • Work on printing/formatting.
  • Improve column requirement checking.
  • Improve column ordering.
  • Make db_table() more prominent.
  • Documentation improvements.

rquery 0.4.2 2018/04/05

  • Work around RPostgreSQL issue (dbTableExists does not work).
  • More controls on dbi_ steps through package options.
  • Push LIMIT deeper into pipelines.
  • Add connection config tools.

rquery 0.4.1 2018/03/17

  • Fix re-run/re-create situations.
  • Pipe database connection.
  • Better quantile calc (assume window functions).
  • Improve dependency decls.
  • Further limit direct dependencies.
  • Work with more DB drivers.
  • SQL node for sets of columns.

rquery 0.4.0 2018/03/10

  • Move to wrapr 1.2.0
  • Add wrapr_function.relop().
  • Column checks on extend.
  • Default sql_node to copying incoming columns.
  • NULL column ops.
  • rsummary().
  • "Non SQL" nodes.
  • New rquery_intro vignette.
  • Improved help examples.
  • Stricter argument checking.

rquery 0.3.1 2018/02/10

  • Be compatible with both wrapr 1.1.1 and 1.2.0.

rquery 0.3.0 2018/01/31

  • Fix "desc".
  • Minor format change in rename (anticipate wrapr 1.2.0).
  • Get ready for S3 wrapr_applicable and pipe_step (un-block wrapr 1.2.0 release).

rquery 0.2.0 2017/01/22

  • Don't store DB reference.
  • Prepare for wrapr 1.1.0 features.
  • Ad-hoc query modes.
  • Fix statement partitioner.
  • More flexibility on assignment notation.

rquery 0.1.0 2017/12/03

  • Initial experiment

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.


1.3.2 by John Mount, a month ago,

Report a bug at

Browse source code at

Authors: John Mount [aut, cre] , Win-Vector LLC [cph]

Documentation:   PDF Manual  

GPL-3 license

Imports wrapr, utils, stats, methods

Suggests DBI, RSQLite, rqdatatable, igraph, DiagrammeR, knitr, rmarkdown, RUnit

Imported by cdata.

Depended on by rqdatatable.

Suggested by vtreat.

See at CRAN