Google 'BigQuery' Support for 'sparklyr'

A 'sparklyr' extension package providing an integration with Google 'BigQuery'. It supports direct import/export where records are directly streamed from/to 'BigQuery'. In addition, data may be imported/exported via intermediate data extracts on Google 'Cloud Storage'.


CRAN_Status_Badge

sparkbq is a sparklyr extension package providing an integration with Google BigQuery. It builds on top of spark-bigquery, which provides a Google BigQuery data source to Apache Spark.

Version Information

You can install the released version of sparkbq from CRAN via

install.packages("sparkbq")

or the latest development version through

devtools::install_github("miraisolutions/sparkbq", ref = "develop")

The following table provides an overview over supported versions of Apache Spark, Scala, and Google Dataproc:

sparkbq spark-bigquery Apache Spark Scala Google Dataproc
0.1.x 0.1.0 2.2.x and 2.3.x 2.11 1.2.x and 1.3.x

sparkbq is based on the Spark package spark-bigquery which is available in a separate GitHub repository.

Example Usage

library(sparklyr)
library(sparkbq)
library(dplyr)
 
config <- spark_config()
 
sc <- spark_connect(master = "local[*]", config = config)
 
# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)
 
# Reading the public shakespeare data table
https://cloud.google.com/bigquery/public-data/
https://cloud.google.com/bigquery/sample-tables
hamlet <- 
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!
  
# Retrieve results into a local tibble
hamlet %>% collect()
 
# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")

Authentication

When running outside of Google Cloud it is necessary to specify a service account JSON key file. Information on how to generate service account credentials can be found at https://cloud.google.com/storage/docs/authentication#service_accounts. The service account key file can either be passed as parameter serviceAccountKeyFile to bigquery_defaults or directly to spark_read_bigquery and spark_write_bigquery. Alternatively, an environment variable export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json can be set (see https://cloud.google.com/docs/authentication/getting-started for more information). When running on Google Cloud, e.g. Google Cloud Dataproc, application default credentials (ADC) may be used in which case it is not necessary to specify a service account key file.

Further Information

News

0.1.0 2018-07-31

  • Initial Release

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("sparkbq")

0.1.0 by Martin Studer, a year ago


http://www.mirai-solutions.com, https://github.com/miraisolutions/sparkbq


Report a bug at https://github.com/miraisolutions/sparkbq/issues


Browse source code at https://github.com/cran/sparkbq


Authors: Mirai Solutions GmbH [aut] , Martin Studer [cre] , Nicola Lambiase [ctb] , Omer Demirel [ctb]


Documentation:   PDF Manual  


Task views: Web Technologies and Services


GPL-3 | file LICENSE license


Imports sparklyr

Suggests dplyr

System requirements: Spark (>= 2.2.x)


See at CRAN