Read in 'SAS' Data ('.sas7bdat' Files) into 'Apache Spark'

Read in 'SAS' Data ('.sas7bdat' Files) into 'Apache Spark' from R. 'Apache Spark' is an open-source cluster computing framework available at <http://spark.apache.org>. This R package uses the 'spark-sas7bdat' 'Spark' package (<https://spark-packages.org/package/saurfang/spark-sas7bdat>) to import and process 'SAS' data in parallel using 'Spark', allowing 'dplyr' statements to be executed in parallel on top of 'SAS' data.


The spark.sas7bdat package allows R users working with Apache Spark to read SAS datasets in .sas7bdat format into Spark by using the spark-sas7bdat Spark package. This allows R users to

  • load a SAS dataset in parallel into a Spark table for further processing with the sparklyr package
  • process the full SAS dataset in parallel with dplyr statements, instead of having to import it into RAM (using the foreign/haven packages), thereby avoiding the memory problems of large imports

Example

The following example reads a file called iris.sas7bdat into a table called sas_example in Spark. Do try this with bigger data on your cluster, and see the help of the sparklyr package for how to connect to your Spark cluster.

library(sparklyr)
library(spark.sas7bdat)
# Path to the example SAS dataset shipped with the package
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")

# Connect to a local Spark instance and read the SAS file into a Spark table
sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
x

The resulting pointer to a Spark table can be used further in dplyr statements:

library(dplyr)
x %>% group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width))
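The aggregated result can be brought back into an R data frame with dplyr's collect(), and the Spark connection closed when you are done. A minimal sketch, continuing the example above:

library(dplyr)
# Trigger the computation in Spark and pull the summarised result into R
result <- x %>%
  group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width)) %>%
  collect()
result

# Close the connection to Spark when done
spark_disconnect(sc)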

Installation

Install the package from CRAN.

install.packages('spark.sas7bdat')

Or install the development version from GitHub.

devtools::install_github("bnosac/spark.sas7bdat", build_vignettes = TRUE)
vignette("spark_sas7bdat_examples", package = "spark.sas7bdat")

The package has been tested with Spark version 2.0.1 and Hadoop 2.7, which can be installed as follows.

library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
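To check which Spark distributions sparklyr has already installed locally, you can list them before connecting. A small sketch using sparklyr's spark_installed_versions():

library(sparklyr)
# List locally installed Spark distributions to verify that
# the Spark 2.0.1 / Hadoop 2.7 build is available
spark_installed_versions()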

Speed comparison

To compare with the read_sas function from the haven package, below we show a comparison on a small SAS dataset of 5,234,557 rows x 2 columns containing only numeric data. Processing is done on 8 cores. With the haven package you need to import the data into RAM; with the spark.sas7bdat package you can immediately execute dplyr statements on top of the SAS dataset.

mysasfile <- "/home/bnosac/Desktop/testdata.sas7bdat"
system.time(x <- spark_read_sas(sc, path = mysasfile, table = "testdata"))
   user  system elapsed 
  0.008   0.000   0.051 
system.time(x <- haven::read_sas(mysasfile))
   user  system elapsed 
  1.172   0.032   1.200 
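Note that Spark generally evaluates lazily: spark_read_sas registers the table, and the full scan of the SAS file is deferred until a computation is triggered, which is why the elapsed time above is so small. To time end-to-end processing one could force a computation; a sketch, assuming x is still the Spark table handle returned by spark_read_sas:

library(dplyr)
# Force Spark to scan the full SAS file by computing a row count
system.time(
  x %>% summarise(n = n()) %>% collect()
)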

Support in big data and Spark analysis

Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be

News

            CHANGES IN spark.sas7bdat VERSION 1.2

o    Use Authors@R in DESCRIPTION instead of Author

            CHANGES IN spark.sas7bdat VERSION 1.1

o    Use saurfang:spark-sas7bdat:1.1.4 in case of Scala version 2.10 and saurfang:spark-sas7bdat:1.1.5 for other Scala versions, allowing the package to be used with Spark 2.0.1

            CHANGES IN spark.sas7bdat VERSION 1.0

o    Initial version of the package


Package information

Version: 1.2
Authors: Jan Wijffels [aut, cre, cph], BNOSAC [cph]
License: GPL-3
Imports: sparklyr
Suggests: knitr
URL: https://github.com/bnosac/spark.sas7bdat
Source code mirror: https://github.com/cran/spark.sas7bdat