Read in 'SAS' Data ('.sas7bdat' Files) into 'Apache Spark' from R. 'Apache Spark' is an open source cluster computing framework available at <http://spark.apache.org>. This R package uses the 'spark-sas7bdat' 'Spark' package (<https://spark-packages.org/package/saurfang/spark-sas7bdat>) to import and process 'SAS' data in parallel using 'Spark'. This allows executing 'dplyr' statements in parallel on top of 'SAS' data.
The spark.sas7bdat package allows R users working with Apache Spark to read SAS datasets in .sas7bdat format into Spark by using the spark-sas7bdat Spark package. This allows R users to process large SAS data in parallel and to run dplyr statements directly on top of it.
The following example reads the file iris.sas7bdat into a Spark table called sas_example. Do try this with bigger data on your own cluster and consult the documentation of the sparklyr package on how to connect to your Spark cluster.
library(sparklyr)
library(spark.sas7bdat)
# example SAS file shipped with the package
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
# connect to a local Spark instance and read the SAS file into a Spark table
sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
x
The resulting handle to the Spark table can then be used in dplyr statements.
library(dplyr)
x %>%
  group_by(Species) %>%
  summarise(count = n(),
            length = mean(Sepal_Length),
            width = mean(Sepal_Width))
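If you want the aggregated result back as a local R data frame, add a collect() call at the end of the pipeline. A minimal sketch (collect() is the standard dplyr verb for pulling a Spark table into R; the names iris_summary and the reduced set of summary columns are just illustrative):

library(dplyr)
# aggregate on the Spark side, then pull the (small) result into the R session
iris_summary <- x %>%
  group_by(Species) %>%
  summarise(count = n()) %>%
  collect()
iris_summary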
Install the package from CRAN.
install.packages('spark.sas7bdat')
Or install the development version from GitHub.
devtools::install_github("bnosac/spark.sas7bdat", build_vignettes = TRUE)
vignette("spark_sas7bdat_examples", package = "spark.sas7bdat")
The package has been tested with Spark version 2.0.1 and Hadoop 2.7.
library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
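Once that Spark version is installed locally, you can connect to it by passing the same version to spark_connect(). A short sketch using sparklyr's documented arguments (master = "local" assumes a single-machine setup as in the example above):

library(sparklyr)
# connect to the locally installed Spark 2.0.1
sc <- spark_connect(master = "local", version = "2.0.1")
# spark_disconnect(sc)  # disconnect when done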
To compare the functionality with the read_sas function from the haven package, below we show timings on a small SAS dataset of 5,234,557 rows x 2 columns containing only numeric data. Processing is done on 8 cores. With the haven package you need to import the data into RAM, whereas with the spark.sas7bdat package you can immediately execute dplyr statements on top of the SAS dataset.
mysasfile <- "/home/bnosac/Desktop/testdata.sas7bdat"system.time(x <- spark_read_sas(sc, path = mysasfile, table = "testdata")) user system elapsed 0.008 0.000 0.051 system.time(x <- haven::read_sas(mysasfile)) user system elapsed 1.172 0.032 1.200
Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be
CHANGES IN spark.sas7bdat VERSION 1.2
o Use Authors@R in DESCRIPTION instead of Author
CHANGES IN spark.sas7bdat VERSION 1.1
o Use saurfang:spark-sas7bdat:1.1.4 in case of Scala version 2.10 and saurfang:spark-sas7bdat:1.1.5 for other Scala versions, allowing the package to be used with Spark 2.0.1
CHANGES IN spark.sas7bdat VERSION 1.0
o Initial version of the package