Interface for 'XGBoost' on 'Apache Spark'

A 'sparklyr' <https://spark.rstudio.com/> extension that provides an interface for 'XGBoost' <https://github.com/dmlc/xgboost> on 'Apache Spark'. 'XGBoost' is an optimized distributed gradient boosting library.

Overview

sparkxgb is a sparklyr extension that provides an interface to XGBoost on Spark.

Installation

You can install sparkxgb from CRAN with:

install.packages("sparkxgb")

You can install the development version of sparkxgb with:

devtools::install_github("rstudio/sparkxgb")

Example

sparkxgb supports the familiar formula interface for specifying models:

library(sparkxgb)
library(sparklyr)
library(dplyr)
 
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris)
 
xgb_model <- xgboost_classifier(
  iris_tbl, 
  Species ~ .,
  num_class = 3,
  num_round = 50, 
  max_depth = 4
)
 
xgb_model %>%
  ml_predict(iris_tbl) %>%
  select(Species, predicted_label, starts_with("probability_")) %>%
  glimpse()
#> Observations: ??
#> Variables: 5
#> Database: spark_connection
#> $ Species                <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ predicted_label        <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ probability_versicolor <dbl> 0.003566429, 0.003564076, 0.003566429, 0.…
#> $ probability_virginica  <dbl> 0.001423170, 0.002082058, 0.001423170, 0.…
#> $ probability_setosa     <dbl> 0.9950104, 0.9943539, 0.9950104, 0.995010…
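
The same formula interface applies to regression through xgboost_regressor. The following is a minimal sketch, not output from a real run: the built-in mtcars data, the chosen predictors, and the parameter values are illustrative assumptions, and the connection is recreated only so the snippet stands on its own.

```r
library(sparkxgb)
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation; reuse an existing `sc` if you have one.
sc <- spark_connect(master = "local")

# Copy the built-in mtcars data to Spark (illustrative dataset choice).
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)

# Fit a gradient boosted regression model for fuel economy (mpg).
xgb_reg <- xgboost_regressor(
  mtcars_tbl,
  mpg ~ wt + hp + cyl,
  num_round = 50,
  max_depth = 4
)

# Compare observed and predicted values side by side.
xgb_reg %>%
  ml_predict(mtcars_tbl) %>%
  select(mpg, prediction) %>%
  glimpse()
```

For a regression model, ml_predict() adds a numeric prediction column instead of the predicted_label and probability_ columns shown for the classifier.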

sparkxgb also provides a Pipelines API, which means you can use an xgboost_classifier or xgboost_regressor in a pipeline like any other Estimator, and do things like hyperparameter tuning:

pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  xgboost_classifier(num_class = 3)
 
param_grid <- list(
  xgboost = list(
    max_depth = c(1, 5),
    num_round = c(10, 50)
  )
)
 
cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  evaluator = ml_multiclass_classification_evaluator(
    sc, 
    label_col = "label",
    prediction_col = "prediction"
  ),
  estimator_param_maps = param_grid
)
 
cv_model <- cv %>%
  ml_fit(iris_tbl)
 
summary(cv_model)
#> Summary for CrossValidatorModel 
#>             <cross_validator_89bd7c0fec9a> 
#> 
#> Tuned Pipeline
#>   with metric f1
#>   over 4 hyperparameter sets 
#>   via 3-fold cross validation
#> 
#> Estimator: Pipeline
#>            <pipeline_89bd45fd58cc> 
#> Evaluator: MulticlassClassificationEvaluator
#>            <multiclass_classification_evaluator_89bd3a1fc63d> 
#> 
#> Results Summary: 
#>          f1 num_round_1 max_depth_1
#> 1 0.9549670          10           1
#> 2 0.9674460          10           5
#> 3 0.9488665          50           1
#> 4 0.9613854          50           5
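
The four rows in the results summary correspond to every combination of the two max_depth values and the two num_round values in param_grid. A minimal base-R sketch of that expansion follows; it is for illustration only and is not how sparklyr builds its parameter maps internally, and the name `grid` is made up here.

```r
# Enumerate every combination of the candidate values from param_grid.
grid <- expand.grid(
  max_depth = c(1, 5),
  num_round = c(10, 50)
)

# 2 values x 2 values = 4 hyperparameter sets
nrow(grid)
```

Each additional candidate value multiplies the number of hyperparameter sets, and with 3-fold cross-validation every set is fitted once per fold, so grids grow expensive quickly.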

0.1.0 by Kevin Kuo, 8 months ago


Browse source code at https://github.com/cran/sparkxgb


Authors: Kevin Kuo [aut, cre]



License: Apache License (>= 2.0)


Imports: sparklyr, forge

Suggests: testthat, dplyr, purrr, rlang

System requirements: Apache Spark 2.3+