Data Preprocessing

Simplifies data preprocessing by providing a data flow built on the pipe operator, so that cleansing, transformation, oversampling, and instance/feature selection operations can be chained.



Package that integrates preprocessing algorithms for oversampling, instance/feature selection, normalization, discretization, space transformation, and outliers/missing values/noise cleaning.

Installation

You can install the development version of smartdata from GitHub with:

# install.packages("devtools")
devtools::install_github("ncordon/smartdata")

and load it into an R session with:

library("smartdata")

Examples

smartdata provides the following wrappers:

  • instance_selection
  • feature_selection
  • normalize
  • discretize
  • space_transformation
  • clean_outliers
  • impute_missing
  • clean_noise

To list the methods available for a given wrapper, call which_options with the wrapper name:

which_options("instance_selection")
#> Possible methods are: 'CNN', 'ENN', 'multiedit', 'FRIS'

To see the parameters a method accepts, pass the method name as a second argument:

which_options("instance_selection", "multiedit")
#> For more information do: ?class::multiedit 
#> Parameters for multiedit are: 
#>   * k: Number of neighbors used in KNN 
#>        Default value: 1 
#>   * num_folds: Number of partitions the train set is split in 
#>                Default value: 3 
#>   * null_passes: Number of null passes to use in the algorithm 
#>                  Default value: 5
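The same call can be repeated over every wrapper to get an overview of what is available. The sketch below is only a convenience loop: the `wrappers` vector is a hypothetical helper copied from the list above, and the call is skipped when smartdata is not installed.

```r
# Wrapper names copied from the list in this README (hypothetical helper vector).
wrappers <- c("instance_selection", "feature_selection", "normalize",
              "discretize", "space_transformation", "clean_outliers",
              "impute_missing", "clean_noise")

# Only query the package if it is actually installed.
if (requireNamespace("smartdata", quietly = TRUE)) {
  for (w in wrappers) {
    cat("==", w, "==\n")
    smartdata::which_options(w)
  }
}
```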

First, load the datasets used in the examples below (iris is already available in base R):

data(iris0,  package = "imbalance")
data(ecoli1, package = "imbalance")
data(nhanes, package = "mice")

Oversampling

super_iris <- iris0 %>% oversample(method = "MWMOTE", ratio = 0.8, filtering = TRUE)

Instance selection

super_iris <- iris %>% instance_selection("multiedit", k = 3, num_folds = 2, 
                                          null_passes = 10, class_attr = "Species")

Feature selection

super_ecoli <- ecoli1 %>% feature_selection("Boruta", class_attr = "Class")

Normalization

super_iris <- iris %>% normalize("min_max", exclude = c("Sepal.Length", "Species"))
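To make the effect of the exclude argument concrete, here is a base R sketch of conventional min-max scaling onto [0, 1]. It is not smartdata's implementation, and it assumes the usual (x - min) / (max - min) definition, which the package's "min_max" method may or may not match exactly. The names min_max and scaled_iris are illustrative.

```r
# Conventional min-max scaling: maps a numeric vector onto [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Columns left untouched, mirroring the exclude argument used above.
exclude <- c("Sepal.Length", "Species")
cols <- setdiff(names(iris), exclude)

scaled_iris <- iris
scaled_iris[cols] <- lapply(iris[cols], min_max)

# Show the (min, max) of each scaled column.
sapply(scaled_iris[cols], range)
```

The excluded columns keep their original values, which is useful when a column is a label or must stay in its natural units.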

Discretization

super_iris <- iris %>% discretize("ameva", class_attr = "Species")
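For readers new to discretization, the following base R sketch shows the general idea with a much simpler technique: unsupervised equal-width binning via cut. This is a stand-in for illustration only, not the supervised "ameva" method used above, and the bin labels are arbitrary.

```r
# Equal-width binning of one numeric column into three intervals.
# Note: this ignores the class attribute, unlike supervised methods.
bins <- cut(iris$Sepal.Length, breaks = 3, labels = c("low", "mid", "high"))

# Cross-tabulate the bins against the class to see how well they separate it.
table(bins, iris$Species)
```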

Space transformation

super_ecoli <- ecoli1 %>% space_transformation("lle_knn", k = 3, num_features = 2)

Outliers

super_iris <- iris %>% clean_outliers("multivariate", type = "adj")

Missing values

super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")

Noise

super_iris <- iris %>% clean_noise("hybrid", class_attr = "Species", 
                                   consensus = FALSE, action = "repair")


To install the released version from CRAN:

install.packages("smartdata")

1.0.2 by Ignacio Cordón, a month ago


http://github.com/ncordon/smartdata


Report a bug at http://github.com/ncordon/smartdata/issues


Browse source code at https://github.com/cran/smartdata


Authors: Ignacio Cordón [aut, cre] , Francisco Charte [aut] , Julián Luengo [aut] , Salvador García [aut] , Francisco Herrera [aut]




License: GPL (>= 2) | file LICENSE


Imports functional, checkmate, magrittr, infotheo, MVN, adaptiveGPCA, discretization, outliers, NoiseFiltersR, Boruta, FSelector, lle, unbalanced, RoughSets, class, clusterSim, Amelia, imbalance, DMwR, missForest, missMDA, denoiseR, VIM

Depends on mice

Suggests testthat, mlbench, rpart, knitr, rmarkdown

