Data Preprocessing

Simplifies data preprocessing by providing a pipe-based data flow for cleansing, transformation, oversampling, and instance/feature selection operations.



Package that integrates preprocessing algorithms for oversampling, instance/feature selection, normalization, discretization, space transformation, and outliers/missing values/noise cleaning.

Installation

You can install smartdata from GitHub with:

# install.packages("devtools")
devtools::install_github("ncordon/smartdata")

and load it into an R session with:

library("smartdata")

Examples

smartdata provides the following wrappers:

  • instance_selection
  • feature_selection
  • normalize
  • discretize
  • space_transformation
  • clean_outliers
  • impute_missing
  • clean_noise

To list the methods available for a given wrapper:

which_options("instance_selection")
#> Possible methods are: 'CNN', 'ENN', 'multiedit', 'FRIS'

To get information about the parameters available for a method:

which_options("instance_selection", "multiedit")
#> For more information do: ?class::multiedit 
#> Parameters for multiedit are: 
#>   * k: Number of neighbors used in KNN 
#>        Default value: 1 
#>   * num_folds: Number of partitions the train set is split in 
#>                Default value: 3 
#>   * null_passes: Number of null passes to use in the algorithm 
#>                  Default value: 5

First, let's load a few datasets:

data(iris0,  package = "imbalance")
data(ecoli1, package = "imbalance")
data(nhanes, package = "mice")
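
The examples chain calls with the %>% pipe, which passes the value on its left as the first argument of the call on its right. The following is a minimal sketch of that equivalence, assuming the magrittr package is installed (smartdata lists it under Imports). It uses only base R functions on the built-in iris data, not any smartdata wrapper:

```r
library(magrittr)  # provides the %>% operator

# Piped form: each step receives the previous result as its first argument.
piped <- iris %>%
  subset(Species == "setosa") %>%
  head(3)

# Equivalent nested form without the pipe.
direct <- head(subset(iris, Species == "setosa"), 3)

identical(piped, direct)  # TRUE
```

The piped form reads in the order the steps are applied, which is why it suits multi-step preprocessing.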

Oversampling

super_iris <- iris0 %>% oversample(method = "MWMOTE", ratio = 0.8, filtering = TRUE)

Instance selection

super_iris <- iris %>% instance_selection("multiedit", k = 3, num_folds = 2, 
                                          null_passes = 10, class_attr = "Species")

Feature selection

super_ecoli <- ecoli1 %>% feature_selection("Boruta", class_attr = "Class")

Normalization

super_iris <- iris %>% normalize("min_max", exclude = c("Sepal.Length", "Species"))
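
To make the effect of min-max scaling concrete, here is a base R sketch of the textbook formula (x - min) / (max - min), applied to every numeric column not listed in exclude. The helper min_max_sketch is hypothetical and written only for illustration. It is not smartdata's implementation, and its handling of edge cases may differ from the package's.

```r
# Hypothetical helper for illustration; not part of smartdata.
min_max_sketch <- function(data, exclude = character()) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  for (col in setdiff(numeric_cols, exclude)) {
    rng <- range(data[[col]], na.rm = TRUE)
    # Rescale to [0, 1]; a constant column (max == min) is not handled here.
    data[[col]] <- (data[[col]] - rng[1]) / (rng[2] - rng[1])
  }
  data
}

scaled <- min_max_sketch(iris, exclude = c("Sepal.Length", "Species"))

range(scaled$Petal.Width)                          # 0 1
identical(scaled$Sepal.Length, iris$Sepal.Length)  # TRUE: excluded column is untouched
```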

Discretization

super_iris <- iris %>% discretize("ameva", class_attr = "Species")

Space transformation

super_ecoli <- ecoli1 %>% space_transformation("lle_knn", k = 3, num_features = 2)

Outliers

super_iris <- iris %>% clean_outliers("multivariate", type = "adj")

Missing values

super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")

Noise

super_iris <- iris %>% clean_noise("hybrid", class_attr = "Species", 
                                   consensus = FALSE, action = "repair")

News

smartdata 1.0.2

  • Fixes a bug in the instance selection wrapper for the methods information_gain, gain_ratio and sym_uncertainty, which failed when the dataset had no categorical attribute other than the class attribute.
  • Renames the num_attrs parameter to num_features in feature selection, for consistency with the space transformation wrapper.
  • Corrects a bug for lle space transformation: result was a matrix instead of a dataset.

smartdata 1.0.1

  • Corrects the titles of the vignettes so they appear correctly on CRAN
  • Corrects issue compiling one of the vignettes (issue regarding fancyvrb / xcolor with options)

smartdata 1.0.0

  • First release
  • Methods for instance selection, feature selection, normalization, discretization, space transformation, clean outliers, impute missing values or clean noise instances


To install the released version from CRAN:

install.packages("smartdata")

1.0.2 by Ignacio Cordón, 3 months ago


http://github.com/ncordon/smartdata


Report a bug at http://github.com/ncordon/smartdata/issues


Browse source code at https://github.com/cran/smartdata


Authors: Ignacio Cordón [aut, cre] , Francisco Charte [aut] , Julián Luengo [aut] , Salvador García [aut] , Francisco Herrera [aut]




License: GPL (>= 2) | file LICENSE


Imports: functional, checkmate, magrittr, infotheo, MVN, adaptiveGPCA, discretization, outliers, NoiseFiltersR, Boruta, FSelector, lle, unbalanced, RoughSets, class, clusterSim, Amelia, imbalance, DMwR, missForest, missMDA, denoiseR, VIM

Depends: mice

Suggests: testthat, mlbench, rpart, knitr, rmarkdown

