Data Preprocessing

smartdata eases data preprocessing by providing a data flow based on a pipe operator, which simplifies cleansing, transformation, oversampling, and instance/feature selection operations.


Package that integrates preprocessing algorithms for oversampling, instance/feature selection, normalization, discretization, space transformation, and outliers/missing values/noise cleaning.


You can install smartdata from GitHub with:

# install.packages("devtools")

and load it into an R session with:

library("smartdata")



smartdata provides the following wrappers:

  • instance_selection
  • feature_selection
  • normalize
  • discretize
  • space_transformation
  • clean_outliers
  • impute_missing
  • clean_noise

To list the methods available for a given wrapper:

which_methods("instance_selection")
#> Possible methods are: 'CNN', 'ENN', 'multiedit', 'FRIS'

To get information about the parameters available for a method:

which_options("instance_selection", "multiedit")
#> For more information do: ?class::multiedit 
#> Parameters for multiedit are: 
#>   * k: Number of neighbors used in KNN 
#>        Default value: 1 
#>   * num_folds: Number of partitions the train set is split in 
#>                Default value: 3 
#>   * null_passes: Number of null passes to use in the algorithm 
#>                  Default value: 5

First let’s load a bunch of datasets:

data(iris0,  package = "imbalance")
data(ecoli1, package = "imbalance")
data(nhanes, package = "mice")


Oversampling

super_iris <- iris0 %>% oversample(method = "MWMOTE", ratio = 0.8, filtering = TRUE)
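
To judge the effect of oversampling, it can help to compare the class balance before and after the call. The following is a minimal sketch in base R, assuming that `ratio` refers to the proportion of minority to majority instances. `class_ratio` is a hypothetical helper, not part of smartdata, and the labels are toy data made up for illustration.

```r
# Hypothetical helper (not part of smartdata): size of the smallest class
# divided by the size of the largest class.
class_ratio <- function(labels) {
  counts <- table(labels)
  min(counts) / max(counts)
}

# Toy labels for illustration: 10 minority vs 40 majority instances
toy_labels <- factor(c(rep("positive", 10), rep("negative", 40)))

class_ratio(toy_labels)
#> [1] 0.25
```

The same helper could be applied to the class column of a dataset before and after oversampling to see how far the balance has moved.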

Instance selection

super_iris <- iris %>% instance_selection("multiedit", k = 3, num_folds = 2, 
                                          null_passes = 10, class_attr = "Species")

Feature selection

super_ecoli <- ecoli1 %>% feature_selection("Boruta", class_attr = "Class")


Normalization

super_iris <- iris %>% normalize("min_max", exclude = c("Sepal.Length", "Species"))
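
For intuition, min-max normalization is usually understood as rescaling each numeric column to the interval [0, 1]. The sketch below does this in base R while leaving the excluded columns untouched. It is an illustration under that assumption, not smartdata's actual implementation, and `min_max`, `excluded`, `to_scale` and `scaled_iris` are names chosen for this example only.

```r
# Rescale a numeric vector to [0, 1]: (x - min) / (max - min)
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Columns left as they are, mirroring the `exclude` argument above
excluded <- c("Sepal.Length", "Species")
to_scale <- setdiff(names(iris), excluded)

scaled_iris <- iris
scaled_iris[to_scale] <- lapply(iris[to_scale], min_max)

# Each rescaled column now spans [0, 1]
sapply(scaled_iris[to_scale], range)
```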


Discretization

super_iris <- iris %>% discretize("ameva", class_attr = "Species")

Space transformation

super_ecoli <- ecoli1 %>% space_transformation("lle_knn", k = 3, num_features = 2)


Outliers

super_iris <- iris %>% clean_outliers("multivariate", type = "adj")

Missing values

super_nhanes <- nhanes %>% impute_missing("gibbs_sampling")
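
A quick way to see what an imputation step has to fill in is to count the missing values per column. This is a minimal base R sketch on a toy data frame invented for illustration (`toy` and its columns are not part of any package); the same two calls could be run on a dataset before and after imputation.

```r
# Toy data frame with missing values (illustrative only)
toy <- data.frame(
  age = c(23, NA, 31, 45),
  bmi = c(NA, 22.5, NA, 27.1)
)

# Number of missing values per column
colSums(is.na(toy))
#> age bmi 
#>   1   2 

# Number of rows without any missing value
sum(complete.cases(toy))
#> [1] 1
```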


Noise

super_iris <- iris %>% clean_noise("hybrid", class_attr = "Species", 
                                   consensus = FALSE, action = "repair")


smartdata 1.0.2

  • Corrects a bug in the feature selection wrapper for the methods information_gain, gain_ratio and sym_uncertainty, which failed when the dataset had no categorical attribute apart from the class one.
  • Renames the num_attrs parameter to num_features in feature selection, to standardize parameter names with the space transformation wrapper.
  • Corrects a bug for lle space transformation: result was a matrix instead of a dataset.

smartdata 1.0.1

  • Corrects the titles of the vignettes so they appear correctly on CRAN
  • Corrects issue compiling one of the vignettes (issue regarding fancyvrb / xcolor with options)

smartdata 1.0.0

  • First release
  • Methods for instance selection, feature selection, normalization, discretization, space transformation, outlier cleaning, missing value imputation, and noise cleaning



1.0.3 by Ignacio Cordón, a year ago


Authors: Ignacio Cordón [aut, cre] , Francisco Charte [aut] , Julián Luengo [aut] , Salvador García [aut] , Francisco Herrera [aut]


License: GPL (>= 2) | file LICENSE

Imports: functional, checkmate, magrittr, infotheo, MVN, adaptiveGPCA, discretization, outliers, NoiseFiltersR, Boruta, FSelector, lle, unbalanced, RoughSets, class, clusterSim, Amelia, imbalance, DMwR, missForest, missMDA, denoiseR, VIM

Depends: mice

Suggests: testthat, mlbench, rpart, knitr, rmarkdown
