Automated Data Preparation

Do most of the painful data preparation for a data science project with a minimum amount of code. Takes advantage of data.table efficiency and uses some algorithmic tricks to perform data preparation in a time- and RAM-efficient way.


News

V 0.3.8

  • New features:
    • New features in existing functions:
      • Identification of bijection through the internal function fastIsBijection is much faster (up to 40 times faster when the columns are a bijection), so whichAreBijection and fastFilterVariables are also faster.
      • Removed remaining gc calls to save time.
      • In one_hot_encoder, added a type parameter to choose between logical or numerical results.
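
To make the notion of bijection between two columns concrete, here is a minimal base R sketch. The helper is_bijection is hypothetical and is not the package's internal fastIsBijection; it only shows the property being tested: every distinct value of one column maps to exactly one distinct value of the other, and vice versa.

```r
# Hypothetical helper for illustration; not the package's fastIsBijection.
is_bijection <- function(x, y) {
  stopifnot(length(x) == length(y))
  # Distinct (x, y) pairs observed in the data.
  pairs <- unique(data.frame(x = x, y = y, stringsAsFactors = FALSE))
  # In a bijection, the number of distinct pairs equals the number of
  # distinct values on each side.
  nrow(pairs) == length(unique(x)) && nrow(pairs) == length(unique(y))
}

one_to_one <- is_bijection(c("a", "b", "a"), c(1, 2, 1))
many_to_one <- is_bijection(c("a", "b", "c"), c(1, 2, 1))
```

In the first call each letter always pairs with the same number, so one column carries no information beyond the other; in the second call "a" and "c" both map to 1, so the relation is not a bijection.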

V 0.3.7

  • New features:

    • New functions:
      • Function as.POSIXct_fast is now available. It converts to POSIXct much faster when the same date value is present multiple times in the column.
    • New features in existing functions:
      • Date identification is faster: the search for a format is now computed only on unique values.
      • Date transformation is faster: it uses as.POSIXct_fast when necessary.
      • Functions findAndTransformDates, findAndTransformNumerics and unFactor now accept a cols argument to limit the search.
  • Bug fixes:

    • Check that the over-allocate option is activated on every data.table to avoid issues with set. The package should be more robust.
    • In bijection search (internal function fastIsBijection) there was a bug in some rare cases. It is fixed, but the function is slower.

  • Code quality:
    • Improving code quality using lintr
    • Suppressing some useless code
    • Meeting new covr standard
    • Improve log of setColAsXXX
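
The speed-up from converting only unique date values can be sketched in base R as follows. The helper as_posixct_via_unique is hypothetical and is not the package's as.POSIXct_fast; it assumes a single known format and a UTC time zone, and only illustrates the idea of parsing each distinct value once and mapping the results back.

```r
# Hypothetical helper for illustration; not the package's as.POSIXct_fast.
as_posixct_via_unique <- function(x, format, tz = "UTC") {
  unique_values <- unique(x)
  # The expensive parsing step runs once per distinct value.
  converted <- as.POSIXct(unique_values, format = format, tz = tz)
  # Map every original element back to its already-converted value.
  converted[match(x, unique_values)]
}

dates <- rep(c("2018-01-01", "2018-06-15"), times = 3)
result <- as_posixct_via_unique(dates, format = "%Y-%m-%d")
```

The fewer distinct values a column holds relative to its length, the more parsing work this approach saves.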

V 0.3.6

  • Bug fixes:

    • identifyDates had a bug. Solved.
  • Integration:

    • Making dataPreparation compatible with testthat 2.0.0

V 0.3.5

  • New features:

    • New features in existing functions:
      • findAndTransformDates now has an ambiguities parameter: IGNORE to work as before, WARN to check for ambiguities and print them, SOLVE to try to solve ambiguities on more lines.
      • one_hot_encoder now uses the build_encoding function to be able to build the same encoding on train and on test.
      • aggregateByKey is now much faster on numerics, but the way it receives input functions has changed.
      • fastScale now has a way parameter which allows you to either scale or unscale. Unscaling numeric values can be very useful for most post-model analysis.
      • setColAsDate now accepts multiple formats in a single call.
    • New functions:
      • build_encoding builds a list of encodings to be used by one_hot_encoder. It also has a min_frequency parameter to ensure that rare values don't result in new columns.
      • The previously private function identifyDates is now exported, so that the same transformation can be performed on train and on test.
      • Added dataPrepNews function to open the NEWS file (inspired by rfNews() from the randomForest package).
  • Bug fixes:

    • findAndTransformDates: bug fixed: user formats weren't used.
    • identifyDates: some formats were tested but would never work. They have been removed.
  • Refactoring:

    • Unit tests partly reviewed to be more readable and more efficient. Unit test time has been divided by 3.
    • Improving input control for more robust functions

WARNING:

  • one_hot_encoder now requires you to run build_encoding first.
  • aggregateByKey now requires functions to be passed by character name.

This version makes transformations reproducible (as much as possible) on train and test sets. This is to prepare a future pipeline feature.
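
Why the encoding is built in a separate step can be sketched in base R. The helpers build_levels and encode_one_hot below are hypothetical and do not reproduce the signatures or behaviour of build_encoding and one_hot_encoder; in particular, the min_frequency argument here is an assumed share-of-rows threshold. The point is only that the set of columns is learned once on train and reused unchanged on test.

```r
# Hypothetical helpers for illustration; not the package's functions.
build_levels <- function(values, min_frequency = 0) {
  # Share of rows taken by each distinct value.
  freq <- table(values) / length(values)
  # Keep only values frequent enough to deserve their own column.
  names(freq)[freq >= min_frequency]
}

encode_one_hot <- function(values, levels, prefix) {
  # One 0/1 indicator column per level learned beforehand.
  encoded <- vapply(levels, function(level) as.integer(values == level),
                    integer(length(values)))
  encoded <- matrix(encoded, nrow = length(values),
                    dimnames = list(NULL, paste(prefix, levels, sep = ".")))
  as.data.frame(encoded)
}

train <- c("red", "blue", "red", "green")
test <- c("blue", "purple")

color_levels <- build_levels(train)
train_encoded <- encode_one_hot(train, color_levels, prefix = "color")
test_encoded <- encode_one_hot(test, color_levels, prefix = "color")
```

Because the levels come from train only, "purple" in test does not create a new column: that row is simply all zeros, and both encoded sets have the same columns in the same order.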

V 0.3.4

  • Improvement of functions:
    • whichAreBijection: It is 2 to 15 times faster than the previous version.
    • whichAreIncluded: It is a bit faster.
  • Bug fixes:
    • generateFactorFromDate: default value was missing. Fixed.
  • New features:
    • New features in existing functions:
      • fastFilterVariables has a new parameter (level) to choose which types of filtering to perform

WARNING:

  • whichAreIncluded: in case of bijection (col1 is a bijection of col2), each column is included in the other, but the choice of the one to drop might have changed in this version.

V 0.3.3

  • New features:
    • New features in existing functions:
      • findAndTransformDates now recognizes date characters even if there are multiple separators in the date (e.g. "2016, Jan-26").
      • findAndTransformDates now recognizes date characters even if there are leading and trailing white spaces.

WARNING:

  • date3 column in messy_adult data set has changed in order to illustrate the recognition of date character even if there are leading and/or trailing white spaces.
  • date4 column in messy_adult data set has changed in order to illustrate the recognition of date characters even if there are multiple separators.

V 0.3.2

  • Change URLs to meet CRAN requirement

v 0.3.1

  • Fix bug in LaTeX documentation

v 0.3

  • New features:
    • New features in existing functions:
      • findAndTransformDates now recognizes date characters even if "0" is not present in the month or day part, and when the month is written in lowercase.
      • findAndTransformDates and setColAsDate now work with factors.
    • New functions:
      • fastDiscretization: to perform equal-frequency or equal-width discretization on a data set using data.table power.
      • fastScale: to perform scaling on a data set using data.table power.
      • one_hot_encoder: to perform one_hot encoding on a data set using data.table power.
    • New documentation:
      • A new vignette to illustrate how to build a correct train and test set using data preparation
  • Minor changes in log (in particular regarding progress bars and typos)
    • Due to dependency issues with tcltk, we stopped using it and started using progress
  • Refactoring:
    • Private function real_cols takes a bigger role in checking that columns have the correct types and in handling the "auto" value.
    • Making code faster: some functions are up to 30% faster
    • Review unit testing to be faster
    • Unit test evolution to be more readable

WARNING:

  • date1 column in messy_adult data set has changed in order to illustrate the recognition of date character even if "0" are not present in month or day part.
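
The difference between the equal-width and equal-frequency discretization mentioned for fastDiscretization can be sketched with base R's cut and quantile. This is an illustration on a made-up vector with two bins, not the package's implementation.

```r
# Made-up values with one outlier, to make the two strategies differ.
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
n_bins <- 2

# Equal width: every bin spans the same range of values.
width_breaks <- seq(min(values), max(values), length.out = n_bins + 1)
equal_width <- cut(values, breaks = width_breaks, include.lowest = TRUE)

# Equal frequency: breaks are quantiles, so bins hold similar row counts.
freq_breaks <- quantile(values, probs = seq(0, 1, length.out = n_bins + 1))
equal_freq <- cut(values, breaks = unique(freq_breaks), include.lowest = TRUE)

width_counts <- as.integer(table(equal_width))
freq_counts <- as.integer(table(equal_freq))
```

With the outlier at 100, equal-width bins put nine values in the first bin and one in the second, while equal-frequency bins split the rows five and five.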

v 0.2

  • Improving unit testing and code coverage
  • Improving documentation
  • Solving minor bug in date conversion and in which functions
  • New features:
    • New functions:

      • unFactor to unfactor columns, when reading wasn't performed in expected way.

      • sameShape to make sure that train and test sets have exactly the same shape.

      • generate new columns from existing columns (generate functions)

        • generate factor from dates: generateFactorFromDate
        • diffDates becomes generateDateDiffs (for better name understanding).
        • generate numerics and booleans from characters or factors (using generateFromFactor and generateFromCharacter)
      • setColAsFactor: a function to convert multiple columns to factors while controlling the number of unique elements

    • New features in existing functions:

      • which functions: add keep_cols argument to make sure that they are not dropped
      • fastFilterVariables: verbose can be T/F or 0, 1, 2 in order to control level of verbosity
      • findAndTransformDates and setColAsDate now recognize and accept timestamps.

WARNING:

  • If you were using diffDates, it is now called generateDateDiffs
  • date2 column in messy_adult data set has changed in order to illustrate new timestamp features
  • setColAsFactorOrLogical doesn't exist anymore: it has been split between setColAsFactor and generateFromCat
  • Considering all those changes: shapeSet and prepareSet don't give the same result anymore.

v 0.1: release on CRAN July 2017


install.packages("dataPreparation")

0.3.9 by Emmanuel-Lin Toulemonde, 2 months ago


Report a bug at https://github.com/ELToulemonde/dataPreparation/issues


Browse source code at https://github.com/cran/dataPreparation


Authors: Emmanuel-Lin Toulemonde [aut, cre]




GPL-3 | file LICENSE license


Imports data.table

Depends on lubridate, stringr, Matrix, progress

Suggests knitr, rmarkdown, kableExtra, pander, testthat

