A recipe prepares your data for modeling. The package provides an extensible framework for pipeable sequences of feature engineering steps, which supply preprocessing tools to be applied to data. Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets. The resulting processed output can then be used as input for statistical or machine learning models.
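The estimate-then-apply workflow described above can be sketched as follows. This is a minimal illustration rather than a canonical usage pattern: the split of the built-in mtcars data into `train` and `test` is an arbitrary assumption made for the example, and the `new_data` argument name assumes a version of the package released after the renaming noted in this changelog (earlier versions use `newdata`).

```r
library(recipes)

# Hypothetical split of a built-in data set into an initial (training)
# set and a set of "other" data; the row ranges are arbitrary.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Define a pipeable sequence of steps; nothing is estimated yet.
rec <- recipe(mpg ~ ., data = train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# Estimate the centering and scaling parameters from the training set.
trained <- prep(rec, training = train)

# Apply the already-estimated parameters to other data.
processed <- bake(trained, new_data = test)
```

The `processed` object can then be passed to a modeling function; the means and standard deviations it was transformed with come from `train` only.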
Several argument names were changed to be consistent with other tidymodels packages (e.g. dials) and the general tidyverse naming conventions.
step_knnimpute was changed to
step_isomap had the number of neighbors promoted to a main argument called
nbagg out of the options and into a main argument
step_ns had degrees of freedom promoted to a main argument with name
degree promoted to a main argument.
juice and other functions had
new_data. For this version only, using
newdata will only result in a warning.
prep and a few steps had
add_role() can now only add new additional roles. To alter existing roles, use
update_role(). This change also allows for the possibility of having multiple roles/types for one variable. #221
All steps gain an
id field that will be used in the future to reference other steps.
The retain option to
prep is now defaulted to
verbose = TRUE, the approximate size of the data set is printed. #207
step_integer converts data to ordered integers similar to
LabelEncoder. #123 and #185
step_geodist can be used to calculate the distance between geocodes and a single reference location.
step_nnmf computes the non-negative matrix factorization for data.
prepper was moved to
step_string2factor will now accept factors and leave them as-is.
step_knnimpute now excludes missing data in the variable to be imputed from the nearest-neighbor calculation. Previously, this could result in some missing data not being imputed (i.e. another missing value was returned).
step_dummy now produces a warning (instead of failing) when non-factor columns are selected. Only factor columns are used; no conversion is done for character data. issue #186
dummy_names gained a separator argument. issue #183
seed arguments for more control over randomness.
broom is no longer used to get the
tidy generic. These are now contained in the
bake if variable range in new data is outside the range that was learned from the train set (contributed by Edwin Thoen)
step_lag can lag variables in the data set (contributed by Alex Hayes).
step_naomit removes rows with missing data for specific columns (contributed by Alex Hayes).
step_rollimpute can be used to impute data in a sequence or series by estimating their values within a moving window.
step_pls can conduct supervised feature extraction for predictors.
step_log gained an
step_log gained a
signed argument (contributed by Edwin Thoen).
The internal functions
printer have been exported to enable other packages to contain steps.
When training new steps after some steps have been previously trained, the
retain = TRUE option should be set on previous invocations of
one_hot = TRUE option. Thanks to Davis Vaughan.
contrast option was removed. The step uses the global option for contrasts.
step_other will now convert novel levels of the factor to the "other" level.
step_bin2factor now has an option to choose how the values are translated to the levels (contributed by Michael Levy).
juice can now export basic data frames.
okc data were updated with two additional columns.
Fixed issue 125 that prevented several steps from working with dplyr grouped data frames. (contributed by Jeffrey Arnold)
Fixed issue 127 where options to
step_discretize were not being passed to
Edwin Thoen suggested adding validation checks for certain data characteristics. This fed into the existing notion of expanding
recipes beyond steps (see the non-step steps project). A new set of operations, called
checks, can now be used. These should throw an informative error when the check conditions are not met and return the existing data otherwise.
Steps now have a
skip option that will not apply preprocessing when
bake is used. See the article on skipping steps for more information.
check_missing will validate that none of the specified variables contain missing data.
detect_step can be used to check if a recipe contains a particular preprocessing operation.
step_num2factor can be used to convert numeric data (especially integers) to factors.
step_novel adds a new factor level to nominal variables that will be used when new data contain a level that did not exist when the recipe was prepared.
step_profile can be used to generate design matrix grids for prediction profile plots of additive models where one variable is varied over a grid and all of the others are fixed at a single value.
step_upsample can be used to change the number of rows in the data based on the frequency distributions of a factor variable in the training set. By default, this operation is only applied to the training set;
bake ignores this operation.
step_naomit drops rows when specified columns contain
NA, similar to
step_lag allows for the creation of lagged predictor columns.
step_spatialsign now has the option of removing missing data prior to computing the norm.
bake was changed from
prep is now defaulted to
step_dummy was fixed so that the correct binary variables are generated regardless of the levels or values of the incoming factor. Also,
step_dummy now requires factor inputs.
step_dummy also has a new default naming function that works better for factors. However, there is an extra argument (
ordinal) now to the functions that can be passed to
step_interact now allows for selectors (e.g.
starts_with("prefix")) to be used in the interaction formula.
dplyr::one_of was added to the list of selectors.
step_bs adds B-spline basis functions.
step_unorder converts ordered factors to unordered factors.
step_count counts the number of instances that a pattern exists in a string.
step_factor2string can be used to move between encodings.
step_lowerimpute is for numeric data where the values cannot be measured below a specific value. For these cases, random uniform values are used for the truncated values.
tidy methods were added for recipes and many (but not all) steps.
bake.recipe, the argument
newdata is now without a default.
juice can now save the final processed data set in sparse format. Note that, as the steps are processed, a non-sparse data frame is used to store the results.
First CRAN release.
prepper issue #59
step_lincomb removes variables involved in linear combinations to resolve them.
step_regex applies a regular expression to a character or factor vector to create dummy variables.
step_interact do a better job of respecting missing values in the data set.
recipe objects was changed so that pipes can be used to create the recipe with a formula.
role argument in favor of a general set of selectors. If no selector is used, all the predictors are returned.