Automate Data Exploration and Treatment

Automated data exploration process for analytic tasks and predictive modeling, so that users can focus on understanding data and extracting insights. The package scans and analyzes each variable, and visualizes them with typical graphical techniques. Common data processing methods are also available to treat and format data.



master v0.8.0


develop v0.8.0.9000



Background

Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis and predictive modeling. During this process, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")

If you would like to install the latest development version, you may install the develop branch.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")

Examples

The package is easy to use: almost everything can be done in one line of code. Please refer to the package manuals and vignettes for more information.

Report

To get a report for the airquality dataset:

library(DataExplorer)
create_report(airquality)

To get a report for the diamonds dataset with response variable price:

library(ggplot2)
create_report(diamonds, y = "price")
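
The release notes further down mention that report content can be customized with configure_report and that the report title can be set in create_report. A minimal sketch of how that might look is given here. The argument names (add_plot_prcomp, add_plot_density, report_title, output_file, output_dir) are assumed to match the version 0.8.0 documentation, and the file name and title are made up for illustration. Rendering needs pandoc, so the call is guarded.

```r
library(DataExplorer)

# Build a report configuration (assumed argument names from the 0.8.0 docs):
# skip the principal component analysis section and turn on density plots.
config <- configure_report(
  add_plot_prcomp = FALSE,
  add_plot_density = TRUE
)

# Rendering requires pandoc, so only call create_report when it is available.
# The output file name and report title below are illustrative.
if (rmarkdown::pandoc_available()) {
  create_report(
    airquality,
    output_file = "airquality_report.html",
    output_dir = tempdir(),
    report_title = "Air Quality Profiling Report",
    config = config
  )
}
```

Writing to tempdir() keeps the working directory clean while experimenting; drop output_dir to write the report to the current directory instead.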

Visualization

You may also run all the plotting functions individually for your analysis, e.g.,

## View basic description for airquality data
introduce(airquality)
plot_intro(airquality)
 
## View missing value distribution for airquality data
plot_missing(airquality)
 
## View distribution of all discrete variables
plot_bar(diamonds)
plot_bar(diamonds, with = "price")
 
## View distribution of all continuous variables
plot_histogram(diamonds)
plot_density(diamonds)
 
## View quantile-quantile plot of all continuous variables
plot_qq(diamonds)
plot_qq(diamonds, by = "cut")
 
## View overall correlation heatmap
plot_correlation(diamonds)
 
## View bivariate continuous distribution based on `cut`
plot_boxplot(diamonds, by = "cut")
    
## Scatterplot `price` with all other continuous features
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)
 
## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)

Feature Engineering

To make quick updates to your data:

## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)
 
## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)
 
## Dummify diamonds dataset
dummify(diamonds)
dummify(diamonds, select = "cut")
 
## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))
 
## Update columns
update_columns(airquality, c("Month", "Day"), as.factor)
update_columns(airquality, 1L, function(x) x^2)
 
## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")
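
The functions above can be combined. As a sketch, the missing value profile returned by profile_missing (listed in the 0.7.0 release notes) can drive drop_columns to remove sparse columns. The 20% cutoff is an arbitrary value chosen for illustration, and the column names feature and pct_missing are assumed from the package documentation.

```r
library(DataExplorer)

# Profile missing values: one row per column of the input data.
missing_profile <- profile_missing(airquality)

# Hypothetical cutoff: treat columns with more than 20% missing values as too sparse.
cutoff <- 0.2
too_sparse <- as.character(
  missing_profile$feature[missing_profile$pct_missing > cutoff]
)

# Drop the sparse columns by name; keep the data unchanged if none qualify.
cleaned <- if (length(too_sparse) > 0) {
  drop_columns(airquality, too_sparse)
} else {
  airquality
}
names(cleaned)
```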

Articles

See article wiki page.

News

DataExplorer 0.8.0

New Features

  • #92: Added update_columns to transform any selected columns.

Enhancements

  • #87: Added configure_report function to customize report content.
  • #89: Added option to customize geom_text and geom_label arguments.
  • #91: create_report now displays full report directory after completion.
  • #95: Added better exception handling for plot_bar.
  • #98: Added band customization to plot_missing.
  • #100: Switched geom_text to geom_label.
  • #103: Report title can now be customized in create_report.
  • #108: Added option to treat binary features as discrete in plot_bar, plot_histogram, plot_density and plot_boxplot.
  • Updated d3.min.js to v5.9.2.

Bug Fixes

  • #88: Added plot_intro to report config.
  • #90: Added first plot in plot_prcomp to output and page_0.
  • #94: Fixed typo for PCA.

DataExplorer 0.7.1

Enhancements

  • #86: Replaced gridExtra::grid.arrange with facets.
  • Added seeds to vignette and README for reproducible examples.
  • Hid all internal functions.

DataExplorer 0.7.0

New Features

  • #72: Added plot_qq for QQ plot.
  • #76: Added plot_intro to visualize results of introduce.

Enhancements

  • #42: Applied S3 methods for plotting functions.
  • #77: dummify now works on selected columns.
  • #78: All ggplot objects from plot_* are now invisibly returned. As a result, extracted profile_missing from plot_missing for missing value profiles.
  • #83: Removed all deprecated functions.
  • #85: Users can now specify number of rows/columns for plot page layout.
  • plot_prcomp now passes scale. = TRUE to prcomp by default.
  • Added sampled_rows argument to plot_scatterplot.
  • Added option to parallelize plot object construction.
  • Updated default config for create_report.

Bug Fixes

  • #74: Fixed a bug causing create_report failure due to zero complete rows.
  • #75: Fixed a bug in plot_str when plotting data.frame with more than 100 columns.
  • #82: Removed hard-coded scales from all plot functions.
  • Fixed a bug causing wrong column indices in split_columns.
  • Fixed a bug using standard deviation instead of variance in plot_prcomp.

DataExplorer 0.6.1

Enhancements

  • Updated vignette for better clarity.
  • #71: Added better error handler for plot_prcomp.

Bug Fixes

  • #69: Fixed bug causing create_report failure (specifically from plot_prcomp) when y is specified.
  • Added more unit tests for create_report and plot_prcomp.

DataExplorer 0.6.0

New Features

  • #15: Added plot_prcomp to visualize principal component analysis.
  • #54: Extracted dummify from plot_correlation as a new function.
  • #59: Added introduce for basic metadata.

Enhancements

  • #41: create_report can now be customized.
  • #53: Added page number for plots that span multiple pages.
  • #56: Added support for theme and customization for individual components.
  • #62: plot_bar now supports optional measures (in addition to categorical frequency) using argument with.
  • #66: Feature engineering functions now work on other classes in addition to data.table.
  • plot_missing:
    • Percentage text labels in the output plot now have 2 decimals to prevent small percentages from being truncated to 0%.
    • Added example to quickly drop columns with too many missing values.
  • Added .ignoreCat and .getAllMissing to helper.

Bug Fixes

  • #55: Fixed bugs and updated vignette with latest functions.
  • #57: Fixed plot_str bug for not supporting S4 objects.
  • #63: Fixed plot_histogram and plot_density not working with column names containing spaces.

DataExplorer 0.5.0

New Features

  • #48: Added plot_scatterplot to visualize relationship of one feature against all others.
  • #50: Added plot_boxplot to visualize continuous distributions broken down by another feature.

Enhancements

  • #44: Added option to exclude categories in group_category.
  • #45: Added title option for all plots.
  • #46: Added option to exclude columns in set_missing.
  • #49 [Breaking Change]: Switched package to tidyverse style. All old functions are in .Deprecated mode. List of name changes in alphabetical order:
    • BarDiscrete -> plot_bar
    • CollapseCategory -> group_category
    • CorrelationContinuous -> plot_correlation(..., type = "continuous")
    • CorrelationDiscrete -> plot_correlation(..., type = "discrete")
    • DensityContinuous -> plot_density
    • DropVar -> drop_columns
    • GenerateReport -> create_report
    • HistogramContinuous -> plot_histogram
    • PlotMissing -> plot_missing
    • PlotStr -> plot_str
    • SetNaTo -> set_missing
    • SplitColType -> split_columns
  • #52: Combined CorrelationContinuous and CorrelationDiscrete into one function, and added option to view correlation of all features at once.
  • Optimized layout for multiple plots.

Bug Fixes

  • #47: Fixed color scale for correlation heatmap.

DataExplorer 0.4.0

New Features

  • #33: Added PlotStr to visualize data structure.
  • #40: Added network graph to GenerateReport.

Bug Fixes

  • #32: Fixed pandoc requirement error in unit test on CRAN.
  • #34: Fixed error message when quiet is not supplied. In addition, the report directory is printed through message() instead of cat().
  • #35: Fixed rprojroot not found error.

Enhancements

  • #12: Added vignette: dataexplorer-intro.
  • #36: Fixed warnings from data.table in DropVar.
  • #37: Changed all cat() to message().
  • #38: Added option to order bars in BarDiscrete.
  • #39: Extended SetNaTo to discrete features.
  • Added more examples to README.md.

DataExplorer 0.3.0

New Features

  • #25: Added SetNaTo to quickly reset missing numerical values.
  • #29: Added DropVar to quickly drop variables by either name or column position.

Bug Fixes

  • #24: CorrelationDiscrete now displays all factor levels instead of full rank matrix from model.matrix.

Enhancements

  • #11: Functions with return values will now match the input class and set it back.
  • #22: Added documentation for num_all_missing in SplitColType.
  • #23: Added additional measures (in addition to frequency) to CollapseCategory.
  • #26: Removed density estimation section from report template.
  • #31: Added flexibility to name the new category in CollapseCategory.

Other notes

  • #30: In CollapseCategory, update = TRUE will only work with input data as data.table. However, it is still possible to view the frequency distribution with any input data class, as long as update = FALSE.

DataExplorer 0.2.6

Bug Fixes

  • #20: Fixed permission denied bug due to intermediates_dir argument in rmarkdown::render.

Enhancements

  • #16: Improved handling of missing values.

DataExplorer 0.2.5

Bug Fixes

  • #18: GenerateReport now handles data without discrete or continuous features.

Enhancements

  • #14: Updated rmarkdown template for GenerateReport.
  • #1: Features with all NA values will be ignored in BarDiscrete.

DataExplorer 0.2.4

Bug Fixes

  • Fixed a major bug in GenerateReport function due to package renaming.

Enhancements

  • GenerateReport will now print the directory of the report to console.

DataExplorer 0.2.3

New Features

  • Added function CollapseCategory to collapse sparse categories for discrete features.
  • Added correlation heatmap for both continuous and discrete features.
  • Added density plot for continuous features.

Bug Fixes

  • Fixed a bug in BarDiscrete and CorrelationDiscrete for not plotting non-factor class.
  • Minor changes for CRAN re-submission.

Enhancements

  • Changed grid layout for BarDiscrete and HistogramContinuous.
  • Features with all missing values will be ignored.
  • Switched position between continuous and discrete features in report template.
  • Renamed package to DataExplorer.
  • Added NEWS.md.
  • Removed BoxplotContinuous.


0.8.0 by Boxuan Cui, 9 months ago


http://boxuancui.github.io/DataExplorer/


Report a bug at https://github.com/boxuancui/DataExplorer/issues


Browse source code at https://github.com/cran/DataExplorer


Authors: Boxuan Cui [aut, cre]




MIT + file LICENSE license


Imports data.table, reshape2, scales, ggplot2, gridExtra, rmarkdown, networkD3, stats, utils, tools, parallel

Suggests testthat, covr, knitr, jsonlite, nycflights13

System requirements: pandoc (>= 1.12.3) - http://pandoc.org

