Data Explorer

DataExplorer automates the data exploration process for data analysis and model building, so that users can focus on understanding data and extracting insights. The package automatically scans through each variable and profiles the data. Typical graphical techniques are applied to both discrete and continuous features.


Current version: v0.5.0 on both the master and develop branches.


Background

Exploratory Data Analysis (EDA) is an initial and important phase of data analysis. In this phase, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

However, the latest stable version (if any) can be found on GitHub and installed using the remotes package.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("boxuancui/DataExplorer")

If you would like to install the latest development version, you may install from the develop branch.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("boxuancui/DataExplorer", ref = "develop")
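
Before reinstalling, it can be handy to check whether the package is already available and which version is installed. The following is a minimal sketch using only base R functions; nothing in it is specific to DataExplorer beyond the package name.

```r
# Check for an existing installation without attaching the package.
pkg <- "DataExplorer"
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  # packageVersion() reads the version of the installed package.
  message(pkg, " ", as.character(utils::packageVersion(pkg)), " is installed.")
} else {
  message(pkg, " is not installed; use one of the commands above.")
}
```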

Examples

The package is easy to use: almost everything can be done in one line of code. Please refer to the package manual for more information. You may also consult the package vignettes (e.g., dataexplorer-intro).

Create data profiling report

To get a report for the airquality dataset:

library(DataExplorer)
create_report(airquality)

To get a report for the diamonds dataset from the ggplot2 package:

library(DataExplorer)
library(ggplot2)
create_report(diamonds)
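
To keep generated reports out of the project folder, the report location can be set explicitly. The sketch below assumes that create_report accepts output_file and output_dir arguments (an assumption; check ?create_report for your installed version). The guards are added for illustration only and skip the call when DataExplorer, rmarkdown, or a suitable pandoc is unavailable.

```r
# Write the report into a temporary folder instead of the project directory.
report_dir <- file.path(tempdir(), "eda_reports")
dir.create(report_dir, showWarnings = FALSE, recursive = TRUE)

# Guards (illustrative): rendering needs DataExplorer, rmarkdown and pandoc.
can_render <- requireNamespace("DataExplorer", quietly = TRUE) &&
  requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available("1.12.3")

if (can_render) {
  # output_file and output_dir are assumed argument names.
  DataExplorer::create_report(
    airquality,
    output_file = "airquality_report.html",
    output_dir = report_dir
  )
}
```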

Visualize various distributions

You may also run all the plotting functions individually for your analysis, e.g.,

library(DataExplorer)
library(ggplot2)

## View distribution of all discrete variables
plot_bar(diamonds)
## View distribution of cut only
plot_bar(diamonds$cut)
## View correlation of all discrete variables
plot_correlation(diamonds, type = "discrete")

## View distribution of all continuous variables
plot_histogram(diamonds)
## View distribution of carat only
plot_histogram(diamonds$carat)
## View correlation of all continuous variables
plot_correlation(diamonds, type = "continuous")

## View overall correlation heatmap
plot_correlation(diamonds)

## View distribution of missing values for airquality data
missing_data <- plot_missing(airquality) # missing data profile will be returned
missing_data
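
For readers who want to see the kind of numbers behind a missing-value chart, here is a rough base R sketch that counts missing values per column of airquality. It is not the package's implementation, and the column names of the result (feature, num_missing, pct_missing) are chosen here for illustration.

```r
# Per-column missing-value profile for the built-in airquality data.
aq <- datasets::airquality
num_missing <- vapply(aq, function(col) sum(is.na(col)), integer(1))

missing_profile <- data.frame(
  feature = names(num_missing),
  num_missing = unname(num_missing),
  pct_missing = unname(num_missing) / nrow(aq)
)

# Sort so the most incomplete features come first.
missing_profile <- missing_profile[order(-missing_profile$num_missing), ]
print(missing_profile)
```

In this dataset only Ozone and Solar.R contain missing values, so they appear at the top of the table.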

Slice and dice your data

To visualize distributions based on another variable, you may do the following:

library(DataExplorer)

## View iris continuous distribution based on each Species
plot_boxplot(iris, "Species")

## View iris continuous distribution based on different buckets of Sepal.Length
plot_boxplot(iris, "Sepal.Length")

## Scatterplot Ozone against all other airquality features
# Set some features to factor
for (i in c("Month", "Day")) airquality[[i]] <- as.factor(airquality[[i]])
# Plot scatterplot
# Note: discrete and continuous charts are plotted on separate pages!
plot_scatterplot(airquality, "Ozone")
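
When the grouping variable is continuous, as with Sepal.Length above, it has to be cut into buckets before distributions can be compared per group. A rough base R sketch of that idea follows. The choice of five equal-width buckets via cut() is an assumption for illustration and may not match the binning plot_boxplot uses.

```r
# Bucket a continuous variable into 5 equal-width intervals.
buckets <- cut(iris$Sepal.Length, breaks = 5)

# Number of observations per bucket.
print(table(buckets))

# Median Petal.Length within each Sepal.Length bucket,
# i.e. the centre line a boxplot would draw for each group.
bucket_medians <- tapply(iris$Petal.Length, buckets, median)
print(bucket_medians)
```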

Group categories for discrete features

Sometimes discrete variables are messy, e.g., they have too many imbalanced categories or an extremely skewed distribution. You may use the group_category function to group the long tail.

library(DataExplorer)
library(ggplot2)
library(data.table) # provides data.table() used below
data(diamonds)

## View original distribution of variable clarity
diamonds <- data.table(diamonds)
table(diamonds$clarity)

## Trial and error without updating: Group bottom 20% clarity based on frequency
group_category(diamonds, "clarity", 0.2)
## Group bottom 30% clarity and update original dataset
group_category(diamonds, "clarity", 0.3, update = TRUE)

## View distribution after updating
table(diamonds$clarity)

## Group bottom 20% cut using value of carat
table(diamonds$cut)
group_category(diamonds, "cut", 0.2, measure = "carat", update = TRUE)
table(diamonds$cut)

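Note: group_category with update = TRUE works with data.table objects only; with update = FALSE, the grouped frequency distribution can still be viewed for any input class. If you are working with a data.frame, convert it to a data.table first and set it back afterwards. See the example below.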

library(DataExplorer)
library(data.table) # provides data.table() used below

## Set data.frame object to data.table
USArrests <- data.table(USArrests)
## Collapse bottom 10% UrbanPop based on frequency
group_category(USArrests, "UrbanPop", 0.1, update = TRUE)
## Set object back to data.frame
class(USArrests) <- "data.frame"
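
To make the idea of grouping the long tail concrete, here is a minimal base R sketch. collapse_tail is a hypothetical helper written for this illustration, not part of DataExplorer, and its cutoff rule (merge the rarest categories while their combined share stays within the threshold) is an assumption that may differ from what group_category does.

```r
# Hypothetical helper: merge the rarest categories into a single level.
collapse_tail <- function(x, threshold, other = "OTHER") {
  x <- as.character(x)
  freq <- sort(table(x))                     # ascending: rarest first
  counts <- as.vector(freq)
  cum_share <- cumsum(counts) / sum(counts)  # cumulative share from the bottom
  tail_levels <- names(freq)[cum_share <= threshold]
  x[x %in% tail_levels] <- other
  x
}

# Toy data: five categories with counts 50, 30, 12, 5 and 3.
x <- rep(c("a", "b", "c", "d", "e"), times = c(50, 30, 12, 5, 3))

# Group the bottom 10%: "e" and "d" together cover 8% of the rows,
# while adding "c" would exceed the threshold.
grouped <- collapse_tail(x, threshold = 0.1)
print(table(grouped))
```

Because the helper returns a new vector rather than modifying its input, it works on any vector or data.frame column.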

Other miscellaneous functions

  • plot_str: Plot the data structure as a network graph.
  • drop_columns: Quickly drop variables by either column index or column name. (data.table only)
  • set_missing: Quickly set all missing observations to a value. (data.table only)
  • split_columns: Split data into two objects: discrete and continuous.
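
As an illustration of what splitting by feature type means, the sketch below separates numeric columns from all others in base R. split_by_type is a hypothetical helper for this example only; the structure returned by the package's split_columns may differ.

```r
# Hypothetical helper: separate numeric columns from everything else.
# Expects a plain data.frame.
split_by_type <- function(df) {
  is_continuous <- vapply(df, is.numeric, logical(1))
  list(
    discrete = df[, !is_continuous, drop = FALSE],
    continuous = df[, is_continuous, drop = FALSE]
  )
}

parts <- split_by_type(iris)
names(parts$discrete)    # "Species"
names(parts$continuous)  # the four measurement columns
```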

News

Changelog

DataExplorer 0.5.0

New Features

  • #48: Added plot_scatterplot to visualize the relationship of one feature against all others.
  • #50: Added plot_boxplot to visualize continuous distributions broken down by another feature.

Enhancements

  • #44: Added option to exclude categories in group_category.
  • #45: Added title option for all plots.
  • #46: Added option to exclude columns in set_missing.
  • #49 [Breaking Change]: Switched package to tidyverse style. All old functions are in .Deprecated mode. List of name changes in alphabetical order:
    • BarDiscrete -> plot_bar
    • CollapseCategory -> group_category
    • CorrelationContinuous -> plot_correlation(..., type = "continuous")
    • CorrelationDiscrete -> plot_correlation(..., type = "discrete")
    • DensityContinuous -> plot_density
    • DropVar -> drop_columns
    • GenerateReport -> create_report
    • HistogramContinuous -> plot_histogram
    • PlotMissing -> plot_missing
    • PlotStr -> plot_str
    • SetNaTo -> set_missing
    • SplitColType -> split_columns
  • #52: Combined CorrelationContinuous and CorrelationDiscrete into one function, and added option to view correlation of all features at once.
  • Optimized layout for multiple plots.

Bug Fixes

  • #47: Fixed color scale for correlation heatmap.

DataExplorer 0.4.0

New Features

  • #33: Added PlotStr to visualize data structure.
  • #40: Added network graph to GenerateReport.

Bug Fixes

  • #32: Fixed pandoc requirement error in unit test on CRAN.
  • #34: Fixed error message when quiet is not supplied. In addition, the report directory is now printed through message() instead of cat().
  • #35: Fixed rprojroot not found error.

Enhancements

  • #12: Added vignette: dataexplorer-intro.
  • #36: Fixed warnings from data.table in DropVar.
  • #37: Changed all cat() to message().
  • #38: Added option to order bars in BarDiscrete.
  • #39: Extended SetNaTo to discrete features.
  • Added more examples in README file.

DataExplorer 0.3.0

New Features

  • #25: Added SetNaTo to quickly reset missing numerical values.
  • #29: Added DropVar to quickly drop variables by either name or column position.

Bug Fixes

  • #24: CorrelationDiscrete now displays all factor levels instead of contrasts from model.matrix.

Enhancements

  • #11: Functions with return values will now match the input class and set it back.
  • #22: Added documentation for num_all_missing in SplitColType.
  • #23: Added additional measures (in addition to frequency) to CollapseCategory.
  • #26: Removed density estimation section from report template.
  • #31: Added flexibility to name the new category in CollapseCategory.

Other notes

  • #30: In CollapseCategory, update = TRUE will only work with input data as data.table. However, it is still possible to view the frequency distribution with any input data class, as long as update = FALSE.

DataExplorer 0.2.6

Bug Fixes

  • #20: Fixed permission denied bug due to intermediates_dir argument in rmarkdown::render.

Enhancements

  • #16: Improved handling of missing values.

DataExplorer 0.2.5

Bug Fixes

  • #18: GenerateReport now handles data without discrete or continuous features.

Enhancements

  • #14: Updated rmarkdown template for GenerateReport.
  • #1: Features with all NA values will be ignored in BarDiscrete.

DataExplorer 0.2.4

Bug Fixes

  • Fixed a major bug in GenerateReport function due to package renaming.

Enhancements

  • GenerateReport will now print the directory of the report to console.

DataExplorer 0.2.3

New Features

  • Added function CollapseCategory to collapse sparse categories for discrete features.
  • Added correlation heatmap for both continuous and discrete features.
  • Added density plot for continuous features.

Bug Fixes

  • Fixed a bug in BarDiscrete and CorrelationDiscrete that prevented non-factor columns from being plotted.
  • Minor changes for CRAN re-submission.

Enhancements

  • Changed grid layout for BarDiscrete and HistogramContinuous.
  • Features with all missing values will be ignored.
  • Switched position between continuous and discrete features in report template.
  • Renamed package to DataExplorer.
  • Added NEWS.md.
  • Removed BoxplotContinuous.

0.5.0 by Boxuan Cui, 4 months ago


https://github.com/boxuancui/DataExplorer


Report a bug at https://github.com/boxuancui/DataExplorer/issues


Browse source code at https://github.com/cran/DataExplorer


Authors: Boxuan Cui [aut, cre]



MIT + file LICENSE license


Imports data.table, reshape2, ggplot2, scales, gridExtra, rmarkdown, networkD3, stats, utils

Suggests testthat, covr, knitr, jsonlite, nycflights13

System requirements: pandoc (>= 1.12.3) - http://pandoc.org

