Data Explorer

DataExplorer automates the data exploration process for data analysis and model building, so that users can focus on understanding data and extracting insights. The package scans each variable and profiles it automatically, applying typical graphical techniques to both discrete and continuous features.

Exploratory Data Analysis (EDA) is the initial, and an important, phase of data analysis. In this phase, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, EDA can be tedious. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

The package can be installed from GitHub using the devtools package.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")

To install the latest development version, run the following code in R.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref="develop")

The package is designed to be easy to use: most tasks take a single line of R code. Please refer to the package manual for more information.

To get a report for the iris dataset:
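A minimal sketch, assuming the release described on this page (0.4.0), in which the report function is exported as GenerateReport. The has_report_fn guard is not part of the package; it is added here so that the snippet skips cleanly if the installed version does not export that name.

```r
# Sketch only: assumes a DataExplorer release that exports GenerateReport().
# has_report_fn is an illustrative guard, not something the package provides.
has_report_fn <- requireNamespace("DataExplorer", quietly = TRUE) &&
  "GenerateReport" %in% getNamespaceExports("DataExplorer")

if (has_report_fn) {
  # iris ships with base R (datasets package), so no extra loading is needed.
  DataExplorer::GenerateReport(iris)
}
```

According to the change notes, GenerateReport prints the directory of the generated report to the console.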


To get a report for the diamonds dataset in ggplot2 package:
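A similar sketch for diamonds, under the same assumption that the installed release exports GenerateReport, and additionally that ggplot2 is installed. The can_run guard is illustrative only.

```r
# Sketch only: assumes ggplot2 is installed and that the installed DataExplorer
# release exports GenerateReport(). can_run is an illustrative guard.
can_run <- requireNamespace("ggplot2", quietly = TRUE) &&
  requireNamespace("DataExplorer", quietly = TRUE) &&
  "GenerateReport" %in% getNamespaceExports("DataExplorer")

if (can_run) {
  # diamonds is a dataset bundled with ggplot2; reference it without attaching.
  DataExplorer::GenerateReport(ggplot2::diamonds)
}
```

Because diamonds is much larger than iris, the report may take noticeably longer to render.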




  • #25: Added SetNaTo to quickly reset missing numerical values.
  • #29: Added DropVar to quickly drop variables by either name or column position.
  • #24: CorrelationDiscrete now displays all factor levels instead of contrasts from model.matrix.
  • #11: Functions with return values now restore the class of the input data on the returned object.
  • #22: Added documentation for num_all_missing in SplitColType.
  • #23: Added additional measures (in addition to frequency) to CollapseCategory.
  • #26: Removed density estimation section from report template.
  • #31: Added flexibility to name the new category in CollapseCategory.
  • #30: In CollapseCategory, update = TRUE only works when the input data is a data.table. With update = FALSE, the frequency distribution can still be viewed for input data of any class.

  • #20: Fixed permission denied bug due to intermediates_dir argument in rmarkdown::render.
  • #16: Improved handling of missing values.

  • #18: GenerateReport now handles data without discrete or continuous features.
  • #14: Updated rmarkdown template for GenerateReport.
  • #1: Features with all NA values will be ignored in BarDiscrete.

  • Fixed a major bug in GenerateReport function due to package renaming.
  • GenerateReport will now print the directory of the report to console.

  • Added function CollapseCategory to collapse sparse categories for discrete features.
  • Added correlation heatmap for both continuous and discrete features.
  • Added density plot for continuous features.
  • Fixed a bug in BarDiscrete and CorrelationDiscrete where features of non-factor class were not plotted.
  • Minor changes for CRAN re-submission.
  • Changed grid layout for BarDiscrete and HistogramContinuous.
  • Features with all missing values will be ignored.
  • Switched position between continuous and discrete features in report template.
  • Renamed package to DataExplorer.
  • Added
  • Removed BoxplotContinuous.

0.4.0 by Boxuan Cui, 10 months ago

Authors: Boxuan Cui [aut, cre]

Documentation:   PDF Manual  

GPL-2 license

Imports data.table, reshape2, ggplot2, scales, gridExtra, rmarkdown, networkD3, stats, utils

Suggests testthat, covr, knitr, jsonlite, nycflights13

System requirements: pandoc (>= 1.12.3)
