DataExplorer automates the data exploration process for data analysis and model building, so that users can focus on understanding data and extracting insights. The package scans each variable and profiles the data automatically, applying typical graphical techniques to both discrete and continuous features.
Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis. In this phase, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.
The package can be installed directly from CRAN with install.packages("DataExplorer").
However, the latest stable version (if any) can be found on GitHub and installed using:

    if (!require(remotes)) install.packages("remotes")
    remotes::install_github("boxuancui/DataExplorer")

If you would like to install the latest development version, you may install the dev branch:

    if (!require(remotes)) install.packages("remotes")
    remotes::install_github("boxuancui/DataExplorer", ref = "develop")

The package is easy to use: almost everything can be done in one line of code. Please refer to the package manuals for more information; the package vignettes contain further examples.
To get a report for the airquality dataset, load the package and call create_report(airquality).
To get a report for the diamonds dataset from the ggplot2 package:

    library(DataExplorer)
    library(ggplot2)
    create_report(diamonds)

You may also run all the plotting functions individually for your analysis, e.g.,

    library(DataExplorer)
    library(ggplot2)

    ## View distribution of all discrete variables
    plot_bar(diamonds)
    ## View distribution of cut only
    plot_bar(diamonds$cut)
    ## View correlation of all discrete variables
    plot_correlation(diamonds, type = "discrete")

    ## View distribution of all continuous variables
    plot_histogram(diamonds)
    ## View distribution of carat only
    plot_histogram(diamonds$carat)
    ## View correlation of all continuous variables
    plot_correlation(diamonds, type = "continuous")

    ## View overall correlation heatmap
    plot_correlation(diamonds)

    ## View distribution of missing values for airquality data
    missing_data <- plot_missing(airquality) # missing data profile will be returned
    missing_data

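As a quick cross-check of the missing-value profile, roughly the same summary can be sketched in base R without the package. The output column names below (feature, num_missing, pct_missing) are assumptions chosen for illustration and may not match what plot_missing returns.

```r
# Base R sketch: count and share of missing values per column of airquality.
# The output column names are illustrative assumptions, not the package's.
num_missing <- vapply(airquality, function(x) sum(is.na(x)), integer(1))

missing_profile <- data.frame(
  feature = names(num_missing),
  num_missing = unname(num_missing),
  pct_missing = unname(num_missing) / nrow(airquality),
  stringsAsFactors = FALSE
)

# Sort so the columns with the most missing values come first
missing_profile <- missing_profile[order(-missing_profile$num_missing), ]
print(missing_profile)
```

In airquality, only Ozone and Solar.R contain missing values, so they appear at the top of the sorted profile.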
To visualize distributions based on another variable, you may do the following:

    library(DataExplorer)

    ## View iris continuous distribution based on each Species
    plot_boxplot(iris, "Species")

    ## View iris continuous distribution based on different buckets of Sepal.Length
    plot_boxplot(iris, "Sepal.Length")

    ## Scatterplot Ozone against all other airquality features
    # Set some features to factor
    for (i in c("Month", "Day")) airquality[[i]] <- as.factor(airquality[[i]])
    # Plot scatterplot
    # Note: discrete and continuous charts are plotted on separate pages!
    plot_scatterplot(airquality, "Ozone")

Sometimes, discrete variables are messy, e.g., too many imbalanced categories or an extremely skewed categorical distribution. You may use the group_category function to group the long tails.

    library(DataExplorer)
    library(data.table) # provides data.table()
    library(ggplot2)
    data(diamonds)

    ## View original distribution of variable clarity
    diamonds <- data.table(diamonds)
    table(diamonds$clarity)

    ## Trial and error without updating: Group bottom 20% clarity based on frequency
    group_category(diamonds, "clarity", 0.2)

    ## Group bottom 30% clarity and update original dataset
    group_category(diamonds, "clarity", 0.3, update = TRUE)

    ## View distribution after updating
    table(diamonds$clarity)

    ## Group bottom 20% cut using value of carat
    table(diamonds$cut)
    group_category(diamonds, "cut", 0.2, measure = "carat", update = TRUE)
    table(diamonds$cut)

Note: this function works with data.table objects only. If you are working with a data.frame, please add the data.table class to your object and then remove it later. See example below.

    library(DataExplorer)
    library(data.table) # provides data.table()

    ## Set data.frame object to data.table
    USArrests <- data.table(USArrests)

    ## Collapse bottom 10% UrbanPop based on frequency
    group_category(USArrests, "UrbanPop", 0.1, update = TRUE)

    ## Set object back to data.frame
    class(USArrests) <- "data.frame"

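To make the idea of grouping the bottom share of categories concrete, here is a minimal base R sketch. The helper collapse_tail and the "OTHER" label are hypothetical names for illustration, and the cutoff rule (keep the most frequent categories whose cumulative share stays within 1 - threshold) is an assumption; the exact rule used by group_category may differ.

```r
# Hypothetical helper (not part of DataExplorer): collapse low-frequency
# categories of a vector into a single "OTHER" level.
collapse_tail <- function(x, threshold, other = "OTHER") {
  x <- as.character(x)
  # Category counts, most frequent first
  counts <- sort(table(x), decreasing = TRUE)
  cum_share <- cumsum(counts) / sum(counts)
  # Keep the top categories that together cover at most (1 - threshold)
  keep <- names(counts)[cum_share <= 1 - threshold]
  # Everything else (including NA) is relabelled as `other`
  ifelse(x %in% keep, x, other)
}

# Toy data: five categories with a long tail
x <- c(rep("a", 50), rep("b", 25), rep("c", 15), rep("d", 6), rep("e", 4))
grouped <- collapse_tail(x, 0.2)
table(grouped)
```

With this toy data, "a" and "b" cover 75% of observations and are kept, while "c", "d" and "e" are merged into "OTHER".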
plot_str: Plot data structure in network graph.
drop_columns: Quickly drop variables with either column index or column names. (data.table only)
set_missing: Quickly set all missing observations to a value. (data.table only)
split_columns: Split data into two objects: discrete and continuous.
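The discrete/continuous split described above can be illustrated with a short base R sketch. The helper split_by_type is hypothetical, and treating every numeric column as continuous is a simplifying assumption; split_columns may classify columns differently.

```r
# Hypothetical helper (not part of DataExplorer): split a data.frame into
# discrete and continuous parts, treating numeric columns as continuous.
split_by_type <- function(df) {
  is_continuous <- vapply(df, is.numeric, logical(1))
  list(
    discrete = df[, !is_continuous, drop = FALSE],
    continuous = df[, is_continuous, drop = FALSE]
  )
}

parts <- split_by_type(iris)
names(parts$discrete)   # the factor column Species
names(parts$continuous) # the four numeric measurement columns
```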
plot_scatterplot to visualize relationship of one feature against all other.
plot_boxplot to visualize continuous distributions broken down by another feature.
.Deprecated mode. List of name changes in alphabetical order:
plot_correlation(..., type = "continuous")
plot_correlation(..., type = "discrete")
CorrelationDiscrete into one function, and added option to view correlation of all features at once.
quiet is not supplied. In addition, report directory are printed through
SetNaTo to discrete features.
SetNaTo to quickly reset missing numerical values.
DropVar to quickly drop variables by either name or column position.
CorrelationDiscrete now displays all factor levels instead of contrasts from
update = TRUE will only work with input data as data.table. However, it is still possible to view the frequency distribution with any input data class, as long as update = FALSE.
GenerateReport now handles data without discrete or continuous features.
NA values will be ignored in
GenerateReport function due to package renaming.
GenerateReport will now print the directory of the report to console.
CollapseCategory to collapse sparse categories for discrete features.
CorrelationDiscrete for not plotting non-factor class.