Data Exploration with Information Theory (Weight-of-Evidence and Information Value)

Performs exploratory data analysis and variable screening for binary classification models using weight-of-evidence (WOE) and information value (IV). In order to make the package as efficient as possible, aggregations are done in data.table and creation of WOE vectors can be distributed across multiple cores. The package also supports exploration for uplift models (NWOE and NIV).


Binary classification models are perhaps the most common use-case in predictive analytics. The reason is that many key client actions across a wide range of industries are binary in nature, such as defaulting on a loan, clicking on an ad, or terminating a subscription.

Prior to building a binary classification model, a common step is to perform variable screening and exploratory data analysis. This is the step where we get to know the data and weed out variables that are either ill-conditioned or simply contain no information that will help us predict the action of interest. Note that the purpose of this step should not be confused with that of multiple-variable selection techniques, such as stepwise regression, where the variables that go into the final model are selected. Rather, this is a precursory step designed to ensure that the approaches deployed during the final modeling phases are set up for success.

The weight of evidence (WOE) and information value (IV) provide a useful framework for exploratory analysis and variable screening for binary classifiers. WOE and IV have been used extensively in the credit risk world for several decades, and the underlying theory dates back to the 1950s. However, they are still not widely used in other industries.

WOE and IV analysis enables one to:

  • Consider each variable’s independent contribution to the outcome.
  • Detect linear and non-linear relationships.
  • Rank variables in terms of "univariate" predictive strength.
  • Visualize the correlations between the predictive variables and the binary outcome.
  • Seamlessly compare the strength of continuous and categorical variables without creating dummy variables.
  • Seamlessly handle missing values without imputation.
  • Assess the predictive power of missing values.
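
To make the quantities behind this list concrete, here is a minimal base-R sketch that computes WOE and IV by hand for a single binned variable. It assumes the common convention WOE = log(P(bin | Y=1) / P(bin | Y=0)) and IV = sum over bins of (P(bin | Y=1) - P(bin | Y=0)) * WOE; the toy data and the helper name woe_table are made up for illustration and are not part of the Information package, whose own binning and penalty logic is not reproduced here.

```r
# Toy data (made up): a binary outcome y and a predictor already cut into bins.
bin <- c("low", "low", "low", "low", "mid", "mid", "mid", "high", "high", "high")
y   <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 0)

# Hypothetical helper: per-bin WOE and IV contribution for one variable.
woe_table <- function(bin, y) {
  # Share of all events (y == 1) and of all non-events (y == 0) in each bin
  p_event    <- tapply(y == 1, bin, sum) / sum(y == 1)
  p_nonevent <- tapply(y == 0, bin, sum) / sum(y == 0)
  # WOE is the log ratio of the two shares (assumed sign convention)
  woe <- log(p_event / p_nonevent)
  data.frame(bin = names(woe),
             woe = as.numeric(woe),
             iv  = as.numeric((p_event - p_nonevent) * woe))
}

tab      <- woe_table(bin, y)
total_iv <- sum(tab$iv)  # information value of the variable

print(tab, row.names = FALSE)
print(total_iv)
```

Each bin's IV contribution is non-negative, because the difference of the two shares always has the same sign as the log of their ratio. A bin with no events or no non-events would give an infinite WOE, so a real implementation needs to handle sparse bins; missing values can simply be treated as a bin of their own, which is what makes it possible to assess their predictive power without imputation.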

About the Information Package

The Information package is designed to perform WOE and IV analysis for binary classification models as well as uplift models. To maximize performance, aggregations are done in data.table, and creation of WOE vectors can be distributed across multiple cores.

Extensions to Exploratory Analysis for Uplift Models

Consider a direct marketing program where a test group received an offer of some sort, and the control group did not receive anything. The test and control groups are based on a random split. The lift of the campaign is defined as the difference in success rates between the test and control groups. In other words, the program can only be deemed successful if the offer outperforms the "do nothing" (a.k.a. baseline) scenario.

The purpose of uplift models is to estimate the difference between the test and control groups, and then to use the resulting model to target persuadables, i.e., potential or existing clients who are on the fence and need some sort of offer or contract to sign up or buy a product. Thus, when preparing to build an uplift model, we cannot focus only on the log odds of Y=1 (where Y is some binary outcome); we need to analyze the log odds ratio of Y=1 for the test group versus the control group. This can be handled by the net weight of evidence (NWOE) and the net information value (NIV).
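
As a rough illustration of the log odds ratio idea, the following base-R sketch computes a per-bin NWOE as the WOE in the test group minus the WOE in the control group. This is a sketch under stated assumptions: the toy data, the column names, and the helper woe_by_bin are hypothetical, the WOE sign convention is assumed to be log(P(bin | Y=1) / P(bin | Y=0)), and NIV is left out because its exact formula is not given here.

```r
# Toy uplift data (made up): 8 test rows (trt = 1) and 8 control rows (trt = 0),
# each split evenly across two bins of a predictor.
d <- data.frame(
  trt = rep(c(1, 0), each = 8),
  bin = rep(rep(c("low", "high"), each = 4), times = 2),
  y   = c(1, 0, 0, 0,  1, 1, 1, 0,   # test group: low bin, then high bin
          1, 1, 0, 0,  1, 1, 0, 0)   # control group: low bin, then high bin
)

# Hypothetical helper: WOE per bin within one group.
woe_by_bin <- function(bin, y) {
  p_event    <- tapply(y == 1, bin, sum) / sum(y == 1)
  p_nonevent <- tapply(y == 0, bin, sum) / sum(y == 0)
  log(p_event / p_nonevent)
}

test    <- d[d$trt == 1, ]
control <- d[d$trt == 0, ]

# NWOE: difference between the test-group and control-group WOE for each bin
nwoe <- woe_by_bin(test$bin, test$y) - woe_by_bin(control$bin, control$y)
print(nwoe)
```

In this toy data the control group responds at the same rate in both bins, so its WOE is zero everywhere and the NWOE comes entirely from the test group: the offer is associated with more purchases in the "high" bin and fewer in the "low" bin.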

Simple Example

library(Information)
# Set ncore=2 since CRAN does not allow more than 2 cores for examples
# For real applications, leave ncore as NULL to get the default, which is: number of cores - 1
data(train, package="Information")
train <- subset(train, TREATMENT==1)
IV <- create_infotables(data=train, y="PURCHASE", ncore=2)
 
# Show the first records of the IV summary table
print(head(IV$Summary), row.names=FALSE)
 
# Show the WOE table for the variable called N_OPEN_REV_ACTS
print(IV$Tables$N_OPEN_REV_ACTS, row.names=FALSE)

How to Install

You can install:

  • The latest development version from GitHub with
devtools::install_github("klarsen1/Information")
  • The latest released version from CRAN with
install.packages("Information")

News

Package: Information

Version: 0.0.9

Information version 0.0.1.9000

Submit to CRAN.

Information version 0.0.2

Shortened the title.

Information version 0.0.3

Expanded documentation of the difference between creating individual plots versus grid plots.

Information version 0.0.4

  • Changed the vignette index title (the previous version's title just said "Vignette Title").

  • Added more examples to demonstrate the use of multiple cores.

  • Used the dontrun markdown keyword instead of commenting out code that should not be tested.

  • Exclude character variables with only one unique value from WOE/NWOE calculations.

  • Check if the treatment parameter is binary.

Information version 0.0.5

  • Fixed a bug that occurs when binary variables have NAs

Information version 0.0.6

  • Automatically remove "class==Date" variables from table generation
  • Accepting input data of format tbl and tbl_df
  • More precise namespace imports to avoid clashes with the upcoming release of ggplot2 2.0.0
  • Check if the treatment parameter is binary.

Information version 0.0.7

  • Fixed issues with vignettes

Information version 0.0.8

  • Fixed a bug in the penalty calculation that occurs for certain cases.

Information version 0.0.9

  • Removed the match-key from the WOE and NWOE tables when the parallel option is used.


0.0.9 by Larsen Kim, 10 months ago


Browse source code at https://github.com/cran/Information


Authors: Larsen Kim [aut, cre]




GPL (>= 3) license


Imports data.table, ggplot2, grid, plyr, utils, iterators, doParallel, parallel, foreach

Suggests knitr, reshape2, ClustOfVar, rmarkdown

