Preliminary Visualisation of Data

Create preliminary exploratory data visualisations of an entire dataset to identify problems or unexpected features using 'ggplot2'.


Project Status: Active – The project has reached a stable, usable state and is being actively developed.

How to install

visdat is available on CRAN:

 
install.packages("visdat")

If you would like to use the development version, install it from GitHub with:

 
# install.packages("devtools")
devtools::install_github("ropensci/visdat")

What does visdat do?

Initially inspired by csv-fingerprint, visdat helps you visualise a dataframe and “get a look at the data” by displaying the variable classes in a dataframe as a plot with vis_dat, and by giving a brief look into missing data patterns with vis_miss.

visdat has 6 functions:

  • vis_dat() visualises a dataframe showing you what the classes of the columns are, and also displaying the missing data.

  • vis_miss() visualises just the missing data, and allows for missingness to be clustered and columns rearranged. vis_miss() is similar to missing.pattern.plot from the mi package. Unfortunately missing.pattern.plot is no longer in the mi package (as of 14/02/2016).

  • vis_compare() visualises differences between two dataframes of the same dimensions.

  • vis_expect() visualises where certain conditions hold true in your data.

  • vis_cor() visualises the correlation of variables in a heatmap.

  • vis_guess() visualises the individual class of each value in your data.

You can read more about visdat in the vignette, “using visdat”.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Examples

Using vis_dat()

Let’s see what’s inside the airquality dataset from base R, which contains information about daily air quality measurements in New York from May to September 1973. More information about the dataset can be found with ?airquality.

 
library(visdat)
 
vis_dat(airquality)

The plot above tells us that R reads this dataset as having numeric and integer values, with some missing data in Ozone and Solar.R. The classes are shown in the legend, and missing data is shown in grey. The column/variable names are listed on the x axis.

Using vis_miss()

We can explore the missing data further using vis_miss():

 
vis_miss(airquality)

Percentages of missing/complete in vis_miss are accurate to 1 decimal place.
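As a rough cross-check of these percentages, the share of missing cells can also be computed directly in base R. This is only a minimal sketch: it is not how vis_miss builds its labels internally, and the variable names are illustrative.

```r
# is.na() on a data frame returns a logical matrix, so mean() gives
# the proportion of cells that are missing.
pct_missing <- mean(is.na(airquality)) * 100
pct_complete <- 100 - pct_missing

# Round to one decimal place, the precision quoted above.
round(pct_missing, 1)
round(pct_complete, 1)

# Per-column missingness, in percent.
pct_missing_by_col <- round(colMeans(is.na(airquality)) * 100, 1)
pct_missing_by_col
```

Only Ozone and Solar.R should show non-zero values here, in line with the vis_dat plot earlier.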

You can cluster the missingness by setting cluster = TRUE:

 
vis_miss(airquality, 
         cluster = TRUE)

Columns can also be ordered so that those with the most missingness come first, by setting sort_miss = TRUE:

 
vis_miss(airquality,
         sort_miss = TRUE)

vis_miss indicates when only a very small amount of data is missing (<0.1% missingness):

 
test_miss_df <- data.frame(x1 = 1:10000,
                           x2 = rep("A", 10000),
                           x3 = c(rep(1L, 9999), NA))
 
vis_miss(test_miss_df)

vis_miss will also indicate when there is no missing data at all:

 
vis_miss(mtcars)

To further explore the missingness structure in a dataset, I recommend the naniar package, which provides more general tools for graphical and numerical exploration of missing values.

Using vis_compare()

Sometimes you want to see what has changed in your data. vis_compare() displays the differences between two dataframes of the same size. Let’s look at an example.

Let’s make some changes to the chickwts dataset, and compare the new dataset with the original:

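# Copy chickwts, then set 30 randomly chosen rows to NA.
# chickwts has only two columns, so sampling 2 columns selects both.
chickwts_diff <- chickwts
chickwts_diff[sample(seq_len(nrow(chickwts)), 30), sample(seq_len(ncol(chickwts)), 2)] <- NA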
 
vis_compare(chickwts_diff, chickwts)

Here the differences are marked in blue.

If you try to compare dataframes with different dimensions, you get an ugly error:

 
chickwts_diff_2 <- chickwts
chickwts_diff_2$new_col <- chickwts_diff_2$weight*2
 
vis_compare(chickwts, chickwts_diff_2)
# Error in vis_compare(chickwts, chickwts_diff_2) : 
#   Dimensions of df1 and df2 are not the same. vis_compare requires dataframes of identical dimensions.
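One way to avoid this error is to check the dimensions before calling vis_compare(). The sketch below uses a hypothetical helper, same_dims(), which is not part of visdat; it only illustrates the check in base R.

```r
# Hypothetical helper (not part of visdat): TRUE when both data frames
# have the same number of rows and the same number of columns.
same_dims <- function(df1, df2) {
  identical(dim(df1), dim(df2))
}

chickwts_diff_2 <- chickwts
chickwts_diff_2$new_col <- chickwts_diff_2$weight * 2

# Only call vis_compare() when the check passes; otherwise report the mismatch.
if (same_dims(chickwts, chickwts_diff_2)) {
  visdat::vis_compare(chickwts, chickwts_diff_2)
} else {
  message("Dimensions differ: ",
          paste(dim(chickwts), collapse = " x "), " vs ",
          paste(dim(chickwts_diff_2), collapse = " x "))
}
```

Here the check fails because chickwts_diff_2 has one extra column, so the message is printed instead of the error.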

Using vis_expect()

vis_expect visualises certain conditions or values in your data. For example, if you are not sure whether to expect values greater than or equal to 25 in your data (airquality), you could write vis_expect(airquality, ~.x >= 25) to see where the values in your data are greater than or equal to 25:

 
vis_expect(airquality, ~.x >= 25)

This shows the proportion of values that are greater than or equal to 25, as well as the missing values.
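To see roughly what such a condition evaluates to, the same comparison can be applied in base R. This is a sketch of the idea only, not the internals of vis_expect, and the variable names are illustrative.

```r
# Comparing a data frame with a number gives a logical matrix:
# TRUE where the value is >= 25, FALSE where it is not, NA where it is missing.
meets_expectation <- airquality >= 25

# Proportion of non-missing cells that satisfy the condition.
prop_true <- mean(meets_expectation, na.rm = TRUE)

# Proportion of cells where the condition cannot be evaluated (missing values).
prop_missing <- mean(is.na(meets_expectation))

prop_true
prop_missing
```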

Using vis_cor()

To make it easy to plot correlations of your data, use vis_cor:

 
vis_cor(airquality)
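The numbers behind a heatmap like this can be inspected with base R's cor(). The use of pairwise-complete observations below is an assumption made for illustration, because airquality contains missing values; it is not necessarily what vis_cor does by default.

```r
# Pearson correlations between all columns. "pairwise.complete.obs" drops
# missing values pair by pair instead of discarding whole rows.
cor_matrix <- cor(airquality, use = "pairwise.complete.obs")

round(cor_matrix, 2)
```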

Using vis_guess()

vis_guess() takes a guess at what each cell is. It’s best illustrated using some messy data, which we’ll make here:

 
messy_vector <- c(TRUE,
                  T,
                  "TRUE",
                  "T",
                  "01/01/01",
                  "01/01/2001",
                  NA,
                  NaN,
                  "NA",
                  "Na",
                  "na",
                  "10",
                  10,
                  "10.1",
                  10.1,
                  "abc",
                  "$%TG")
 
set.seed(1114)
messy_df <- data.frame(var1 = messy_vector,
                       var2 = sample(messy_vector),
                       var3 = sample(messy_vector))
 
vis_guess(messy_df)
vis_dat(messy_df)

Here vis_guess() shows that there are many different kinds of data in the dataframe, whereas vis_dat() reports a single class for each column. As an analyst, this might be a depressing finding.
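The reason a column can hold such a mix is that c() coerces its inputs to a single type. The short base R sketch below, using a shortened illustrative vector, shows the coercion that happens when a vector like messy_vector is built.

```r
# c() coerces mixed inputs to one type; with any string present,
# everything becomes character.
mixed <- c(TRUE, "TRUE", NA, NaN, 10, "abc")
class(mixed)
mixed

# A data frame column built from this vector therefore has one class,
# even though the individual cells look like different kinds of data.
mixed_df <- data.frame(var1 = mixed, stringsAsFactors = FALSE)
sapply(mixed_df, class)
```

Note that NA stays a true missing value, while NaN becomes the string "NaN".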

Thank yous

Thank you to Ivan Hanigan, who first commented this suggestion after I made a blog post about an initial prototype, ggplot_missing, and to Jenny Bryan, whose tweet got me thinking about vis_dat and whose code contributions removed a lot of errors.

Thank you to Hadley Wickham for suggesting the use of the internals of readr to make vis_guess work. Thank you to Miles McBain for his suggestions on how to improve vis_guess, which made it at least 2-3 times faster. Thanks to Carson Sievert for writing the code that combined plotly with visdat, and to Noam Ross for suggesting this in the first place. Thank you also to Earo Wang and Stuart Lee for their help in capturing expressions in vis_expect.

Finally, thank you to rOpenSci and its amazing onboarding process, which has made visdat a much better package. Thanks to the editor, Noam Ross (@noamross), and the reviewers, Sean Hughes (@seaaan) and Mara Averick (@batpigandme).


News

visdat 0.5.1 (2018/07/02) "The Northern Lights Moonwalker"

New Feature

  • vis_compare() for comparing two dataframes of the same dimensions
  • vis_expect() for visualising where certain values of expectations occur in the data
    • Added NA colours to vis_expect
    • Added show_perc arg to vis_expect to show the percentage of expectations that are TRUE. #73
  • vis_cor to visualise correlations in a dataframe
  • vis_guess() for displaying the likely type for each cell in a dataframe
  • Added draft vis_expect to make it easy to look at certain appearances of numbers in your data.
  • visdat is now under the rOpenSci github repository

Minor Changes

  • added CITATION for visdat to cite the JOSS article
  • updated options for vis_cor to use argument na_action not use_op.
  • cleaned up the organisation of the files and internal functions
  • Added appropriate legend and x axis for vis_miss_ly - thanks to Stuart Lee
  • Updated the paper.md for JOSS
  • Updated some old links in doco
  • Added Sean Hughes and Mara Averick to the DESCRIPTION with ctb.
  • Minor changes to the paper for JOSS

Bug Fixes

  • Fix bug reported in #75 where vis_dat(diamonds) errored at seq_len(nrow(x)) inside the internal function vis_gather_, which is used to calculate the row numbers. Using mutate(rows = dplyr::row_number()) solved the issue.

  • Fix bug reported in #72 where vis_miss errored when one column was given to it. This was an issue with using limits inside scale_x_discrete - which is used to order the columns of the data. It is not necessary to order one column of data, so I created an if-else to avoid this step and return the plot early.

  • Fix visdat x axis alignment when show_perc_col = FALSE - #82

  • fix visdat x axis alignment - issue 57

  • fix bug where the column percentage missing would print to be NA when it was exactly equal to 0.1% missing. - issue 62

  • vis_cor didn't gather variables for plotting appropriately - now fixed

visdat 0.1.0 (2017/07/03) ("JOSS")

  • lightweight CRAN submission - will only contain functions vis_dat and vis_miss

visdat 0.0.7.9100 (2017/07/03)

New Features

  • add_vis_dat_pal() (internal) to add a palette for vis_dat and vis_guess
  • vis_guess now gets a palette argument like vis_dat
  • Added prototype/placeholder functions for plotly vis_*_ly interactive graphs:
    • vis_guess_ly()
    • vis_dat_ly()
    • vis_compare_ly()
    These simply wrap plotly::ggplotly(vis_*(data)). In the future they will be written in plotly so that they can be generated much faster.

Minor improvements

  • corrected testing for vis_* family
  • added .svg graphics for correct vdiffr testing
  • improved hover print method for plotly.

visdat 0.0.6.9000 (2017/02/26)

New Features

  • axes in vis_ family are now flipped by default
  • vis_miss now shows the % missingness in a column, can be disabled by setting show_perc_col argument to FALSE
  • removed flip argument, as this should be the default

Minor Improvements

  • added internal functions to improve extensibility and debugging - vis_create_, vis_gather_ and vis_extract_value_.
  • suppress unneeded warnings arising from compiling factors

visdat 0.0.5.9000 (2017/01/09)

Minor Improvements

  • Added testing for visualisations with vdiffr. Code coverage is now at 99%
  • Fixed up suggestions from goodpractice::gp()
  • Submitted to rOpenSci onboarding
  • paper.md written and submitted to JOSS

visdat 0.0.4.9999 (2017/01/08)

New Feature

  • Added feature flip = TRUE, to vis_dat and vis_miss. This flips the x axis and the ordering of the rows. This more closely resembles a dataframe.
  • vis_miss_ly is a new function that uses plotly to plot missing data, like vis_miss, but interactive, without the need to call plotly::ggplotly on it. It's fast, but at the moment it needs a bit of love on the legend front to maintain the style and features (clustering, etc) of current vis_miss.
  • vis_miss now gains a show_perc argument, which displays the % of missing and complete data. This is switched on by default and addresses issue #19.

New Feature (under development)

  • vis_compare is a new function that allows you to compare two dataframes of the same dimension. It gives a fairly ugly warning if they are not of the same dimension.
  • vis_dat gains a "palette" argument in line with issue 26, with colours drawn from http://colorbrewer2.org/. There are currently three options: "default", "qual", and "cb_safe". "default" provides the ggplot defaults, "qual" uses some colour blind unfriendly colours, and "cb_safe" provides some colours friendly for colour blindness.

Minor Improvements

  • All lines are < 80 characters long
  • removed all instances of 1:nrow(x) and replaced with seq_along(nrow(x)).
  • Updated documentation, improved legend and colours for vis_miss_ly.
  • removed export for vis_dat_ly, as it currently does not work.
  • Removed a lot of unnecessary @importFrom tags, included magrittr in this, and added magrittr to Imports
  • Changes ALL CAPS Headers in news to Title Case
  • Made it clear that vis_guess() and vis_compare are very beta
  • updated documentation in README and vis_dat(), vis_miss(), vis_compare(), and vis_guess()
  • updated pkgdown docs
  • updated DESCRIPTION URL and bug report
  • Changed the default colours of vis_compare to be different to the ggplot2 standards.
  • vis_miss legend labels are created using the internal function miss_guide_label. miss_guide_label will check if data is 100% missing or 100% present and display this in the figure. Additionally, if there is less than 0.1% missing data, "<0.1% missingness" will also be displayed. This sort of gets around issue #18 for the moment.
  • tests have been added for the miss_guide_label legend labels function.
  • Changed legend label for vis_miss, vis_dat, and vis_guess.
  • updated README
  • Added vignette folder (but no vignettes added yet)
  • Added appveyor-CI and travis-CI, addressing issues #22 and #23
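As background to the move away from 1:nrow(x) noted in the list above, the base R sketch below shows how the three forms behave on a data frame with zero rows. It is an illustration only and does not reproduce any visdat internals; empty_df is an illustrative name.

```r
# A data frame with the same columns as airquality but zero rows.
empty_df <- airquality[0, ]

# 1:nrow() counts down from 1 to 0 instead of producing an empty sequence.
colon_idx <- 1:nrow(empty_df)

# seq_len() produces an empty integer vector, which is usually what is wanted.
seq_len_idx <- seq_len(nrow(empty_df))

# seq_along() indexes along its argument; nrow() returns a single number,
# so the result always has length one.
seq_along_idx <- seq_along(nrow(empty_df))

colon_idx
seq_len_idx
seq_along_idx
```

For row indices, seq_len(nrow(x)) is the form that stays correct when x has no rows.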

Bug Fixes

  • Update vis_dat() to use purrr::dmap(fingerprint) instead of mutate_each_(). This solves issue #3 where vis_dat couldn't take variables with spaces in their name.

visdat 0.0.3.9000


New Features

  • Interactivity with plotly::ggplotly! Functions vis_guess(), vis_dat(), and vis_miss() were updated so that you can make them all interactive using the latest dev version of plotly from Carson Sievert.

visdat 0.0.2.9000


New Features

  • Introducing vis_guess(), a function that uses the unexported function collectorGuess from readr.

visdat 0.0.1.9000


New Features

  • vis_miss() and vis_dat() actually run


0.5.3 by Nicholas Tierney, 6 days ago


http://visdat.njtierney.com/, https://github.com/ropensci/visdat


Report a bug at https://github.com/ropensci/visdat/issues


Browse source code at https://github.com/cran/visdat


Authors: Nicholas Tierney [aut, cre], Sean Hughes [rev] (Sean Hughes reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Mara Averick [rev] (Mara Averick reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Stuart Lee [ctb], Earo Wang [ctb], Nic Crane [ctb]




MIT + file LICENSE license


Imports ggplot2, tidyr, dplyr, purrr, readr, magrittr, stats, tibble, glue

Suggests testthat, plotly, knitr, rmarkdown, vdiffr, gdtools, spelling


Imported by PCRedux, naniar.

