Simple Tools for Examining and Cleaning Dirty Data

The main janitor functions can: perfectly format data.frame column names; provide quick counts of variable combinations (i.e., frequency tables and crosstabs); and isolate duplicate records. Other janitor functions nicely format the tabulation results. These tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel. This package follows the principles of the "tidyverse" and works well with the pipe function %>%. janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.


-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014

janitor


Travis-CI Build Status Coverage Status CRAN_Status_Badge !Monthly Downloads !Downloads

janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.

The main janitor functions:

  • perfectly format data.frame column names;
  • provide quick one- and two-variable tabulations (i.e., frequency tables and crosstabs); and
  • isolate partially-duplicate records.

Other janitor functions nicely format the results of these tabulations. Together, these tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.

janitor is a #tidyverse-oriented package. Specifically, it plays nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.

You can install:

  • the latest released version from CRAN with

    install.packages("janitor")
  • the latest development version from GitHub with

    if (packageVersion("devtools") < 1.6) {
      install.packages("devtools")
    }
    devtools::install_github("sfirke/janitor")

Using janitor

Below are quick examples of how janitor tools are commonly used. A full description of each function can be found in janitor's catalog of functions.

Cleaning dirty data

Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx: All kinds of dirty.

Dirtiness includes:

  • Dreadful column names
  • Rows and columns containing Excel formatting but no data
  • Dates stored as numbers
  • Values spread inconsistently over the "Certification" columns

Here's that data after being read in to R:

library(pacman) # for loading packages
p_load(readxl, janitor, dplyr)
 
roster_raw <- read_excel("dirty_data.xlsx") # available at http://github.com/sfirke/janitor
glimpse(roster_raw)
#> Observations: 17
#> Variables: 12
#> $ First Name        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA,...
#> $ Last Name         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamar...
#> $ Employee Status   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teach...
#> $ Subject           <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "...
#> $ Hire Date         <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700...
#> $ % Allocated       <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80, NA, NA,...
#> $ Full time?        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "N...
#> $ do not edit! ---> <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ Certification     <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-...
#> $ Certification     <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA,...
#> $ Certification     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $                   <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

Excel formatting led to an untitled empty column and 5 empty rows at the bottom of the table (only 12 records have any actual data). Bad column names are preserved.

Clean it with janitor functions:

roster <- roster_raw %>%
  clean_names() %>%
  remove_empty_rows() %>%
  remove_empty_cols() %>%
  mutate(hire_date = excel_numeric_to_date(hire_date)) %>%
  select(-certification, -certification_2) # drop unwanted columns
 
roster
#> # A tibble: 12 × 7
#>      first_name last_name employee_status    subject  hire_date percent_allocated full_time
#>           <chr>     <chr>           <chr>      <chr>     <date>             <dbl>     <chr>
#>  1        Jason    Bourne         Teacher         PE 2008-08-30              0.75       Yes
#>  2        Jason    Bourne         Teacher   Drafting 2008-08-30              0.25       Yes
#>  3       Alicia      Keys         Teacher      Music 2001-08-15              1.00       Yes
#>  4          Ada  Lovelace         Teacher       <NA> 1975-05-01              1.00       Yes
#>  5        Desus      Nice  Administration       Dean 2013-06-06              1.00       Yes
#>  6 Chien-Shiung        Wu         Teacher    Physics 1930-03-20              0.50       Yes
#>  7 Chien-Shiung        Wu         Teacher  Chemistry 1930-03-20              0.50       Yes
#>  8        James     Joyce         Teacher    English 1990-05-01              0.50        No
#>  9         Hedy    Lamarr         Teacher    Science 1976-06-08              0.50        No
#> 10       Carlos    Boozer           Coach Basketball 2015-08-05                NA        No
#> 11        Young    Boozer           Coach       <NA> 1995-01-01                NA        No
#> 12      Micheal    Larsen         Teacher    English 2009-09-15              0.80        No

The core janitor cleaning function is clean_names() - call it whenever you load data into R.

Examining dirty data

Finding duplicates

Use get_dupes() to identify and examine duplicate records during data cleaning. Let's see if any teachers are listed more than once:

roster %>% get_dupes(first_name, last_name)
#> # A tibble: 4 × 8
#>     first_name last_name dupe_count employee_status   subject  hire_date percent_allocated full_time
#>          <chr>     <chr>      <int>           <chr>     <chr>     <date>             <dbl>     <chr>
#> 1 Chien-Shiung        Wu          2         Teacher   Physics 1930-03-20              0.50       Yes
#> 2 Chien-Shiung        Wu          2         Teacher Chemistry 1930-03-20              0.50       Yes
#> 3        Jason    Bourne          2         Teacher        PE 2008-08-30              0.75       Yes
#> 4        Jason    Bourne          2         Teacher  Drafting 2008-08-30              0.25       Yes

Yes, some teachers appear twice. We ought to address this before counting employees.

Tabulating tools

janitor has several functions for quick counts. The big ones are tabyl() and crosstab().

Notably, they can be called two ways:

  • On vectors - e.g., tabyl(roster$subject)
  • On a piped-in data.frame: roster %>% tabyl(subject).
    • This allows for dplyr commands earlier in the pipeline
tabyl()

Like table(), but pipe-able and fully featured.

roster %>%
  tabyl(subject)
#>       subject n    percent valid_percent
#> 1  Basketball 1 0.08333333           0.1
#> 2   Chemistry 1 0.08333333           0.1
#> 3        Dean 1 0.08333333           0.1
#> 4    Drafting 1 0.08333333           0.1
#> 5     English 2 0.16666667           0.2
#> 6       Music 1 0.08333333           0.1
#> 7          PE 1 0.08333333           0.1
#> 8     Physics 1 0.08333333           0.1
#> 9     Science 1 0.08333333           0.1
#> 10       <NA> 2 0.16666667            NA
crosstab()
roster %>%
  filter(hire_date > as.Date("1950-01-01")) %>%
  crosstab(employee_status, full_time)
#>   employee_status No Yes
#> 1  Administration  0   1
#> 2           Coach  2   0
#> 3         Teacher  3   4
Prettifying

Other janitor functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:

roster %>%
  tabyl(employee_status, sort = TRUE) %>%
  adorn_totals("row")
#>   employee_status  n    percent
#> 1         Teacher  9 0.75000000
#> 2           Coach  2 0.16666667
#> 3  Administration  1 0.08333333
#> 4           Total 12 1.00000000
 
roster %>%
  crosstab(full_time, employee_status) %>%
  adorn_crosstab(denom = "col", show_totals = TRUE)
#>   full_time Administration      Coach   Teacher     Total
#> 1        No       0.0% (0) 100.0% (2) 33.3% (3) 41.7% (5)
#> 2       Yes     100.0% (1)   0.0% (0) 66.7% (6) 58.3% (7)

Together, these tabulation functions reduce R's deficit against Excel and SPSS when it comes to quick, informative counts.

Contact me

You are welcome to:

News

NEWS

janitor 0.3.1 (Release date: 2018-01-04)

Release summary

This is a bug-fix release with no new functionality or changes. It fixes a bug where adorn_crosstab() failed if the tibble package was version > 1.4.

Major changes to janitor are currently in development on GitHub and will be released soon. This is not that next big release.

janitor 0.3.0 (Release date: 2017-05-06)

Release summary

The primary purpose of this release is to maintain accuracy given the changes to the dplyr package, upon which janitor is built, in dplyr version 0.6.0. This update also contains a number of minor improvements.

Critical: if you update the package dplyr to version 0.6.0, you must update janitor to version 0.3.0 to ensure accurate results from janitor's tabyl() function. This is due to a change in the behavior of dplyr's _join functions (discussed in #111).

janitor 0.3.0 is compatible with this new version of dplyr as well as old versions of dplyr back to 0.5.0. That is, updating janitor to 0.3.0 does not necessitate an update to dplyr 0.6.0.

Breaking changes

  • The functions add_totals_row and add_totals_col were combined into a single function, adorn_totals(). (#57). The add_totals_ functions are now deprecated and should not be used.
  • The first argument of adorn_crosstab() is now "dat" instead of "crosstab" (indicating that the function can be called on any data.frame, not just a result of crosstab())

Features

Major

  • Exported the %>% pipe from magrittr (#107).

Deprecated the following functions:

  • use_first_valid_of() - use dplyr::coalesce() instead
  • convert_to_NA() - use dplyr::na_if() instead
  • add_totals_row() and add_totals_col() - replaced by the single function adorn_totals()

Minor

  • adorn_totals() and ns_to_percents() can now be called on data.frames that have non-numeric columns beyond the first one (those columns will be ignored) (#57)
  • adorn_totals("col") retains factor class in 1st column if 1st column in the input data.frame was a factor

Bug fixes

  • clean_names() now handles leading spaces (#85)
  • adorn_crosstab() and ns_to_percents() work on a 2-column data.frame (#89)
  • adorn_totals() now works on a grouped tibble (#97)
  • Long variable names with spaces no longer break tabyl() and crosstab() (#87)
  • An NA_ column in the result of a crosstab() will appear at the last column position (#109)

janitor 0.2.1 (Release date: 2016-10-30)

Bug fixes

  • tabyl() and crosstab() now appear in the package manual (#65)
  • Fixed minor bug per CRAN request - tabyl() and crosstab() failed to retain ill-formatted variable names only when using R 3.2.5 for Windows (#76)
  • add_totals_row() works on two-column data.frame (#69)
  • use_first_valid_of() returns POSIXct-class result when given POSIXct inputs

janitor 0.2.0 (Release date: 2016-10-03)

Features

Major

Submitted to CRAN!

Minor

  • The count in tabyl() for factor levels that aren't present is now 0 instead of NA (#48)

Bug fixes

  • Can call tabyl() on the result of a tabyl(), e.g., mtcars %>% tabyl(mpg) %>% tabyl(n) (#54)
  • get_dupes() now works on variables with spaces in column names (#62)

Package management

  • Reached 100% unit test code coverage

janitor 0.1.2

Features

Major

  • Added a function adorn_crosstab() that formats the results of a crosstab() for pretty printing. Shows % and N in the same cell, with the % symbol, user-specified rounding (method and number of digits), and the option to include a totals row and/or column. E.g., mtcars %>% crosstab(cyl, gear) %>% adorn_crosstab().
  • crosstab() can be called in a %>% pipeline, e.g., mtcars %>% crosstab(cyl, gear). Thanks to @chrishaid (#34)
  • tabyl() can also be called in a %>% pipeline, e.g., mtcars %>% tabyl(cyl) (#35)
  • Added use_first_valid_of() function (#32)
  • Added minor functions for manipulating numeric data.frames for presentation: ns_to_percents(), add_totals_row(), add_totals_col(),

Minor

  • crosstab() returns 0 instead of NA when there are no instances of a variable combination.
  • A call like tabyl(df$vecname) retains the more-descriptive $ symbol in the column name of the result - if you want a legal R name in the result, call it as df %>% tabyl(vecname)
  • Single and double quotation marks are handled by clean_names()

Package management

  • Added codecov to measure test coverage
  • Added unit test coverage
  • Added Travis-CI for continuous integration

janitor 0.1 (Release date: 2016-04-17)

  • Initial draft of skeleton package on GitHub

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("janitor")

1.0.0 by Sam Firke, 2 months ago


https://github.com/sfirke/janitor


Report a bug at https://github.com/sfirke/janitor/issues


Browse source code at https://github.com/cran/janitor


Authors: Sam Firke [aut, cre], Chris Haid [ctb], Ryan Knight [ctb]


Documentation:   PDF Manual  


MIT + file LICENSE license


Imports dplyr, tidyr, snakecase, magrittr, purrr, rlang

Suggests testthat, knitr, rmarkdown


Imported by ballr, bomrang, moderndive.

Suggested by fivethirtyeight.


See at CRAN