Data Checking

Checks column names, column classes, values, keys, joins, vectors and scalars. If the user-defined conditions are met, the function returns an invisible copy of the data frame, vector or scalar; otherwise it throws an informative error.


datacheckr is an R package to check a data frame's rows, column names, column classes, values, unique keys and joins.

There are several existing R packages for checking data frames including assertr, assertive and datacheck. They are great for checking data in scripts but they have several limitations when embedded in functions in packages.

Consider the following code.

library(assertr)
assert(mtcars, within_bounds(0,1), mpg)
#> Error: 
#> Vector 'mpg' violates assertion 'within_bounds' 32 times (e.g. [21] at index 1)

The error message is of little help to a user who is not familiar with the internals of the function that has just thrown it: it names the assertion that failed but not the range of values that was expected.

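The same test using the datacheckr::check_data() function produces an error message that is more likely to allow the end user to diagnose the problem. (The deprecation warning in the output appears because check_data() has been deprecated in favor of its alias check_data1().)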

library(datacheckr)
check_data(mtcars, list(mpg = c(0,1)))
#> Warning: 'check_data' is deprecated.
#> Use 'check_data1' instead.
#> See help("Deprecated")
#> Error: the values in column mpg in mtcars must lie between 0 and 1
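
One way a function can produce a message of this kind is to capture the name of its data argument with substitute(), which is also the default of the data_name argument listed in the NEWS. The base R sketch below is illustrative only: check_range() is a hypothetical helper, not part of datacheckr, and it makes no claim about how check_data() is actually implemented.

```r
# Hypothetical helper (not a datacheckr function): checks that the
# non-missing values of one column lie within a closed range.
check_range <- function(data, column, range, data_name = substitute(data)) {
  # substitute(data) yields the expression the caller supplied (e.g. mtcars);
  # convert it to a string unless the caller passed a name explicitly
  if (!is.character(data_name)) data_name <- deparse(data_name)
  values <- data[[column]]
  if (any(values < range[1] | values > range[2], na.rm = TRUE)) {
    stop("the values in column ", column, " in ", data_name,
         " must lie between ", range[1], " and ", range[2], call. = FALSE)
  }
  invisible(data)  # return the data invisibly so calls can be chained
}

# Capture the error message instead of halting the script
msg <- tryCatch(check_range(mtcars, "mpg", c(0, 1)), error = conditionMessage)
print(msg)
#> [1] "the values in column mpg in mtcars must lie between 0 and 1"
```

Because the message names both the column and the data frame, the user can act on it without reading the source of the function that raised it.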

Consider the data frame data1

data1 <- data.frame(
  Count = c(0L, 3L, 3L, 0L), 
  LocationX = c(2000, NA, 2001, NA), 
  Extra = TRUE)

The following datacheckr code states that data1 must have a column Count of non-missing integers with values of 0, 1 or 3, must not have a column Comments, and may optionally include a column LocationX whose values are either missing or lie between 1012 and 2345.

check_data(data1, list(
  Count = c(0L, 1L, 3L), 
  Comments = NULL, 
  LocationX = c(NA, 2345, 1012),
  LocationX = NULL))
#> Warning: 'check_data' is deprecated.
#> Use 'check_data1' instead.
#> See help("Deprecated")
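
For readers who want the four conditions spelled out, here is a rough base R equivalent. It is a sketch of what the conditions mean as described above, under the assumption that the range is inclusive at both ends, and not a description of how datacheckr performs its checks.

```r
# Base R sketch of the conditions described above (illustrative only).
data1 <- data.frame(
  Count = c(0L, 3L, 3L, 0L),
  LocationX = c(2000, NA, 2001, NA),
  Extra = TRUE)

# Count: required, integer, no missing values, each value one of 0, 1 or 3
stopifnot(
  "Count" %in% colnames(data1),
  is.integer(data1$Count),
  !anyNA(data1$Count),
  all(data1$Count %in% c(0L, 1L, 3L)))

# Comments: must be absent
stopifnot(!"Comments" %in% colnames(data1))

# LocationX: optional; if present, numeric values that are either missing
# or lie between 1012 and 2345 (assumed inclusive)
if ("LocationX" %in% colnames(data1)) {
  x <- data1$LocationX
  stopifnot(
    is.numeric(x),
    all(is.na(x) | (x >= 1012 & x <= 2345)))
}
```

All of these checks pass silently for data1, which matches the check_data() call above returning without an error.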

Producing similar functionality with assertr would require something like the following (please file an issue if the code below can be improved)

library(magrittr) # for the piping operator
data1 %>% assert(in_set(0, 1, 3), Count) %>%
  assert_rows(num_row_NAs, within_bounds(0,0.1), Count)
stopifnot(!"Comments" %in% colnames(data1))
if ("LocationX" %in% colnames(data1))
  data1 %>% assert(within_bounds(1012, 2345), LocationX)

which is, in my opinion, less intuitive.

The above checks can be performed on several data frames simply by defining the list of conditions once and calling check_data() on each data frame.

data3 <- data2 <- data1
 
values <- list(
  Count = c(0L, 1L, 3L), 
  Comments = NULL, 
  LocationX = c(NA, 2345, 1012),
  LocationX = NULL)
 
check_data(data1, values)
#> Warning: 'check_data' is deprecated.
#> Use 'check_data1' instead.
#> See help("Deprecated")
check_data(data2, values)
#> Warning: 'check_data' is deprecated.
#> Use 'check_data1' instead.
#> See help("Deprecated")
check_data(data3, values)
#> Warning: 'check_data' is deprecated.
#> Use 'check_data1' instead.
#> See help("Deprecated")

The same tests using assertr would require the assertr code above to be copied and pasted three times, which is tedious to write and read and, as a result, error prone.

To install the release version from CRAN

install.packages("datacheckr")

Or the development version from GitHub

# install.packages("devtools")
devtools::install_github("poissonconsulting/datacheckr")

Please report any issues.

Pull requests are always welcome.

News

NEWS datacheckr

  • Added function check_unique() to confirm an object doesn't have any duplicated elements.
  • Fixed tests so that they are compatible with testthat v0.11.0.9000
  • Added arguments min_row = 0 and max_row = max_nrow() to check_data to allow checking of the number of rows.

  • Added argument key = character(0) to check_data to allow checking of a unique key.

  • Added stricter variants of check_data called check_data2 and check_data3 as well as an alias check_data1 for check_data and deprecated check_data.

  • Added function check_vector to check the class and values of a vector.

  • Added function check_scalar to check the class and values of a scalar.

  • Added functions check_flag, check_int, check_count, check_string, check_date and check_time for specific scalars.

  • Added function check_data_frame to check if an object is a data frame.

  • Added function check_rows to check the number of rows in a data frame.

  • Added function check_cols to check the names of columns in a data frame.

  • Added function check_values to check the classes and values of the columns in a data frame.

  • Added function check_key to check that particular columns represent unique keys.

  • Added function check_join to check a many-to-one join between two data frames.

  • Added wrapper function max_nrow() to return the theoretical maximum number of rows.

  • Added function min_integer as wrapper for -.Machine$integer.max.

  • Value checking now works with columns inheriting from base classes e.g. ordered factors.

  • Added argument data_name = substitute(data) to all functions so users can override the name of data.

  • Added vignette datacheckr.
  • The copy of the original data frame returned by check_data is now invisible.
  • Added function max_integer as wrapper for .Machine$integer.max.
  • On failure check_data now lists the specific permitted values if there are five or fewer (three or fewer for character values)
  • Initial Release
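
As a rough illustration of what the unique key and many-to-one join checks listed above verify, the following base R sketch may help. The helpers key_strings(), is_unique_key() and is_many_to_one() and the example data frames are hypothetical, not datacheckr functions or data. The sketch assumes that a many-to-one join means the key is unique in the parent and every child row matches a parent row; check_key and check_join may differ in detail.

```r
# Hypothetical helpers written in base R; they are not datacheckr functions.

# Collapse the key columns of each row into a single string for comparison.
key_strings <- function(data, by) {
  do.call(paste, c(unname(as.list(data[by])), sep = "\r"))
}

# TRUE if no two rows share the same combination of values in the key columns.
is_unique_key <- function(data, key) {
  anyDuplicated(key_strings(data, key)) == 0L
}

# TRUE if the key is unique in parent and every child row matches a parent row.
is_many_to_one <- function(child, parent, by) {
  is_unique_key(parent, by) &&
    all(key_strings(child, by) %in% key_strings(parent, by))
}

# Made-up example data: each visit belongs to exactly one site
sites <- data.frame(Site = c("A", "B"), Elevation = c(10, 250))
visits <- data.frame(Site = c("A", "A", "B"), Count = c(0L, 3L, 1L))

is_unique_key(sites, "Site")           # TRUE
is_unique_key(visits, "Site")          # FALSE: site A appears twice
is_many_to_one(visits, sites, "Site")  # TRUE
```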


0.1.2 by Joe Thorley, 4 months ago


https://github.com/poissonconsulting/datacheckr


Browse source code at https://github.com/cran/datacheckr


Authors: Joe Thorley [aut, cre]




MIT + file LICENSE license


Imports dplyr, magrittr

Suggests assertr, knitr, nycflights13, rmarkdown, testthat


Imported by rpdo, rtide.

