Data Checking

Checks column names, classes, values, keys and joins in data frames. Also checks length, class and values of vectors and scalars. Returns an informative error message if user-defined conditions are not met.

stability-stable Travis-CI Build Status AppVeyor Build Status codecov License: CC0 CRAN_Status_Badge CRAN Downloads


datacheckr is an R package to check the column names, classes, values, keys and joins in data frames. It provides an informative error message if a user-defined condition fails to be met otherwise it returns the object (so it can be used in pipes).


Consider the data frame my_data

my_data <- data_frame(
  Count = c(0L, 3L, 3L, 0L, NA), 
  Longitude = c(0, 0, 90, 90, 180), 
  Latitude = c(0, 90, 90.2, 100, -180),
  Type = factor(c("Good", "Bad", "Bad", "Bad", "Bad"), levels = c("Good", "Bad")),
  Extra = TRUE,
  Comments = c("In Greenwich", "Somewhere else", "I'm lost", "I didn't see any", "Help"))
#>   Count Longitude Latitude   Type Extra         Comments
#>   <int>     <dbl>    <dbl> <fctr> <lgl>            <chr>
#> 1     0         0      0.0   Good  TRUE     In Greenwich
#> 2     3         0     90.0    Bad  TRUE   Somewhere else
#> 3     3        90     90.2    Bad  TRUE         I'm lost
#> 4     0        90    100.0    Bad  TRUE I didn't see any
#> 5    NA       180   -180.0    Bad  TRUE             Help


To specify that my_data must contain a column called col1 of class integer use the check_data2 function

check_data2(my_data, values = list(col1 = integer()))
#> Error: my_data must have column 'col1'

Missing Values

To specify that a column cannot include missing values pass a single non-missing value (of the correct class)

check_data2(my_data, list(Count = 1L))
#> Error: column Count in my_data cannot include missing values

To specify that it can include missing values include an NA in the vector

check_data2(my_data, list(Count = c(1L, NA)))

and to specify that it can only include missing values pass an NA (of the correct class)

check_data2(my_data, list(Count = NA_integer_))
#> Error: column Count in my_data can only include missing values

Value Ranges

To indicate that the values must fall within a range use two non-missing values

check_data2(my_data, list(Count = c(0L, 2L)))
#> Error: column Count in my_data cannot include missing values

Specific Values

If particular values are required then specify them as a vector of three or more non-missing values

check_data2(my_data, list(Count = c(1L, 2L, 2L)))
#> Error: column Count in my_data cannot include missing values

The order of the values in an element is unimportant.

Numeric, Date and POSIXct

Numeric, Date and POSIXct columns have exactly the same behaviour regarding ranges and specific values as illustrated above using integers.


With logical values two non-missing values produce the same behaviour as three or more non-missing values.

For example to test for only FALSE values use

check_data2(my_data, list(Extra = c(FALSE, FALSE)))
#> Error: column Extra in my_data can only include FALSE values


The following requires that the values of Comments match both character elements which are treated as regular expressions

check_data2(my_data, list(Comments = c("e", "o")))
#> Error: column Comments in my_data contains strings that do not match both regular expressions 'e' and 'o'

with three or more non-missing character elements each value must match at least one of the elements which are treated as regular expressions.

check_data2(my_data, list(Comments = c("e", "o", "o")))

Regular expressions are matched using grepl with perl=TRUE.


To specify that Type should be a factor that includes "Bad1" and "Good" among its levels

check_data2(my_data, list(Type = factor(c("Bad1", "Good"))))
#> Error: column Type in my_data lacks factor levels 'Bad1' and 'Good'

And to specify the actual factor levels, pass three or more non-missing values

check_data2(my_data, list(Type = factor(c("Bad", "Good", "Good"))))
#> Error: column Type in my_data must be a factor with the levels 'Bad' and 'Good'

Column Names

Whereas check_data2() ignores unnamed columns and doesn't care about the order, check_data3() requires that column names match the names in values.

check_data3(my_data, list(Comments = character()))
#> Error: column names in my_data must be identical to 'Comments'

Missing Columns

In contrast, check_data1() can be used to test that specific columns are missing or that a column satisfies one of multiple conditions.

check_data1(my_data, list(Comments = NULL))
#> Error: my_data must not include column Comments
check_data1(my_data, list(Comments = integer(),
                          Comments = numeric()))
#> Error: column Comments in my_data must be of class 'integer' or 'numeric'

To specify that my_data can contain a column col1 that can be integer or numeric values the call would be

check_data1(my_data, list(
  col1 = integer(), 
  col1 = NULL, 
  col1 = numeric()))

Naming Objects

By default, datacheckr determines the name of an object based on the call. This results in uninformative error messages when used in a pipe

my_data %<>% check_data2(values = list(col1 = integer()))
#> Error: . must have column 'col1'

The argument data_name can be used to define the name

my_data %<>% check_data2(values = list(col1 = integer()), data_name = "d8r")
#> Error: d8r must have column 'col1'

Relational Data

Consider the relational data in the nycflights13 package.


The following code uses the check_data3 function to confirm that airlines has just two columns carrier and name, in that order, which are both character vectors and that carrier is unique (a key).

check_data3(airlines, list(carrier = "", 
                           name = ""),
            key = "carrier")

The next code checks that airports has the listed columns in that order and that faa is a unique character vector of three 'word characters', lat is a number between 0 and 90, alt is an integer between -100 and 10,000, and dst is a character vector with the possible values A, N or U.

check_data3(airports, list(faa = rep("^\\w{3,3}$",2),
                           name = "",
                           lat = c(0, 90),
                           lon = c(-180, 180),
                           alt = as.integer(c(-100, 10^5L)),
                           tz = c(-11, 11),
                           dst = rep("A|N|U", 2),
                           tzone = ""),
            key = "faa")

This checks that planes includes tailnum, engines and year (as using less strict check_data2) and that engines is 1, 2, 3 or 4, that year is an integer between 1956 and 2013 that can include missing values and tailnum (which consists of strings of 5 to 6 letter 'word characters') is the unique key.

check_data2(planes, list(tailnum = rep("^\\w{5,6}$",2),
                         engines = 1:4,
                         year = c(1956L, 2013L, NA)),
            key = "tailnum")

Selecting Columns

Weather has lots of columns. by setting select = TRUE in check_data3 we drop non-named columns and order to match values. The checks indicate that year is only 2013, and like month is a number but day and hour are integers (as expected)

weather %<>% check_data3(list(year = c(2013,2013),
                              month = c(1, 12),
                              day = c(1L, 31L),
                              hour = c(0L, 23L),
                              origin = rep("^\\w{3,3}$",2)),
                 select = TRUE)
#> # A tibble: 26,130 × 5
#>     year month   day  hour origin
#>    <dbl> <dbl> <int> <int>  <chr>
#> 1   2013     1     1     0    EWR
#> 2   2013     1     1     1    EWR
#> 3   2013     1     1     2    EWR
#> 4   2013     1     1     3    EWR
#> 5   2013     1     1     4    EWR
#> 6   2013     1     1     6    EWR
#> 7   2013     1     1     7    EWR
#> 8   2013     1     1     8    EWR
#> 9   2013     1     1     9    EWR
#> 10  2013     1     1    10    EWR
#> # ... with 26,120 more rows


Checking the referential integrity of the (many-to-one) join between flights and airlines is easy.

check_join(flights, airlines, join = "carrier")

In addition to tailnum, flights and planes have additional column with the same name.

check_join(flights, planes, join = "tailnum")
#> Error: flights and planes must not have additional matching columns

We can deal with this by setting extra = TRUE but the data fail referential integrity because we have planes without flights.

check_join(flights, planes, extra = TRUE, join = "tailnum")
#> Error: many-to-one join between flights and planes violates referential integrity


To install the most recent release from CRAN


To install the development version from GitHub

# install.packages("devtools")


Please report any issues.

Pull requests are always welcome.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


NEWS datacheckr


  • Removed check_data which was deprecated in v0.1.0.
  • Removed all dependencies.
  • Moved vignettes into README


  • Add function check_unique() to confirm an object doesn't have any duplicated elements.


  • Fixed tests so that compatible with testthat v0.11.0.9000


  • Added arguments min_row = 0 and max_row = max_nrow() to check_data to allow checking of the number of rows.

  • Added argument key = character(0) to check_data to allow checking of a unique key.

  • Added stricter variants of check_data called check_data2 and check_data3 as well as an alias check_data1 for check_data and deprecated check_data.

  • Added function check_vector to check the class and values of a vector.

  • Added function check_scalar to check the class and values of a scalar.

  • Added functions check_flag, check_int, check_count, check_string, check_date and check_time functions for specific scalars.

  • Added function check_data_frame to check if an object is a data frame.

  • Added function check_rows to check the number of rows in a data frame.

  • Added function check_cols to check the names of columns in a data frame.

  • Added function check_values to check the classes and values of the columns in a data frame.

  • Added function check_key to check that particular columns represent unique keys.

  • Added function check_join to check a many-to-one join between two data frames.

  • Added wrapper function max_nrow() to return the theoretical maximum number of rows.

  • Added function min_integer as wrapper for -.Machine$integer.max.

  • Value checking now works with columns inheriting from base classes i.e. ordered factors.

  • Added argument data_name = substitute(data) to all functions so users can overide the name of data.


  • Added vignette datacheckr.
  • The copy of the original data frame returned by check_data is now invisible.
  • Added function max_integer as wrapper for .Machine$integer.max.
  • On fail check_data now lists specific permitted values if 5 or less (if 3 or less for character)


  • Initial Release

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.


0.2.0 by Joe Thorley, a year ago

Browse source code at

Authors: Joe Thorley [aut, cre]

Documentation:   PDF Manual  

MIT + file LICENSE license

Suggests knitr, nycflights13, rmarkdown, tibble, testthat

Imported by rtide.

See at CRAN