Read flat/tabular text files from disk (or a connection).
The goal of readr is to provide a fast and friendly way to read tabular data into R. The most important functions are:
readr is now available from CRAN.
You can try out the dev version with:
    library(readr)
    library(dplyr)

    mtcars_path <- tempfile(fileext = ".csv")
    write_csv(mtcars, mtcars_path)

    # Read a csv file into a data frame
    read_csv(mtcars_path)

    # Read lines into a vector
    read_lines(mtcars_path)

    # Read whole file into a single string
    read_file(mtcars_path)
See vignette("column-types") for how readr parses columns, and how you can override the defaults.
read_csv() produces a data frame with the following properties:
Characters are never automatically converted to factors (i.e. no more
stringsAsFactors = FALSE).
Valid column names are left as is, not munged into valid R identifiers
(i.e. there is no
check.names = TRUE). Missing column names are filled
X2 etc, and duplicated column names are deduplicated.
The data frame is given class
c("tbl_df", "tbl", "data.frame") so
if you also use dplyr you'll get an
Row names are never set.
If there are any problems parsing the file, the
read_ function will throw a warning telling you how many problems there are. You can then use the
problems() function to access a data frame that gives information about each problem:
    df <- read_csv(col_types = "dd", col_names = c("x", "y"), skip = 1, "1,2\na,b")
    #> Warning message: There were 2 problems. See problems(x) for more details
    problems(df)
    #>   row col expected actual
    #> 1   2   1 a double      a
    #> 2   2   2 a double      b
It's likely that there will be cases that you can never load without some manual regexp-based munging in R. Load those columns with
col_character(), fix them up as needed, then use
type_convert() to re-run the automated conversion on every character column in the data frame. Alternatively, you can use
parse_date() etc to parse a single character vector at a time.
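A minimal sketch of that workflow, assuming a small made-up data frame (the `raw` object and its column names are hypothetical): the columns start out as character, one is cleaned with a regular expression, and type_convert() then re-guesses the types.

```r
library(readr)

# Hypothetical data: every column starts out as character.
raw <- data.frame(
  amount = c("1000 USD", "2500 USD"),
  when = c("2015-01-01", "2015-02-01"),
  stringsAsFactors = FALSE
)

# Manual regexp-based fix: strip the trailing currency code.
raw$amount <- gsub(" USD$", "", raw$amount)

# Re-run readr's automated conversion on every character column.
clean <- type_convert(raw)
str(clean)

# Or parse a single character vector at a time.
dates <- parse_date(c("2015-01-01", "2015-02-01"))
```

After the conversion, `amount` should be numeric and `when` should be a Date column.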
Compared to the corresponding base functions, readr functions:
Use a consistent naming scheme for the parameters (e.g.
Are much faster (up to 10x faster).
Have a helpful progress bar if loading is going to take a while.
All functions work exactly the same way regardless of the current locale.
To override the US-centric defaults, use
data.table has a function similar to
read_csv() called fread. Compared to fread, readr:
Is slower (currently ~1.2-2x slower. If you want absolutely the best
Readr has a slightly more sophisticated parser, recognising both doubled ("""") and backslash escapes ("\""). Readr allows you to read factors and date times directly from disk.
fread() saves you work by automatically guessing the delimiter, whether
or not the file has a header, how many lines to skip by default and
more. Readr forces you to supply these parameters.
The underlying designs are quite different. Readr is designed to be
general, and dealing with new types of rectangular data just requires
implementing a new tokenizer.
fread() is designed to be as fast as
fread() is pure C, readr is C++ (and Rcpp).
The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren't correct, and to make it easier to generate reproducible code. Column specifications are now printed by default when you read from a file:
    challenge <- read_csv(readr_example("challenge.csv"))
    #> Parsed with column specification:
    #> cols(
    #>   x = col_integer(),
    #>   y = col_character()
    #> )
And you can extract those values after the fact with
    spec(challenge)
    #> cols(
    #>   x = col_integer(),
    #>   y = col_character()
    #> )
This makes it easier to quickly identify parsing problems and fix them (#314). If the column specification is long, the new
cols_condense() is used to condense the spec by identifying the most common type and setting it as the default. This is particularly useful when only a handful of columns have a different type (#466).
You can also generate an initial specification without parsing the file using
Once you have figured out the correct column types for a file, it's often useful to make the parsing strict. You can do this either by copying and pasting the printed output, or for very long specs, saving the spec to disk with
write_rds(). In production scripts, combine this with
stop_for_problems() (#465): if the input data changes form, you'll fail fast with an error.
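As a hedged illustration of strict parsing, the sketch below builds a tiny input file (hypothetical, created only for this example), reads it with an explicit specification, and calls stop_for_problems() so that any parsing problem becomes an error. It also round-trips the spec through write_rds() and read_rds(), as suggested for very long specs.

```r
library(readr)

# Hypothetical input file, created here so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), path)

# An explicit specification, e.g. pasted from the printed output.
strict_spec <- cols(
  x = col_integer(),
  y = col_character()
)

df <- read_csv(path, col_types = strict_spec)

# Turns any recorded parsing problems into an error; a clean read passes.
stop_for_problems(df)

# For very long specifications, save the spec to disk and reload it later.
spec_path <- tempfile(fileext = ".rds")
write_rds(strict_spec, spec_path)
reloaded_spec <- read_rds(spec_path)
```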
You can now also adjust the number of rows that readr uses to guess the column types with
    challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
    #> Parsed with column specification:
    #> cols(
    #>   x = col_double(),
    #>   y = col_date(format = "")
    #> )
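The same idea can be tried on a file you build yourself. This is a sketch under stated assumptions: the file below is made up for the example, with a column `y` that is missing for the first 1200 rows and only then holds a date, so a guess based on the first rows alone has nothing to go on.

```r
library(readr)

# Hypothetical file: column y is missing for 1200 rows, then holds a date.
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    x = seq_len(1201),
    y = c(rep(NA, 1200), "2015-01-16"),
    stringsAsFactors = FALSE
  ),
  path
)

# Let readr inspect more rows than the file contains before guessing.
df <- read_csv(path, guess_max = 1500)
spec(df)
```

With every row inspected, `y` should be guessed as a date column.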
You can now access the guessing algorithm from R.
guess_parser() will tell you which parser readr will select for a character vector (#377). We've made a number of fixes to the guessing algorithm:
extdata/challenge.csv which is carefully created to cause
problems with the default column type guessing heuristics.
Blank lines and lines with only comments are now skipped automatically without warning (#381, #321).
Single '-' or '.' are now parsed as characters, not numbers (#297).
Numbers followed by a single trailing character are parsed as character, not numbers (#316).
We now guess at times using the
time_format specified in the
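A quick way to explore the guessing rules listed above is to call guess_parser() on small character vectors. The inputs here are illustrative only, and the exact guess for plain numbers may differ between readr versions.

```r
library(readr)

date_guess <- guess_parser("2010-10-10")  # an ISO8601 date
text_guess <- guess_parser(c("a", "b"))   # plain text
lgl_guess <- guess_parser("TRUE")         # a logical value

# Per the notes above, a lone dash should be guessed as text, not a number.
dash_guess <- guess_parser("-")

print(c(date_guess, text_guess, lgl_guess, dash_guess))
```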
We have made a number of improvements to the reification of the
col_names and the actual data:
If col_types is too long, it is subsetted correctly (#372, @jennybc).
If col_names is too short, the added names are numbered correctly
Missing column names are now given a default name (
X7 etc) (#318).
Duplicated column names are now deduplicated. Both changes generate a warning;
to suppress it supply an explicit
skip = 1 if there's
an existing ill-formed header).
The col_types argument accepts a named list as input (#401).
The date time parsers recognise three new format strings:
%I for 12 hour time format (#340).
%AT are "automatic" date and time parsers. They are both slightly
less flexible than previous defaults. The automatic date parser requires a
four digit year, and only accepts
/ as separators (#442). The
flexible time parser now requires colons between hours and minutes and
optional seconds (#424).
%Y are now strict and require 2 or 4 characters respectively.
Date and time parsing functions received a number of small enhancements:
hms objects rather than a custom
time class (#409).
It now correctly parses missing values (#398).
parse_date() returns a numeric vector (instead of an integer vector) (#357).
parse_datetime() gain an
argument to match all other parsers (#413).
If the format argument is omitted
date and time formats specified in the locale will be used. These now
You can now parse partial dates with
parse_date("2001", "%Y") returns
parse_number() is slightly more flexible - it now parses numbers up to the first ill-formed character. For example
parse_number("...3...") now return -3 and 3 respectively. We also fixed a major bug where parsing negative numbers yielded positive values (#308).
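A few illustrative calls, chosen as assumptions for this sketch rather than taken from the release notes, show the kind of input parse_number() is meant to handle:

```r
library(readr)

# Currency symbol and grouping mark are ignored.
price <- parse_number("$1,234.56")

# A trailing percent sign is dropped.
share <- parse_number("45%")

# Negative numbers keep their sign.
negative <- parse_number("-3")

print(c(price, share, negative))
```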
parse_logical() now accepts
1 as well as lowercase
read_file_raw() reads a complete file into a single raw vector (#451).
read_*() functions gain a
quoted_na argument to control whether missing
values within quotes are treated as missing values or as strings (#295).
write_excel_csv() can be used to write a csv file with a UTF-8 BOM at the
start, which forces Excel to read it as UTF-8 encoded (#375).
write_lines() writes a character vector to a file (#302).
write_file() to write a single character or raw vector
to a file (#474).
Experimental support for chunked reading and writing (
functions. The API is unstable and subject to change in the future (#427).
Printing double values now uses an implementation of the grisu3 algorithm which speeds up writing of large numeric data frames by ~10X. (#432) '.0' is appended to whole number doubles, to ensure they will be read as doubles as well. (#483)
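The reading and writing helpers listed above can be combined into a small round trip. This sketch uses temporary files and made-up content; it only checks that what is written can be read back, and that write_excel_csv() starts the file with the UTF-8 byte order mark.

```r
library(readr)

# Write a character vector, one element per line, and read it back.
lines_path <- tempfile(fileext = ".txt")
write_lines(c("first", "second"), lines_path)
lines_back <- read_lines(lines_path)

# Write a single string, then read it back as text and as raw bytes.
file_path <- tempfile(fileext = ".txt")
write_file("one string\nwith two lines\n", file_path)
text_back <- read_file(file_path)
raw_back <- read_file_raw(file_path)

# write_excel_csv() prefixes the file with a UTF-8 byte order mark.
csv_path <- tempfile(fileext = ".csv")
write_excel_csv(data.frame(x = 1:2), csv_path)
bom <- read_file_raw(csv_path)[1:3]
```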
readr imports tibble so that you get consistent
default_locale() now sets the default locale in
rather than regenerating it for each call. (#416).
locale() now automatically sets decimal mark if you set the grouping
mark. It throws an error if you accidentally set decimal and grouping marks
to the same character (#450).
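A short sketch of both behaviours, using illustrative marks: setting only the grouping mark, and then deliberately setting both marks to the same character inside tryCatch() so the error can be observed without stopping the script.

```r
library(readr)

# Setting only the grouping mark: the decimal mark is chosen to differ.
loc <- locale(grouping_mark = ".")
print(loc$decimal_mark)

# Setting both marks to the same character is rejected with an error.
same_mark_fails <- tryCatch(
  {
    locale(decimal_mark = ",", grouping_mark = ",")
    FALSE
  },
  error = function(e) TRUE
)
print(same_mark_fails)
```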
read_*() can read into long vectors, substantially increasing the
number of rows you can read (#309).
read_*() functions return empty objects rather than signaling an error
when run on an empty file (#356, #441).
read_delim() gains a
trim_ws argument (#312, @noamross)
read_fwf() received a number of improvements:
read_fwf() can now reliably read only a partial set of columns
(#322, #353, #469)
fwf_widths() accepts negative column widths for compatibility with the
widths argument in
read.fwf() (#380, @leeper).
You can now read fixed width files with ragged final columns, by setting
the final end position in
fwf_positions() or final width in
NA (#353, @ghaarsma).
fwf_empty() does this automatically.
fwf_empty() can now skip commented lines by setting a
comment argument (#334).
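The fixed width helpers above can be sketched as follows. The files, widths, and column names (`id`, `n`) are hypothetical and exist only for this example; the second file has a ragged final column, which is handled by giving NA as the last width or the last end position.

```r
library(readr)

# A regular fixed width file: 5 characters, then 3 characters.
path <- tempfile(fileext = ".txt")
writeLines(c("AAAAA123", "BBBBB456"), path)
fixed <- read_fwf(
  path,
  fwf_widths(c(5, 3), c("id", "n")),
  col_types = "ci"
)

# A ragged final column: the last width is NA, so it runs to the line end.
ragged_path <- tempfile(fileext = ".txt")
writeLines(c("AAAAA1", "BBBBB4567"), ragged_path)
ragged <- read_fwf(
  ragged_path,
  fwf_widths(c(5, NA), c("id", "n")),
  col_types = "ci"
)

# The same layout described with start and end positions.
by_position <- read_fwf(
  ragged_path,
  fwf_positions(c(1, 6), c(5, NA), c("id", "n")),
  col_types = "ci"
)
```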
read_lines() ignores embedded nulls in strings (#338) and gains a
readr_example() makes it easy to access example files bundled with readr.
type_convert() now accepts only
NULL or a
cols specification for
write_csv() now invisibly returns the input data frame
(as documented, #363).
Doubles are parsed with
boost::spirit::qi::long_double to work around a bug
in the spirit library when parsing large numbers (#412).
Fix bug when detecting column types for single row files without headers (#333).
readr now has a strategy for dealing with settings that vary from place to place: locales. The default locale is still US centric (because R itself is), but you can now easily override the default timezone, decimal separator, grouping mark, day & month names, date format, and encoding. This has led to a number of changes:
all gain a
locale() controls all the input settings that vary from place-to-place.
parse_euro_double() have been deprecated.
decimal_mark parameter to
The default encoding is now UTF-8. To load files that are not
in UTF-8, set the
encoding parameter of the
guess_encoding() function uses stringi to help you figure out the
encoding of a file.
%b use the
month names (full and abbreviated) defined in the locale (#242).
They also inherit the tz from the locale, rather than using an
vignette("locales") for more details.
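A brief sketch of how a locale changes parsing, using illustrative inputs: a decimal comma, swapped grouping and decimal marks, and French month names from the built-in "fr" date names.

```r
library(readr)

# Decimal comma, as used in much of Europe.
decimal_comma <- parse_double("1,23", locale = locale(decimal_mark = ","))

# Grouping and decimal marks swapped relative to the US-centric default.
swapped <- parse_number(
  "1.234,56",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

# French month names.
french_date <- parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

print(decimal_comma)
print(swapped)
print(french_date)
```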
cols() lets you pick the default column type for columns not otherwise
explicitly named (#148). You can refer to parsers either with their full
col_character()) or their one letter abbreviation (e.g.
cols_only() allows you to load only named columns. You can also choose to
override the default column type in
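To make the two helpers concrete, here is a minimal sketch on a made-up three-column file. cols() sets a default type for columns that are not named, while cols_only() keeps only the named columns; the one letter abbreviations "i" and "d" stand for integer and double.

```r
library(readr)

# Hypothetical input file with three columns.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y,z", "1,a,2.5", "2,b,3.5"), path)

# Every column defaults to character unless named explicitly.
df_default <- read_csv(
  path,
  col_types = cols(.default = col_character(), x = col_integer())
)

# Only the named columns are loaded; y is dropped.
df_only <- read_csv(path, col_types = cols_only(x = "i", z = "d"))
```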
read_fwf() is now much more careful with new lines. If a line is too short,
you'll get a warning instead of a silent mistake (#166, #254). Additionally,
the last column can now be ragged: the width of the last field is silently
extended until it hits the next line break (#146). This appears to be a
common feature of "fixed" width files in the wild.
comment argument allows you to ignore comments (#68).
trim_ws argument controls whether leading and trailing whitespace is
removed. It defaults to
Specifying the wrong number of column names, or having rows with an unexpected number of columns, generates a warning, rather than an error (#189).
Multiple NA values can be specified by passing a character vector to
na (#125). The default has been changed to
na = c("", "NA"). Specifying
na = "" now works as expected with character columns (#114).
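For instance, a file that marks missing values with several different sentinels can be read by listing them all in na. The file and the sentinel strings "-999" and "missing" are assumptions made up for this sketch.

```r
library(readr)

# Hypothetical file using two different markers for missing values.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "-999,missing"), path)

# Both sentinels are converted to NA, in numeric and character columns.
df <- read_csv(path, col_types = "dc", na = c("-999", "missing"))
print(df)
```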
vignette("column-types") which describes how the defaults work and how to override them (#122).
parse_character() gains better support for embedded nulls: any characters
after the first null are dropped with a warning (#202).
parse_double() no longer silently ignore trailing
letters after the number (#221).
col_time() allows you to parse times (hours, minutes,
seconds) into number of seconds since midnight. If the format is omitted, it
uses a flexible parser that looks for hours, then optional colon, then
minutes, then optional colon, then optional seconds, then optional am/pm
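A small sketch of time parsing, with illustrative inputs: one time parsed with the default parser, and one parsed with an explicit 12 hour format. Converting the result with as.numeric() gives the number of seconds since midnight.

```r
library(readr)

# Default parser: hours, minutes, and seconds separated by colons.
morning <- parse_time("10:30:15")
morning_seconds <- as.numeric(morning)

# Explicit format with a 12 hour clock and an AM/PM marker.
afternoon <- parse_time("02:15 PM", "%I:%M %p")
afternoon_seconds <- as.numeric(afternoon)

print(c(morning_seconds, afternoon_seconds))
```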
parse_datetime() no longer incorrectly reads partial dates (e.g. 19,
1900, 1900-01) (#136). These triggered common false positives and after
re-reading the ISO8601 spec, I believe they actually refer to periods of
time, and should not be translated into a specific instant (#228).
Compound formats "%D", "%F", "%R", "%X", "%T", "%x" are now parsed correctly, instead of using the ISO8601 parser (#178, @kmillar).
"%." now requires a non-digit. New "%+" skips one or more non-digits.
You can now use
%p to refer to AM/PM (and am/pm) (#126).
%B formats (month and abbreviated month name) ignore case
when matching (#219).
Local (non-UTC) times with and without daylight savings are now parsed correctly (#120, @andres-s).
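The format strings discussed above can be tried directly with the parse_*() functions. The inputs below are illustrative assumptions, not taken from the release notes.

```r
library(readr)

# Day/month/year with slashes.
slash_date <- parse_date("01/02/2010", "%d/%m/%Y")

# Full month name, matched against the locale's month names.
named_date <- parse_date("1 January 2015", "%d %B %Y")

# 12 hour clock with an AM/PM marker.
afternoon <- parse_datetime("2015-02-01 03:45 PM", "%Y-%m-%d %I:%M %p")

print(slash_date)
print(named_date)
print(afternoon)
```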
parse_number() is a somewhat flexible numeric parser designed to read
currencies and percentages. It only reads the first number from a string
(using the grouping mark defined by the locale).
parse_numeric() has been deprecated because the name is confusing -
it's a flexible number parser, not a parser of "numerics", as R collectively
calls doubles and integers. Use
As well as improvements to the parser, I've also made a number of tweaks to the heuristics that readr uses to guess column types:
col_guess() to explicitly guess column type.
Bumped up row inspection for column type guessing from 100 to 1000.
The heuristics for guessing
col_double() are stricter.
Numbers with leading zeros now default to being parsed as text, rather than
as integers/doubles (#266).
A column is guessed as
col_number() only if it parses as a regular number
when you ignore the grouping marks.
Now use R's platform independent
iconv wrapper, thanks to BDR (#149).
Pathological zero row inputs (due to empty input,
return zero row data frames (#119).
When guessing field types, and there's no information to go on, use character instead of logical (#124, #128).
col_types specification now understands
? (guess) and
- (skip) (#188).
count_fields() starts counting from 1, not 0 (#200).
format_delim() make it easy to render a csv or
delimited file into a string.
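As a small illustration, the sketch below renders a made-up data frame to a string, once as csv and once with a pipe delimiter.

```r
library(readr)

# Hypothetical data frame used only for this example.
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

# Render to a single string instead of writing to a file.
csv_text <- format_csv(df)
pipe_text <- format_delim(df, delim = "|")

cat(csv_text)
cat(pipe_text)
```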
fwf_empty() now works correctly when
col_names supplied (#186, #222).
parse_*() gains a
na argument that allows you to specify which values
should be converted to missing.
problems() now reports column names rather than column numbers (#143).
Whenever there is a problem, the first five problems are printed out
in a warning message, so you can more easily see what's wrong.
read_*() throws a warning instead of an error if
specifies a non-existent column (#145, @alyst).
read_*() can read from a remote gz compressed file (#163).
read_delim() defaults to
escape_backslash = FALSE and
escape_double = TRUE for consistency.
n_max also affects the number
of rows read to guess the column types (#224).
read_lines() gains a progress bar. It now also correctly checks for
interrupts every 500,000 lines so you can interrupt long running jobs.
It also correctly estimates the number of lines in the file, considerably
speeding up the reading of large files (60s -> 15s for a 1.5 GB file).
read_lines_raw() allows you to read a file into a list of raw vectors,
one element for each line.
trim_ws arguments, and removes missing
values before determining column types.
write_rds() all invisibly return their
input so you can use them in a pipe (#290).
write_csv() to write any delimited format (#135).
write_tsv() is a helpful wrapper for tab separated files.
Quotes are only used when they're needed (#116): when the string contains a quote, the delimiter, a new line or NA.
Double vectors are saved using same amount of precision as
na argument that specifies how missing values should be written
POSIXt vectors are saved in an ISO8601 compatible format (#134).
No longer fails silently if it can't open the target for writing (#193, #172).
read_rds() wrap around
defaulting to no compression (#140, @nicolasCoutin).