A consistent, simple and easy-to-use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NAs and zero-length vectors in the same way, and the output from one function is easy to feed into the input of another.
Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you’re not familiar with strings, the best place to start is the chapter on strings in R for Data Science.
stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine. If you find that stringr is missing a function that you need, try looking in stringi. Both packages share similar conventions, so once you’ve mastered stringr, you should find stringi similarly easy to use.
# Install the released version from CRAN:
install.packages("stringr")
# Install the cutting edge development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/stringr")
All functions in stringr start with str_ and take a vector of strings as the first argument.
x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x)
#> [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
#> [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
#> [1] "wh" "vi" "cr" "ex" "de" "au"
Most string functions work with regular expressions, a concise language for describing patterns of text. For example, the regular expression "[aeiou]" matches any single character that is a vowel:
str_subset(x, "[aeiou]")
#> [1] "video" "cross" "extra" "deal" "authority"
str_count(x, "[aeiou]")
#> [1] 0 3 1 2 2 4
There are eight main verbs that work with patterns:
str_detect(x, pattern) tells you if there’s any match to the pattern.
str_detect(x, "[aeiou]")
#> [1] FALSE TRUE TRUE TRUE TRUE TRUE
str_count(x, pattern) counts the number of matches.
str_count(x, "[aeiou]")
#> [1] 0 3 1 2 2 4
str_subset(x, pattern) extracts the matching components.
str_subset(x, "[aeiou]")
#> [1] "video" "cross" "extra" "deal" "authority"
str_locate(x, pattern) gives the position of the match.
str_locate(x, "[aeiou]")
#>      start end
#> [1,]    NA  NA
#> [2,]     2   2
#> [3,]     3   3
#> [4,]     1   1
#> [5,]     2   2
#> [6,]     1   1
str_extract(x, pattern) extracts the text of the match.
str_extract(x, "[aeiou]")
#> [1] NA "i" "o" "e" "e" "a"
str_match(x, pattern) extracts parts of the match defined by parentheses.
# extract the characters on either side of the vowel
str_match(x, "(.)[aeiou](.)")
#>      [,1]  [,2] [,3]
#> [1,] NA    NA   NA
#> [2,] "vid" "v"  "d"
#> [3,] "ros" "r"  "s"
#> [4,] NA    NA   NA
#> [5,] "dea" "d"  "a"
#> [6,] "aut" "a"  "t"
str_replace(x, pattern, replacement) replaces the matches with new text.
str_replace(x, "[aeiou]", "?")
#> [1] "why" "v?deo" "cr?ss" "?xtra" "d?al" "?uthority"
str_split(x, pattern) splits up a string into multiple pieces.
str_split(c("a,b", "c,d,e"), ",")
#> [[1]]
#> [1] "a" "b"
#>
#> [[2]]
#> [1] "c" "d" "e"
As well as regular expressions (the default), there are three other pattern matching engines:
fixed(): match exact bytes
coll(): match human letters
boundary(): match boundaries
The RegExplain RStudio addin provides a friendly interface for working with regular expressions and functions from stringr. This addin allows you to interactively build your regexp, check the output of common string matching functions, consult the interactive help pages, or use the included resources to learn regular expressions.
This addin can easily be installed with devtools:
# install.packages("devtools")
devtools::install_github("gadenbuie/regexplain")
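Returning to the three alternative matching engines listed above, the following is a minimal sketch of how each one changes what counts as a match. The input strings and variable names are made up for illustration.

```r
library(stringr)

# Default regex engine: "." is a metacharacter that matches any character.
regex_counts <- str_count(c("a.b", "axb"), ".")

# fixed(): the pattern is compared literally, so "." only matches a dot.
fixed_counts <- str_count(c("a.b", "axb"), fixed("."))

# coll(): comparison follows locale-aware collation rules; here it is used
# for a case-insensitive comparison.
coll_hit <- str_detect("I", coll("i", ignore_case = TRUE))

# boundary(): match boundaries (here, between words) instead of a text pattern.
word_pieces <- str_split("The quick brown fox", boundary("word"))[[1]]

print(regex_counts)
#> [1] 3 3
print(fixed_counts)
#> [1] 1 0
print(coll_hit)
#> [1] TRUE
print(word_pieces)
#> [1] "The"   "quick" "brown" "fox"
```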
R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R.
Uses consistent function and argument names. The first argument is always the vector of strings to modify, which makes stringr work particularly well in conjunction with the pipe:
letters %>%
  .[1:10] %>%
  str_pad(3, "right") %>%
  str_c(letters[2:11])
#> [1] "a b" "b c" "c d" "d e" "e f" "f g" "g h" "h i" "i j" "j k"
Simplifies string operations by eliminating options that you don’t need 95% of the time.
Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs.
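The missing-in, missing-out and zero-length rules described above can be checked with a short sketch. The vector below is an illustrative assumption, not data from the package.

```r
library(stringr)

# Hypothetical input with a missing value in the middle.
mixed <- c("apple", NA, "kiwi")

# Missing inputs give missing outputs at the same position.
lengths_out <- str_length(mixed)
upper_out <- str_to_upper(mixed)

# Zero-length inputs give zero-length outputs.
empty_lengths <- str_length(character(0))
empty_detect <- str_detect(character(0), "a")

print(lengths_out)
#> [1]  5 NA  4
print(upper_out)
#> [1] "APPLE" NA      "KIWI"
print(length(empty_lengths))
#> [1] 0
print(length(empty_detect))
#> [1] 0
```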
str_interp() now renders lists consistently, independent of the presence of additional placeholders (@amhrasmussen).
New str_starts() and str_ends() functions to detect patterns at the beginning or end of strings (@jonthegeek, #258).
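As a small illustration of str_starts() and str_ends(), assuming a made-up vector of fruit names:

```r
library(stringr)

# Hypothetical input; the name avoids clashing with stringr's own `fruit` dataset.
fruit_names <- c("apple", "banana", "pear")

starts_with_a <- str_starts(fruit_names, "a")
ends_with_a <- str_ends(fruit_names, "a")

print(starts_with_a)
#> [1]  TRUE FALSE FALSE
print(ends_with_a)
#> [1] FALSE  TRUE FALSE
```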
str_subset(), str_detect(), and str_which() gain a negate argument, which is useful when you want the elements that do NOT match (#259, @yutannihilation).
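A brief sketch of the negate argument on all three functions, using an illustrative three-word vector:

```r
library(stringr)

some_words <- c("why", "video", "cross")

# Keep, flag, or index the elements that contain NO vowel.
no_vowel_values <- str_subset(some_words, "[aeiou]", negate = TRUE)
no_vowel_flags <- str_detect(some_words, "[aeiou]", negate = TRUE)
no_vowel_index <- str_which(some_words, "[aeiou]", negate = TRUE)

print(no_vowel_values)
#> [1] "why"
print(no_vowel_flags)
#> [1]  TRUE FALSE FALSE
print(no_vowel_index)
#> [1] 1
```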
New str_to_sentence() function to capitalize with sentence case (@jonthegeek, #202).
str_replace_all() with a named vector now respects modifier functions (#207).
str_trunc() is once again vectorised correctly (#203, @austin3dickey).
str_view() handles NA values more gracefully (#217). I've also tweaked the sizing policy so hopefully it should work better in notebooks, while preserving the existing behaviour in knit documents (#232).
Error : object ‘ignore.case’ is not exported by 'namespace:stringr'. This error occurs because the long deprecated str_join(), ignore.case() and perl() have now been removed.
str_glue() and str_glue_data() provide convenient wrappers around glue() and glue_data() from the glue package (#157).
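A minimal sketch of the two glue wrappers. The variables and the small data frame are illustrative assumptions.

```r
library(stringr)

# str_glue() interpolates variables from the calling environment.
name <- "Fred"
age <- 50
greeting <- str_glue("My name is {name} and I am {age} years old.")

# str_glue_data() looks names up in a data frame (or list) instead.
stock <- data.frame(item = c("a", "b"), n = c(1, 2))
stock_lines <- str_glue_data(stock, "{item}: {n}")

print(as.character(greeting))
#> [1] "My name is Fred and I am 50 years old."
print(as.character(stock_lines))
#> [1] "a: 1" "b: 2"
```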
str_flatten() is a wrapper around stri_flatten() and clearly conveys flattening a character vector into a single string (#186).
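For illustration, a short sketch of str_flatten() with and without a separator:

```r
library(stringr)

pieces <- c("a", "b", "c")

# Without a separator the elements are simply concatenated.
joined <- str_flatten(pieces)

# The second argument (collapse) is placed between elements.
joined_with_commas <- str_flatten(pieces, ", ")

print(joined)
#> [1] "abc"
print(joined_with_commas)
#> [1] "a, b, c"
```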
str_remove() and str_remove_all() functions. These wrap str_replace() and str_replace_all() to remove patterns from strings. (@Shians, #178)
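A small sketch of the difference between the two removal functions, on an illustrative input:

```r
library(stringr)

# str_remove() drops only the first match; str_remove_all() drops every match.
first_removed <- str_remove("banana", "a")
all_removed <- str_remove_all("banana", "a")

print(first_removed)
#> [1] "bnana"
print(all_removed)
#> [1] "bnn"
```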
str_squish() removes spaces from both the left and right side of strings, and also converts multiple spaces (or space-like characters) to a single space within strings (@stephlocke, #197).
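As a quick illustration, assuming a made-up string with irregular spacing:

```r
library(stringr)

# Leading and trailing whitespace is dropped; interior runs collapse to one space.
squished <- str_squish("  too   many    spaces  ")

print(squished)
#> [1] "too many spaces"
```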
str_sub() gains omit_na argument for ignoring NA. Accordingly, str_replace() now ignores NAs and keeps the original strings. (@yutannihilation, #164)
str_trunc() now preserves NAs (@ClaytonJY, #162)
str_trunc() now throws an error when width is shorter than ellipsis (@ClaytonJY, #163).
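The two str_trunc() behaviours above can be sketched as follows. The sample sentence and the "error" marker string are illustrative choices.

```r
library(stringr)

long_text <- "This string is moderately long"

# Ordinary truncation to 20 characters, including the default "..." ellipsis.
shortened <- str_trunc(long_text, 20)

# A missing input stays missing.
kept_na <- str_trunc(NA_character_, 20)

# A width of 2 is shorter than the 3-character ellipsis, so this errors;
# tryCatch() turns the error into a marker value for demonstration.
too_narrow <- tryCatch(str_trunc(long_text, 2), error = function(e) "error")

print(shortened)
#> [1] "This string is mo..."
print(kept_na)
#> [1] NA
print(too_narrow)
#> [1] "error"
```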
Long deprecated str_join(), ignore.case() and perl() have now been removed.
str_match_all() now returns NA if an optional group doesn't match (previously it returned ""). This is more consistent with str_match() and other match failures (#134).
In str_replace(), replacement can now be a function that is called once for each match and whose return value is used to replace the match.
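A minimal sketch of passing a function as the replacement. The inputs and the bracket-wrapping helper are hypothetical, chosen only to make the effect visible.

```r
library(stringr)

words_in <- c("video", "cross")

# A built-in function: upper-case the first vowel of each string.
first_vowel_upper <- str_replace(words_in, "[aeiou]", toupper)

# A custom (hypothetical) function: wrap the matched text in angle brackets.
wrap_match <- function(m) str_c("<", m, ">")
first_vowel_wrapped <- str_replace(words_in, "[aeiou]", wrap_match)

print(first_vowel_upper)
#> [1] "vIdeo" "crOss"
print(first_vowel_wrapped)
#> [1] "v<i>deo" "cr<o>ss"
```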
New str_which() mimics grep() (#129).
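To illustrate the correspondence with grep(), a short sketch on an illustrative vector. Note the argument order differs: stringr takes the strings first.

```r
library(stringr)

some_words <- c("why", "video", "cross")

# Both calls return the integer positions of the matching elements.
stringr_index <- str_which(some_words, "[aeiou]")
base_index <- grep("[aeiou]", some_words)

print(stringr_index)
#> [1] 2 3
print(base_index)
#> [1] 2 3
```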
A new vignette (vignette("regular-expressions")) describes the details of the regular expressions supported by stringr.
The main vignette (vignette("stringr")) has been updated to give a high-level overview of the package.
str_order() and str_sort() gain an explicit numeric argument for sorting mixed numbers and strings.
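A sketch of what the numeric argument changes, using a small made-up vector that mixes digits and letters:

```r
library(stringr)

mixed_ids <- c("100a10", "100a5", "2b", "2a")

# Default: digits are compared character by character.
plain_sorted <- str_sort(mixed_ids)

# numeric = TRUE: runs of digits are compared as numbers.
numeric_sorted <- str_sort(mixed_ids, numeric = TRUE)
numeric_order <- str_order(mixed_ids, numeric = TRUE)

print(plain_sorted)
#> [1] "100a10" "100a5"  "2a"     "2b"
print(numeric_sorted)
#> [1] "2a"     "2b"     "100a5"  "100a10"
print(numeric_order)
#> [1] 4 3 2 1
```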
str_replace_all() now throws an error if replacement is not a character vector. If replacement is NA_character_ it replaces the complete string with NA (#124).
All functions that take a locale (e.g. str_to_lower() and str_sort()) default to "en" (English) to ensure that the default is consistent across platforms.
Add sample datasets: fruit, words and sentences.
fixed(), regex(), and coll() now throw an error if you use them with anything other than a plain string (#60). I've clarified that the replacement for perl() is regex() not regexp() (#61). boundary() has improved defaults when splitting on non-word boundaries (#58, @lmullen).
str_detect() now can detect boundaries (by checking for a str_count() > 0) (#120). str_subset() works similarly.
str_extract() and str_extract_all() now work with boundary(). This is particularly useful if you want to extract logical constructs like words or sentences. str_extract_all() respects the simplify argument when used with fixed() matches.
str_subset() now respects custom options for fixed() patterns (#79, @gagolews).
str_replace() and str_replace_all() now behave correctly when a replacement string contains $s, \\\\1, etc. (#83, #99).
str_split() gains a simplify argument to match str_extract_all() etc.
str_view() and str_view_all() create HTML widgets that display regular expression matches (#96).
word() returns NA for indexes greater than number of words (#112).
stringr is now powered by stringi instead of base R regular expressions. This improves Unicode support, and makes most operations considerably faster. If you find stringr inadequate for your string processing needs, I highly recommend looking at stringi in more detail.
stringr gains a vignette, currently a straightforward update of the article that appeared in the R Journal.
str_c() now returns a zero length vector if any of its inputs are zero length vectors. This is consistent with all other functions, and standard R recycling rules. Similarly, using str_c("x", NA) now yields NA. If you want "xNA", use str_replace_na() on the inputs.
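The str_c() rules above can be sketched in a few lines. The variable names are illustrative.

```r
library(stringr)

# A missing piece makes the whole result missing.
with_na <- str_c("x", NA)

# str_replace_na() turns NA into the literal string "NA" before joining.
patched <- str_c("x", str_replace_na(NA_character_))

# A zero-length input gives a zero-length result.
empty_result <- str_c("x", character(0))

print(with_na)
#> [1] NA
print(patched)
#> [1] "xNA"
print(length(empty_result))
#> [1] 0
```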
str_replace_all() gains a convenient syntax for applying multiple pairs of pattern and replacement to the same vector:
input <- c("abc", "def")
str_replace_all(input, c("[ad]" = "!", "[cf]" = "?"))
str_match() now returns NA if an optional group doesn't match (previously it returned ""). This is more consistent with str_extract() and other match failures.
New str_subset() keeps values that match a pattern. It's a convenient wrapper for x[str_detect(x, pattern)] (#21, @jiho).
New str_order() and str_sort() allow you to sort and order strings in a specified locale.
New str_conv() to convert strings from specified encoding to UTF-8.
New modifier boundary() allows you to count, locate and split by character, word, line and sentence boundaries.
The documentation got a lot of love, and very similar functions (e.g. first and all variants) are now documented together. This should hopefully make it easier to locate the function you need.
ignore.case(x) has been deprecated in favour of fixed|regex|coll(x, ignore.case = TRUE), perl(x) has been deprecated in favour of regex(x).
str_join() is deprecated, please use str_c() instead.
fixed path in str_wrap example so it works for more R installations.
remove dependency on plyr
Zero input to str_split_fixed returns 0 row matrix with n columns
Export str_join
new modifier perl that switches to Perl regular expressions
str_match now uses new base function regmatches to extract matches - this should hopefully be faster than my previous pure R algorithm
new str_wrap function which gives strwrap output in a more convenient format
new word function extracts words from a string given a user-defined separator (thanks to suggestion by David Cooper)
str_locate now returns consistent type when matching empty string (thanks to Stavros Macrakis)
new str_count counts number of matches in a string.
str_pad and str_trim receive performance tweaks - for large vectors this should give a speed-up of at least two orders of magnitude
str_length returns NA for invalid multibyte strings
fix small bug in internal recyclable function