qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, citations, person tags, phone numbers, times, and zip codes. Functions that remove/replace are prefixed with rm_. Each of these functions has an extraction counterpart prefixed with ex_.
The qdapRegex package does not aim to compete with string manipulation packages such as stringr or stringi but is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R's own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.
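For instance, a canned pattern can be pulled out and handed to base R or stringi directly. A minimal sketch: grab() is the package's documented lookup function, the "@rm_number" naming follows its "@" dictionary-key convention, and compatibility of the canned pattern with base R's perl = TRUE engine is assumed here.

```r
library(qdapRegex)

## grab() looks up a canned pattern from the default dictionaries by name
pat <- grab("@rm_number")

## The same pattern can then be used with base R...
gsub(pat, "", "It is 72.5 degrees out", perl = TRUE)

## ...or with stringi/stringr equivalents
stringi::stri_extract_all_regex("It is 72.5 degrees out", pat)
```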
The functions in qdapRegex work on a dictionary system. The current implementation defaults to a United States flavor of canned regular expressions. Users may submit proposed region-specific regular expression dictionaries that contain the same fields as the regex_usa data set, or improvements to regular expressions in current dictionaries. Please submit proposed regional regular expression dictionaries via: https://github.com/trinker/qdapRegex/issues
The qdapRegex package serves a dual purpose of being both functional and educational. While the canned regular expressions are useful in and of themselves, they also serve as a platform for understanding regular expressions in the context of meaningful, purposeful usage. In the same way I learned guitar while trying to mimic Eric Clapton, not by learning scales and theory, some folks may enjoy an approach of learning regular expressions in a more pragmatic, experiential interaction. Users are encouraged to look at the regular expressions being used (?regex_usa and ?regex_supplement are the default regular expression dictionaries used by qdapRegex) and unpack how they work. I have found that slow, repeated exposures to information in a purposeful context result in acquired knowledge.
The following regular expressions sites were very helpful to my own regular expression education:
Being able to discuss and ask questions is also important to learning, in this case regular expressions. I have found the following forums extremely helpful for learning about regular expressions:
To download the development version of qdapRegex:
Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/qdapRegex")
You are welcome to:
The following examples demonstrate some of the functionality of qdapRegex.
library(qdapRegex)
w <- c(
    "Hello World (V. Raptor, 1986) bye (Foo, 2012, pp. 1-2)",
    "Narcissism is not dead (Rinker, 2014)",
    "The R Core Team (2014) has many members.",
    paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
        "beautiful. When I grow up, I want to marry R.\""),
    "It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
    "Wickham's (in press) Tidy Data should be out soon.",
    "Rinker's (n.d.) dissertation not so much.",
    "I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
    "Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\"",
    "Silly (Bar, 2014) stuff is what Bar (2014, 2012) said."
)

ex_citation(w)
## [[1]]
## [1] "V. Raptor, 1986" "Foo, 2012"
##
## [[2]]
## [1] "Rinker, 2014"
##
## [[3]]
## [1] "The R Core Team (2014)"
##
## [[4]]
## [1] "Bunn (2005)"
##
## [[5]]
## [1] "Baer, 2005"
##
## [[6]]
## [1] "Wickham's (in press)"
##
## [[7]]
## [1] "Rinker's (n.d.)"
##
## [[8]]
## [1] "Foo, 2012" "Bar, 2014"
##
## [[9]]
## [1] "Uwe Ligges (2007)"
##
## [[10]]
## [1] "Bar, 2014" "Bar (2014, 2012)"
as_count(ex_citation(w))
## Author Year n
## 7 Bar 2014 3
## 6 Foo 2012 2
## 2 Baer 2005 1
## 5 Bar 2012 1
## 3 Bunn 2005 1
## 8 Rinker 2014 1
## 11 Rinker n.d. 1
## 9 The R Core Team 2014 1
## 4 Uwe Ligges 2007 1
## 1 V. Raptor 1986 1
## 10 Wickham in press 1
x <- c(
    "@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)

ex_hash(x)
## [[1]]
## [1] "#rstats" "#ggplot2"
##
## [[2]]
## [1] "#magrittr" "#pipeR" "#rstats"
##
## [[3]]
## [1] "#user2014"
ex_tag(x)
## [[1]]
## [1] "@hadley"
##
## [[2]]
## [1] "@timelyportfolio"
##
## [[3]]
## [1] "@ramnath_vaidya"
ex_url(x)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html"
##
## [[3]]
## [1] "http://ramnathv.github.io/user2014-rcharts/#1"
y <- c(
    "I love chicken [unintelligible]!",
    "Me too! (laughter) It's so good.[interrupting]",
    "Yep it's awesome {reading}.",
    "Agreed. {is so much fun}"
)

ex_bracket(y)
## [[1]]
## [1] "unintelligible"
##
## [[2]]
## [1] "laughter" "interrupting"
##
## [[3]]
## [1] "reading"
##
## [[4]]
## [1] "is so much fun"
ex_curly(y)
## [[1]]
## [1] NA
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "reading"
##
## [[4]]
## [1] "is so much fun"
ex_round(y)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "laughter"
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] NA
ex_square(y)
## [[1]]
## [1] "unintelligible"
##
## [[2]]
## [1] "interrupting"
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] NA
z <- c(
    "-2 is an integer. -4.3 and 3.33 are not.",
    "123,456 is a lot more than -.2",
    "hello world -.q"
)

rm_number(z)
## [1] "is an integer. and are not." "is a lot more than"
## [3] "hello world -.q"
ex_number(z)
## [[1]]
## [1] "-2" "-4.3" "3.33"
##
## [[2]]
## [1] "123,456" "-.2"
##
## [[3]]
## [1] NA
as_numeric(ex_number(z))
## [[1]]
## [1] -2.00 -4.30 3.33
##
## [[2]]
## [1] 123456.0 -0.2
##
## [[3]]
## [1] NA
x <- c(
    "I'm getting 3:04 AM just fine, but...",
    "for 10:47 AM I'm getting 0:47 AM instead.",
    "no time here",
    "Some time has 12:04 with no AM/PM after it",
    "Some time has 12:04 a.m. or the form 1:22 pm"
)

ex_time(x)
## [[1]]
## [1] "3:04"
##
## [[2]]
## [1] "10:47" "0:47"
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] "12:04"
##
## [[5]]
## [1] "12:04" "1:22"
as_time(ex_time(x))
## [[1]]
## [1] "00:03:04.0"
##
## [[2]]
## [1] "00:10:47.0" "00:00:47.0"
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] "00:12:04.0"
##
## [[5]]
## [1] "00:12:04.0" "00:01:22.0"
as_time(ex_time(x), as.POSIXlt = TRUE)
## [[1]]
## [1] "2017-04-09 00:03:04 EDT"
##
## [[2]]
## [1] "2017-04-09 00:10:47 EDT" "2017-04-09 00:00:47 EDT"
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] "2017-04-09 00:12:04 EDT"
##
## [[5]]
## [1] "2017-04-09 00:12:04 EDT" "2017-04-09 00:01:22 EDT"
x <- c(
    "I like 56 dogs!",
    "It's seventy-two feet from the px290.",
    NA,
    "What",
    "that1is2a3way4to5go6.",
    "What do you*% want? For real%; I think you'll see.",
    "Oh some <html>code</html> to remove"
)

rm_non_words(x)
## [1] "I like dogs"
## [2] "It's seventy two feet from the px"
## [3] NA
## [4] "What"
## [5] "that is a way to go"
## [6] "What do you want For real I think you'll see"
## [7] "Oh some html code html to remove"
rm_nchar_words(rm_non_words(x), "1,2")
## [1] "like dogs"
## [2] "It's seventy two feet from the"
## [3] NA
## [4] "What"
## [5] "that way"
## [6] "What you want For real think you'll see"
## [7] "some html code html remove"
Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>
And constructed with the following guidelines:
BUG FIXES
rm_dollar's regex now allows for commas in the dollar portion.

NEW FEATURES
as_count added to convert ex_citation output into counts of citations.

MINOR FEATURES
ex_ prefixed functions added to complement the rm_ functions.

IMPROVEMENTS
grab and functions that use @rm_xxx now work on ex_xxx as well.

CHANGES
explain is fully functional again as http://rick.measham.id.au/paste/explain is again functioning.

NEW FEATURES
rm_ prefixed functions get an extraction counterpart prefixed with ex_. This allows users to call ex_ functions directly rather than using the less convenient form of rm_xxx(extract = TRUE).

BUG FIXES
rm_number did not correctly handle multiple comma separated digits (see issue #17). This behavior has been fixed and a unit test added to ensure proper handling.

BUG FIXES
rm_between did not handle single quotation marks (') as both the left and right boundary when extract = TRUE. Related to issue #13.

NEW FEATURES
rm_transcript_time added to remove the transcript-specific style of time stamp tagging. See http://help-nv10mac.qsrinternational.com/desktop/procedures/import_audio_or_video_transcripts.htm for details.

as_time and as_time2 added for use with rm_time/rm_transcript_time. These convert to the standard HH:MM:SS.OS format and optionally convert to as.POSIXlt. The former outputs a list of vectors of times while the latter wraps as_time with unlist.
MINOR FEATURES
except_first added to the regex_supplement dictionary to provide a means to remove all occurrences of a character except the first appearance. Regex from: http://stackoverflow.com/a/31458261/1000343
rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed by default (escaped). This did not allow for the powerful use of a regular expression for left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to work with regular expression boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: http://stackoverflow.com/q/31623069/1000343
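The difference can be sketched as follows, using the ex_ extraction form. The character-class boundaries here are hypothetical illustrations; exact matches depend on the installed version.

```r
library(qdapRegex)

## fixed = TRUE (the default): boundaries are taken literally
ex_between("a (b) c [d] e", "(", ")")

## fixed = FALSE: boundaries are themselves regular expressions,
## here a character class matching "(" or "[" on the left
## and ")" or "]" on the right
ex_between("a (b) c [d] e", "[(\\[]", "[)\\]]", fixed = FALSE)
```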
CHANGES
The word_boundary, word_boundary_left, and word_boundary_right regexes in the regex_supplement dictionary did not include apostrophes as a viable word character. Apostrophes are now included as a word character.
explain no longer prints the regular expression explanation to the command line. Instead, the link to http://www.regexper.com is printed. This change is because http://rick.measham.id.au/paste/explain no longer appears to be working. The text explanation functionality will return if the website becomes operational again or if a suitable substitute can be found.
BUG FIXES
rm_number could not extract consecutive digits that aren't comma separated without splitting them into multiple strings. For example, "12345" became "123" "45". Also, 444,44 will not be removed/extracted as it is not a valid comma-separated number. These behaviors have been corrected and the unit tests now include these cases. Thanks to Jason Gray for the rework of the regex. It is simpler and more accurate.
rm_between did not handle quotation marks (") as both the left and right boundary when extract = TRUE. Bug reported by Tori Shannon, http://stackoverflow.com/q/31119989/1000343, and addressed by Jason Gray. See issue #13.
NEW FEATURES
as_numeric and as_numeric2 added for use with rm_number. These are wrappers for as.numeric(gsub(",", "", x)). The former removes commas and converts a list of vectors of strings to numeric. The latter wraps as_numeric with unlist.
rm_non_words added to remove any character that isn't a letter, apostrophe, or single space.
The class extracted has been added and is the output of a rm_xxx function when extract = TRUE. This allows the c.extracted function to easily turn the list output into a character vector.
c.extracted added to provide a quick unlist method for lists of class extracted. This is less typing than unlist for an approach that is used often.
bind_or added as a means of quickly wrapping multiple sub-expression elements with left/right boundaries and then concatenating/joining the grouped strings with the regular expression or statement ("|").
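A quick sketch of the intent. The exact boundary regex that bind_or emits may differ by version, so no output is asserted here.

```r
library(qdapRegex)

## Wrap each element in word boundaries and join with "|"
p <- bind_or("cat", "dog")
p

## The result is a single alternation pattern usable anywhere a regex is
grepl(p, c("the cat sat", "dogs bark", "catalog"))
```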
MINOR FEATURES
punctuation added to the regex_supplement dictionary for easy negation of the [:punct:] class.

BUG FIXES
explain used message to print to the console. explain now returns an object of the class explain with its own print method which uses cat rather than message. Additionally, the characters + and & were not handled correctly; this has been corrected.
Documentation for TC contained an incomplete sentence. It was as follows: "TC utilizes additional rules for capitalization beyond stri_trans_totitle that includes..." (found by rmsharp). This has been corrected. See issue #8.
cheat (and the accompanying regex_cheat dictionary) contained misspellings of the words greedy and beginning. This has been corrected.
rm_number incorrectly handled numbers containing leading or trailing zeros. See issue #9.
rm_caps_phrases could only extract/remove up to two "words" worth of capital letter phrases at a time. See issue #11.
NEW FEATURES
%+% binary operator version of pastex(x, y, sep = "") added to join regular expressions together.
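A sketch of the operator, treating %+% as simple concatenation per its stated definition as pastex(x, y, sep = ""):

```r
library(qdapRegex)

## Join two pattern fragments with no separator
"foo" %+% "\\d+"

## Compare with pastex(), which by default joins with "|"
pastex("foo", "\\d+")
```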
group_or added as a means of quickly wrapping multiple sub-expression elements with grouping parentheses and then concatenating/joining the grouped strings with the regular expression or statement ("|").
rm_repeated_characters added for removing/extracting/replacing words with repeated characters (each repeated > 2 times). Regex pattern comes from StackOverflow's vks (http://stackoverflow.com/a/29438461/1000343).
rm_repeated_phrases added for removing/extracting/replacing repeating phrases (> 2 times). Regex pattern comes from StackOverflow's BrodieG (http://stackoverflow.com/a/28786617/1000343).
rm_repeated_words added for removing/extracting/replacing repeating words (> 2 times).
MINOR FEATURES
run_split regex added to the regex_supplement dictionary to split runs into chunks.

IMPROVEMENTS
Regular expression dictionaries (e.g., regex_usa and regex_supplement) are now managed with the regexr package. This enables cleaner updating of the regular expressions with an easier-to-read structure. Longer files will be stored in this format. Files located: https://github.com/trinker/qdapRegex/tree/master/inst/regex_scripts
rm_caps_phrase has a new regular expression that is more accurate and does not pull trailing white space.
BUG FIXES
pastex would throw a warning on a vector (e.g., pastex(letters)). This has been fixed.
youtube_id was documented under qdap_usa rather than qdap_supplement and contained an invalid hyperlink. This has been fixed.
rm_citation contained a bug that prevented it from operating on citations with a comma in multiple authors before the and/& sign. See issue #4.
NEW FEATURES
is.regex added as a logical check of a regular expression's validity (conforms to R's regular expression rules).
rm_postal_code added for removing/extracting/replacing U.S. postal codes.
Case wrapper functions, TC (title case), U (upper case), and L (lower case), added for convenient case manipulation.
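For example (a sketch; TC's exact small-word rules are version dependent, so no outputs are asserted):

```r
library(qdapRegex)

U("the quick brown fox")   # upper case the string
L("THE QUICK BROWN FOX")   # lower case the string
TC("the quick brown fox")  # title case, with rules for small words
```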
group function added to allow for convenient wrapping of grouping parentheses around regular expressions.
rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.
regex_cheat data set and cheat function added to act as a quick reference for common regex task operations such as lookaheads.
rm_caps_phrase added to supplement rm_caps, extending the search to phrases.
explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. Also takes named regular expressions from the regex_usa or other supplied dictionary.
MINOR FEATURES
last_occurrence regex added to the regex_supplement dictionary to find the last occurrence of a delimiter.
word_boundary, word_boundary_left, and word_boundary_right added to the regex_supplement dictionary to provide a true word boundary. Regexes adapted from: http://www.rexegg.com/regex-boundaries.html#real-word-boundary
rm_time2 regex added to the regex_usa dictionary to find time + AM/PM.
IMPROVEMENTS
The regex_usa dictionary regular expressions rm_hash, rm_tag, rm_tag2, and rm_between pick up grouping that allows for replacement of individual sections of the substring. See ?rm_hash and ?rm_tag for examples.
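The grouping means a replacement string can target one captured piece of the match rather than the whole thing. A hedged sketch: the group number \\3 is an assumption here, so consult ?rm_hash for the actual grouping in your installed version.

```r
library(qdapRegex)

x <- "I love #rstats and #ggplot2"

## Replace each hash tag with just its captured text wrapped in markers;
## \\3 is assumed to be the capture group holding the tag text
rm_hash(x, replacement = "<<\\3>>")
```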
pastex picks up a sep argument to allow the user to choose what string is used to separate the collapsed expressions.
rm_citation, rm_citation2, and rm_citation3 now attempt to include last names that contain the lower case particles von, van, de, da, and du.
CRAN fix for oldrel Windows. Updated to R version 3.1.0 in Description file.
NEW FEATURES
bind added as a convenience function to add a left and right boundary to each element of a character vector.

First CRAN Release
NEW FEATURES
rm_citation added for removing/extracting/replacing APA 6 style in-text citations.
rm_white and the accompanying family of rm_white functions added to remove white space.
rm_non_ascii added to remove non-ASCII characters from a string.
around_ added to extract word(s) around a given point.
pages and pages2 added to the regex_supplement data set for removing/extracting/validating page numbers.
IMPROVEMENTS
The rm_XXX family of functions now use stringi::stri_extract_all_regex as this approach is much faster than the regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE)) approach.