Regular Expression Removal, Extraction, and Replacement Tools

A collection of regular expression tools associated with the 'qdap' package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, citations, person tags, phone numbers, times, and zip codes.


Project Status: Active - The project has reached a stable, usable state and is being actively developed.

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, citations, person tags, phone numbers, times, and zip codes. Functions that remove/replace are prefixed with rm_. Each of these functions has an extraction counterpart prefixed with ex_.

The qdapRegex package does not aim to compete with string manipulation packages such as stringr or stringi. Instead, it provides access to canned, common regular expression patterns that can be used within qdapRegex, with R's own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.
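As a minimal sketch of using a pattern with R's own regular expression functions, the block below extracts and removes hash tags with base R only. The pattern hash_pattern is a simplified stand-in written for this illustration; it is not the expression stored in the qdapRegex dictionary.

```r
# Assumption: a simplified hash tag pattern, not the one shipped in regex_usa.
hash_pattern <- "#\\w+"

x <- c("@hadley I like #rstats for #ggplot2 work.", "no tags here")

# Extraction with base R: one list element per input string.
extracted <- regmatches(x, gregexpr(hash_pattern, x, perl = TRUE))

# Removal with base R, then collapse the white space left behind.
removed <- trimws(gsub("\\s+", " ", gsub(hash_pattern, "", x, perl = TRUE)))

print(extracted)
print(removed)
```

The same pattern string could be handed to any other regular expression function that accepts Perl-style syntax.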

The functions in qdapRegex work on a dictionary system. The current implementation defaults to a United States flavor of canned regular expressions. Users may submit proposed region specific regular expression dictionaries that contain the same fields as the regex_usa data set or improvements to regular expressions in current dictionaries. Please submit proposed regional regular expression dictionaries via: https://github.com/trinker/qdapRegex/issues
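The dictionary idea can be sketched roughly as a named collection of patterns plus a lookup step. Everything below (regex_demo, lookup_pattern, and the two patterns) is hypothetical and only illustrates the concept; the real regex_usa data set has its own fields and more careful expressions.

```r
# Hypothetical two-entry dictionary, standing in for a regional data set.
regex_demo <- list(
    rm_hash = "#\\w+",
    rm_tag  = "@\\w+"
)

# Resolve "@name" against a dictionary; anything else is treated as a
# regular expression supplied directly by the user.
lookup_pattern <- function(pattern, dictionary = regex_demo) {
    if (!startsWith(pattern, "@")) return(pattern)
    key <- substring(pattern, 2)
    if (is.null(dictionary[[key]])) stop("no dictionary entry named ", key)
    dictionary[[key]]
}

tag_pattern <- lookup_pattern("@rm_tag")   # looked up in regex_demo
raw_pattern <- lookup_pattern("[0-9]+")    # passed through unchanged
print(tag_pattern)
print(raw_pattern)
```

Under this sketch, a region-specific dictionary would simply be another list with the same names passed as the dictionary argument.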

Educational

The qdapRegex package serves a dual purpose of being both functional and educational. While the canned regular expressions are useful in and of themselves, they also serve as a platform for understanding regular expressions in the context of meaningful, purposeful usage. In the same way I learned guitar while trying to mimic Eric Clapton, not by learning scales and theory, some folks may enjoy learning regular expressions through a more pragmatic, experiential approach. Users are encouraged to look at the regular expressions being used (?regex_usa and ?regex_supplement are the default regular expression dictionaries used by qdapRegex) and unpack how they work. I have found that slow, repeated exposure to information in a purposeful context results in acquired knowledge.

The following regular expressions sites were very helpful to my own regular expression education:

  1. Regular-Expression.info
  2. Rex Egg
  3. Regular Expressions as used in R
  4. Debuggex (Visualizing Regex)

Being able to discuss and ask questions is also important to learning...in this case regular expressions. I have found the following forums extremely helpful to learning about regular expressions:

  1. Talk Stats + Posting Guidelines
  2. stackoverflow + Posting Guidelines

Installation

To download the development version of qdapRegex:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/qdapRegex")

Contact

You are welcome to submit suggestions and bug reports at: https://github.com/trinker/qdapRegex/issues

Examples

The following examples demonstrate some of the functionality of qdapRegex.

library(qdapRegex)

Extract Citations

w <- c("Hello World (V. Raptor, 1986) bye (Foo, 2012, pp. 1-2)",
    "Narcissism is not dead (Rinker, 2014)",
    "The R Core Team (2014) has many members.",
    paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
        "beautiful. When I grow up, I want to marry R.\""),
    "It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
    "Wickham's (in press) Tidy Data should be out soon.",
    "Rinker's (n.d.) dissertation not so much.",
    "I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
    "Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\"",
    "Silly (Bar, 2014) stuff is what Bar (2014, 2012) said."
)
 
ex_citation(w)
## [[1]]
## [1] "V. Raptor, 1986" "Foo, 2012"      
## 
## [[2]]
## [1] "Rinker, 2014"
## 
## [[3]]
## [1] "The R Core Team (2014)"
## 
## [[4]]
## [1] "Bunn (2005)"
## 
## [[5]]
## [1] "Baer, 2005"
## 
## [[6]]
## [1] "Wickham's (in press)"
## 
## [[7]]
## [1] "Rinker's (n.d.)"
## 
## [[8]]
## [1] "Foo, 2012" "Bar, 2014"
## 
## [[9]]
## [1] "Uwe Ligges (2007)"
## 
## [[10]]
## [1] "Bar, 2014"        "Bar (2014, 2012)"
as_count(ex_citation(w))
##             Author     Year n
## 7              Bar     2014 3
## 6              Foo     2012 2
## 2             Baer     2005 1
## 5              Bar     2012 1
## 3             Bunn     2005 1
## 8           Rinker     2014 1
## 11          Rinker     n.d. 1
## 9  The R Core Team     2014 1
## 4       Uwe Ligges     2007 1
## 1        V. Raptor     1986 1
## 10         Wickham in press 1

Extract Twitter Hash Tags, Name Tags, & URLs

x <- c("@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats:
        http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization
        presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)
 
ex_hash(x)
## [[1]]
## [1] "#rstats"  "#ggplot2"
## 
## [[2]]
## [1] "#magrittr" "#pipeR"    "#rstats"  
## 
## [[3]]
## [1] "#user2014"
ex_tag(x)
## [[1]]
## [1] "@hadley"
## 
## [[2]]
## [1] "@timelyportfolio"
## 
## [[3]]
## [1] "@ramnath_vaidya"
ex_url(x)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] "http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html"
## 
## [[3]]
## [1] "http://ramnathv.github.io/user2014-rcharts/#1"

Extract Bracketed Text

y <- c("I love chicken [unintelligible]!", 
    "Me too! (laughter) It's so good.[interrupting]",
    "Yep it's awesome {reading}.", "Agreed. {is so much fun}")
 
ex_bracket(y)
## [[1]]
## [1] "unintelligible"
## 
## [[2]]
## [1] "laughter"     "interrupting"
## 
## [[3]]
## [1] "reading"
## 
## [[4]]
## [1] "is so much fun"
ex_curly(y)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] "reading"
## 
## [[4]]
## [1] "is so much fun"
ex_round(y)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] "laughter"
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] NA
ex_square(y)
## [[1]]
## [1] "unintelligible"
## 
## [[2]]
## [1] "interrupting"
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] NA

Extract Numbers

z <- c("-2 is an integer.  -4.3 and 3.33 are not.",
    "123,456 is a lot more than -.2",
    "hello world -.q")
rm_number(z)
## [1] "is an integer. and are not." "is a lot more than"         
## [3] "hello world -.q"
ex_number(z)
## [[1]]
## [1] "-2"   "-4.3" "3.33"
## 
## [[2]]
## [1] "123,456" "-.2"    
## 
## [[3]]
## [1] NA
as_numeric(ex_number(z))
## [[1]]
## [1] -2.00 -4.30  3.33
## 
## [[2]]
## [1] 123456.0     -0.2
## 
## [[3]]
## [1] NA

Extract Times

x <- c(
    "I'm getting 3:04 AM just fine, but...",
    "for 10:47 AM I'm getting 0:47 AM instead.",
    "no time here",
    "Some time has 12:04 with no AM/PM after it",
    "Some time has 12:04 a.m. or the form 1:22 pm"
)
ex_time(x)
## [[1]]
## [1] "3:04"
## 
## [[2]]
## [1] "10:47" "0:47" 
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] "12:04"
## 
## [[5]]
## [1] "12:04" "1:22"
as_time(ex_time(x))
## [[1]]
## [1] "00:03:04.0"
## 
## [[2]]
## [1] "00:10:47.0" "00:00:47.0"
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] "00:12:04.0"
## 
## [[5]]
## [1] "00:12:04.0" "00:01:22.0"
as_time(ex_time(x), as.POSIXlt = TRUE)
## [[1]]
## [1] "2017-04-09 00:03:04 EDT"
## 
## [[2]]
## [1] "2017-04-09 00:10:47 EDT" "2017-04-09 00:00:47 EDT"
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] "2017-04-09 00:12:04 EDT"
## 
## [[5]]
## [1] "2017-04-09 00:12:04 EDT" "2017-04-09 00:01:22 EDT"

Remove Non-Words & N Character Words

x <- c(
    "I like 56 dogs!",
    "It's seventy-two feet from the px290.",
    NA,
    "What",
    "that1is2a3way4to5go6.",
    "What do you*% want?  For real%; I think you'll see.",
    "Oh some <html>code</html> to remove"
)
 
rm_non_words(x)
## [1] "I like dogs"                                 
## [2] "It's seventy two feet from the px"           
## [3] NA                                            
## [4] "What"                                        
## [5] "that is a way to go"                         
## [6] "What do you want For real I think you'll see"
## [7] "Oh some html code html to remove"
rm_nchar_words(rm_non_words(x), "1,2")
## [1] "like dogs"                              
## [2] "It's seventy two feet from the"         
## [3] NA                                       
## [4] "What"                                   
## [5] "that way"                               
## [6] "What you want For real think you'll see"
## [7] "some html code html remove"

News

Versioning

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bump the minor (and reset the patch)
  • Bug fixes and misc changes bump the patch

qdapRegex 0.7.0 - 0.7.2

BUG FIXES

  • rm_dollar's regex now allows for commas in the dollar portion.

NEW FEATURES

  • as_count added to convert ex_citation into counts of citations.

MINOR FEATURES

  • ex_ added to complement the rm_ function.

IMPROVEMENTS

  • grab and functions that use @rm_xxx now work on ex_xxx as well.

qdapRegex 0.6.0

NEW FEATURES

  • rm_ prefixed functions get an extraction counterpart prefixed with ex_.
    This means users can use ex_ functions directly without using the rm_ form in the less convenient form of rm_xxx(extract = TRUE).
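The relationship between the two forms can be sketched as a thin wrapper. The functions rm_demo and ex_demo below are hypothetical stand-ins built on base R, not the package source; they only show how an extraction counterpart can fix extract = TRUE.

```r
# Hypothetical stand-ins for an rm_xxx / ex_xxx pair.
rm_demo <- function(x, pattern = "[0-9]+", extract = FALSE) {
    if (extract) {
        # Extraction: return every match, one list element per string.
        return(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
    }
    # Removal: drop the matches and tidy the remaining white space.
    trimws(gsub("\\s+", " ", gsub(pattern, "", x, perl = TRUE)))
}

# The extraction counterpart only fixes extract = TRUE.
ex_demo <- function(x, ...) rm_demo(x, ..., extract = TRUE)

x <- "Call 555 before 9"
print(rm_demo(x))
print(ex_demo(x))
```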

qdapRegex 0.5.1

BUG FIXES

  • rm_number incorrectly did not handle multiple comma separated digits (see issue #17). This behavior has been fixed and a unit test added to ensure proper handling.

qdapRegex 0.4.1-0.5.0

BUG FIXES

  • rm_between did not handle single quotation marks (') as both the left and right boundary when extract = TRUE. Related to issue #13

MINOR FEATURES

  • except_first added to regex_supplement dictionary to provide a means to remove all occurrences of a character except the first appearance. Regex from: http://stackoverflow.com/a/31458261/1000343

  • rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed by default (escaped). This did not allow for the powerful use of a regular expression for left/right boundaries. The fixed = TRUE behavior is still the default but users can now set fixed = FALSE to work with regular expression boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: http://stackoverflow.com/q/31623069/1000343
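The difference between fixed and regular expression boundaries can be sketched in base R. The helper between_demo is hypothetical and is not how the package implements the feature; it assumes the boundaries contain no capture groups of their own and, when fixed = TRUE, no literal \E sequence.

```r
# Hypothetical helper: extract the text between a left and right boundary.
between_demo <- function(x, left, right, fixed = TRUE) {
    if (fixed) {
        # \Q...\E makes PCRE treat the boundary as literal text.
        left  <- paste0("\\Q", left, "\\E")
        right <- paste0("\\Q", right, "\\E")
    }
    pattern <- paste0(left, "(.*?)", right)
    hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
    # Keep only the captured text between the boundaries.
    lapply(hits, function(h) sub(pattern, "\\1", h, perl = TRUE))
}

x <- "a (b) c [d]"
literal_hits <- between_demo(x, "(", ")")                          # literal parentheses
regex_hits   <- between_demo(x, "[(\\[]", "[)\\]]", fixed = FALSE) # either bracket type
print(literal_hits)
print(regex_hits)
```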

CHANGES

  • word_boundary, word_boundary_left, word_boundary_right regexes in the regex_supplement did not include apostrophes as a viable word character. Apostrophes are now included as a word character.

  • explain no longer prints the regular expression explanation to the command line. Instead the link to http://www.regexper.com is printed. This change is because http://rick.measham.id.au/paste/explain no longer appears to be working. The text explanation functionality will return if the website becomes operational again or if a suitable substitute can be found.

qdapRegex 0.4.0

BUG FIXES

  • rm_number split consecutive digits that are not comma separated into multiple strings instead of extracting them whole. For example "12345" became "123" "45". Also, 444,44 will not be removed/extracted as it is not a valid comma separated number. These behaviors have been corrected and the unit tests now include these cases. Thanks to Jason Gray for the rework of the regex. It is simpler and more accurate.

  • rm_between did not handle quotation marks (") as both the left and right boundary when extract = TRUE. Bug reported by Tori Shannon, http://stackoverflow.com/q/31119989/1000343, and addressed by Jason Gray. See issue #13

NEW FEATURES

  • as_numeric & as_numeric2 added for use with rm_number. These are wrappers for as.numeric(gsub(",", "", x)). The former removes commas and converts a list of vectors of strings to numeric. The latter wraps as_numeric with unlist.

  • rm_non_words added to remove any character that isn't a letter, apostrophe, or single space.

  • The class extracted has been added and is the output of a rm_xxx function when extract = TRUE. This allows for the c.extracted function to easily turn the list output into a character vector.

  • c.extracted added to provide a quick unlist method for lists of class extracted. This is less typing than unlist for an approach that is used often.

  • bind_or added as a means of quickly wrapping multiple sub-expression elements with left/right boundaries and then concatenating/joining the grouped strings with a regular expression or statement ("|").
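The as_numeric wrappers described in this list can be sketched directly from the stated formula, as.numeric(gsub(",", "", x)). The names as_numeric_demo and as_numeric2_demo are hypothetical, chosen so they do not collide with the package's own functions.

```r
# Hypothetical re-creation of the described behavior, not the package source.
as_numeric_demo <- function(x) {
    # Strip commas, then convert each vector in the list to numeric.
    lapply(x, function(v) as.numeric(gsub(",", "", v)))
}

# The second form flattens the list into a single numeric vector.
as_numeric2_demo <- function(x) unlist(as_numeric_demo(x))

nums <- list(c("-2", "-4.3", "3.33"), c("123,456", "-.2"), NA_character_)
as_list <- as_numeric_demo(nums)
as_flat <- as_numeric2_demo(nums)
print(as_list)
print(as_flat)
```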

MINOR FEATURES

  • punctuation added to regex_supplement dictionary for easy negation of [:punct:] class.

qdapRegex 0.2.1 - 0.3.2

BUG FIXES

  • explain used message to print to the console. explain now returns an object of the class explain with its own print method which uses cat rather than message. Additionally, the characters + and & were not handled correctly; this has been corrected.

  • Documentation for TC contained an incomplete sentence: "TC utilizes additional rules for capitalization beyond stri_trans_totitle that includes..." (found by rmsharp). This has been corrected. See issue #8

  • cheat (and accompanying regex_cheat dictionary) contained misspellings in the words greedy and beginning. This has been corrected.

  • rm_number incorrectly handled numbers containing leading or trailing zeros. See issue #9

  • rm_caps_phrase could only extract/remove up to two "words" worth of capital letter phrases at a time. See issue #11

NEW FEATURES

  • %+% binary operator version of pastex(x, y, sep = "") added to join regular expressions together.

  • group_or added as a means of quickly wrapping multiple sub-expression elements with grouping parentheses and then concatenating/joining the grouped strings with a regular expression or statement ("|").

  • rm_repeated_characters added for removing/extracting/replacing words with repeated characters (each repeated > 2 times). Regex pattern comes from: StackOverflow's vks (http://stackoverflow.com/a/29438461/1000343).

  • rm_repeated_phrases added for removing/extracting/replacing repeating phrases (> 2 times). Regex pattern comes from: StackOverflow's BrodieG (http://stackoverflow.com/a/28786617/1000343).

  • rm_repeated_words added for removing/extracting/replacing repeating words (> 2 times).
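The joining and grouping helpers described in this list can be sketched with paste0. The names %j%, group_demo, and group_or_demo are hypothetical and only mirror the descriptions; the package's own functions may format their output differently.

```r
# Hypothetical helpers mirroring the descriptions above.
`%j%` <- function(x, y) paste0(x, y)            # join two expressions, sep = ""
group_demo <- function(x) paste0("(", x, ")")   # wrap in grouping parentheses
group_or_demo <- function(...) {
    # Group each sub-expression, then join the groups with "|".
    paste(group_demo(c(...)), collapse = "|")
}

alternatives <- group_or_demo("cat", "dog")

# Group the whole alternation so the word boundaries apply to both branches.
pattern <- "\\b" %j% group_demo(alternatives) %j% "\\b"

matches <- grepl(pattern, c("a cat sat", "hotdogs", "dog"), perl = TRUE)
print(alternatives)
print(pattern)
print(matches)
```

Without the outer group, the alternation would split the expression into a left-bounded branch and a right-bounded branch, which is rarely what is intended.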

MINOR FEATURES

  • run_split regex added to the regex_supplement dictionary to split runs into chunks.

IMPROVEMENTS

  • Regular Expression Dictionaries (e.g., regex_usa and regex_supplement) are now managed with the regexr package. This enables cleaner updating of the regular expressions with easier to read structure. Longer files will be stored in this format. Files located: https://github.com/trinker/qdapRegex/tree/master/inst/regex_scripts

  • rm_caps_phrase has a new regular expression that is more accurate and does not pull trailing white space.

qdapRegex 0.1.3 - 0.2.0

BUG FIXES

  • pastex would throw a warning on a vector (e.g., pastex(letters)). This has been fixed.

  • youtube_id was documented under qdap_usa rather than qdap_supplement and contained an invalid hyperlink. This has been fixed.

  • rm_citation contained a bug that would not operate on citations with a comma in multiple authors before the and/& sign. See issue #4

NEW FEATURES

  • is.regex added as a logical check of a regular expression's validity (conforms to R's regular expression rules).

  • rm_postal_code added for removing/extracting/replacing U.S. postal codes.

  • Case wrapper functions, TC (title case), U (upper case), and L (lower case) added for convenient case manipulation.

  • group function added to allow for convenient wrapping of grouping parenthesis around regular expressions.

  • rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.

  • regex_cheat data set and cheat function added to act as a quick reference for common regex task operations such as lookaheads.

  • rm_caps_phrase added to supplement rm_caps, extending the search to phrases.

  • explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. Also takes named regular expressions from the regex_usa or other supplied dictionary.
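A validity check of the kind described for is.regex can be sketched by attempting to use the pattern and catching the error. The function is_regex_demo is a hypothetical illustration, and it tests against R's Perl-compatible engine only.

```r
# Hypothetical validity check: TRUE if the pattern compiles, FALSE otherwise.
is_regex_demo <- function(pattern) {
    tryCatch({
        # An invalid pattern raises an error (and a warning, silenced here).
        suppressWarnings(regexpr(pattern, "", perl = TRUE))
        TRUE
    }, error = function(e) FALSE)
}

valid   <- is_regex_demo("(a|b)")
invalid <- is_regex_demo("(a|b")    # unbalanced parenthesis
print(c(valid = valid, invalid = invalid))
```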

MINOR FEATURES

  • last_occurrence regex added to the regex_supplement dictionary to find the last occurrence of a delimiter.

  • word_boundary, word_boundary_left, and word_boundary_right added to regex_supplement dictionary to provide a true word boundary. Regexes adapted from: http://www.rexegg.com/regex-boundaries.html#real-word-boundary

  • rm_time2 regex added to the regex_usa dictionary to find time + AM/PM
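One common way to target the last occurrence of a delimiter is a negative lookahead. The sketch below is an illustration of that general technique with a comma delimiter; it is an assumption and not necessarily the expression stored as last_occurrence in the dictionary.

```r
# Match a comma only when no further comma follows it.
last_comma <- ",(?!.*,)"

x <- "alpha,beta,gamma"

# Replace only the final comma.
replaced <- sub(last_comma, ";", x, perl = TRUE)

# Split the string at the final comma.
pieces <- strsplit(x, last_comma, perl = TRUE)[[1]]

print(replaced)
print(pieces)
```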

IMPROVEMENTS

  • The regex_usa dictionary regular expressions: rm_hash, rm_tag, rm_tag2 and rm_between pick up grouping that allows for replacement of individual sections of the substring. See ?rm_hash and ?rm_tag for examples.

  • pastex picks up a sep argument to allow the user to choose what string is used to separate the collapsed expressions.

  • rm_citation, rm_citation2, and rm_citation3 now attempt to include last names that contain the lower case particles: von, van, de, da, and du.

qdapRegex 0.1.2

CRAN fix for oldrel Windows. Updated to R version 3.1.0 in Description file.

NEW FEATURES

  • bind added as a convenience function to add a left and right boundary to each element of a character vector.

qdapRegex 0.1.1

First CRAN Release

NEW FEATURES

  • rm_citation added for removing/extracting/replacing APA 6 style in-text citations.

  • rm_white and accompanying family of rm_white functions added to remove white space.

  • rm_non_ascii added to remove non-ASCII characters from a string.

  • around_ added to extract word(s) around a given point.

  • pages and pages2 added to the regex_supplement data set for removing/extracting/validating page numbers.

IMPROVEMENTS

  • rm_XXX family of functions now use stringi::stri_extract_all_regex as this approach is much faster than the regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE)) approach.
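The two extraction approaches named above can be placed side by side. This is a small sketch with a made-up digit pattern; the stringi call runs only if that package is installed, and no timing is attempted here.

```r
x <- c("I like 56 dogs and 7 cats", "none here")
pattern <- "[0-9]+"   # illustrative pattern, not taken from the dictionary

# Base R approach: returns character(0) when a string has no match.
base_hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
print(base_hits)

# stringi approach: by default returns NA when a string has no match.
if (requireNamespace("stringi", quietly = TRUE)) {
    stringi_hits <- stringi::stri_extract_all_regex(x, pattern)
    print(stringi_hits)
}
```

Note the different treatment of strings without a match, which matters when the results are post-processed.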

qdapRegex 0.0.1 - 0.2.0

This package is a collection of regex tools associated with the qdap package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, person tags, phone numbers, times, and zip codes.

install.packages("qdapRegex")

0.7.2 by Tyler Rinker, 2 years ago


http://trinker.github.com/qdapRegex/


Report a bug at http://github.com/trinker/qdapRegex/issues


Browse source code at https://github.com/cran/qdapRegex


Authors: Jason Gray [ctb] , Tyler Rinker [aut, cre]



GPL-2 license


Imports stringi

Suggests testthat


Imported by edgar, momentuHMM, rdhs, textclean, vegtable.

Depended on by qdap.

