Convenient Base R String Handling

Base R already ships with string handling capabilities 'out-of-the-box' but lacks streamlined function names and workflow. The 'stringi' ('stringr') package, by contrast, has well-named functions, extensive Unicode support and allows for a streamlined workflow. However, it adds dependencies, and regular expression interpretation might differ between base R functions and 'stringi' functions. This package aims to serve the use case where dependencies are unwanted but streamlined text processing is still needed. The package's functions are based solely on wrapping base R functions in 'stringr'/'stringi'-like function names. Along the way it adds one or two extra functions and, last but not least, provides all functions as generics, so that methods can be added for text structures other than plain character vectors.


Status

first stable release

Version

0.1.12

License

MIT + file LICENSE
Peter Meissner [email protected] [aut, cre]

Description

This package aims at:

  • no dependencies except what comes with base R (i.e. no compilation shall be needed)
  • relying, for better or worse, on base R text handling mechanisms, which means getting the base R regular expression engine including its Perl mode
  • writing text handling functions as generics so that methods for all kinds of structures containing text can be added
  • providing functions that allow for different character encodings
  • but rigorously defaulting to UTF-8 as expected input and default output, thereby enhancing cross-platform cooperation
  • adding more power to basic text handling functions through additional options (e.g. the text_read() function can read in and tokenize text in one function call)
  • insisting on a flat interface, meaning that all functionality should come from functions and plain parameters (in contrast to e.g. parameters that need specialised functions or function outputs to work)
  • adding further general text handling tools if the added function(ality) serves a general enough purpose
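
The generics aim from the list above can be illustrated with a minimal sketch in plain base R. The name detect_sketch and its methods are hypothetical and are not part of stringb; they only show how a generic that wraps grepl() can gain a method for another text structure, here a list.

```r
# Hypothetical generic for illustration; not stringb's actual implementation.
detect_sketch <- function(x, pattern, ...) {
  UseMethod("detect_sketch")
}

# Method for character vectors: a thin wrapper around base R grepl().
detect_sketch.character <- function(x, pattern, ...) {
  grepl(pattern, x, ...)
}

# Method for lists: apply the generic to every element and keep the names.
detect_sketch.list <- function(x, pattern, ...) {
  lapply(x, detect_sketch, pattern = pattern, ...)
}

# Plain character vector
detect_sketch(c("Friday", "Crusoe"), "friday", ignore.case = TRUE)

# List of character vectors, handled by the list method
detect_sketch(list(a = "Friday", b = c("x", "friday")), "friday", ignore.case = TRUE)
```

Because dispatch happens on the class of the first argument, supporting a further structure only requires one more method, while the function name and its plain parameters stay the same.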

This package does not aim at:

  • being fast (fast is good but will not be traded for the aims listed above - stringi might be your friend here)
  • being ultimately compatible (compatible is good but again will not be traded for the aims listed above - again stringi might be your solution in that case)

Contribution

Note that this package uses a Contributor Code of Conduct. By participating in this project you agree to abide by its terms: http://contributor-covenant.org/version/1/0/0/ (basically, this should be a place where people treat each other respectfully and nicely, because it's simply more fun for everybody that way)

Contributions are very much welcome.

Installation

devtools::install_github("petermeissner/stringb")
library(stringb)

Function list

library(stringb)
objects("package:stringb")
##  [4] "text_collapse"           "text_count"              "text_delete"            
##  [7] "text_detect"             "text_dup"                "text_eval"              
## [10] "text_extract"            "text_extract_all"        "text_extract_group"     
## [13] "text_extract_group_all"  "text_filter"             "text_grep"              
## [16] "text_grepl"              "text_grepv"              "text_length"            
## [19] "text_locate"             "text_locate_all"         "text_locate_group"      
## [22] "text_nchar"              "text_pad"                "text_read"              
## [25] "text_rep"                "text_replace"            "text_replace_all"       
## [28] "text_replace_group"      "text_replace_locates"    "text_show"              
## [31] "text_snippet"            "text_split"              "text_split_n"           
## [34] "text_sub"                "text_subset"             "text_tokenize"          
## [37] "text_tokenize_lines"     "text_tokenize_sentences" "text_tokenize_words"    
## [40] "text_to_lower"           "text_to_title_case"      "text_to_upper"          
## [43] "text_trim"               "text_which"              "text_which_value"       
## [46] "text_wrap"               "text_write"

Example Usage

library(stringb)
(test_file <- stringb:::test_file("rc_1_ch1.txt")) # just a file accompanying the package to test things
## [1] "/home/peter/R/x86_64-pc-linux-gnu-library/3.3/stringb/testfiles/rc_1_ch1.txt"
text_read( test_file, tokenize = "\\W", n=20)[67:79]
##  [1] "Project"   "Gutenberg" "License"   "included"  "with"      "this"      "eBook"     "or"       
##  [9] "online"    "at"        "www"       "gutenberg" "org"

Although text_read() is just a wrapper around readLines(), it has become more powerful, consistent and streamlined by (1) always producing UTF-8 encoded character vectors, (2) allowing the use of all readLines() options, e.g. n, and (3) adding further useful functionality such as on-the-fly tokenization.
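
How such a wrapper could be put together from base R pieces is sketched below. This is an assumption-laden illustration, not stringb's source: the function name read_text_sketch, its argument defaults and the decision to drop empty tokens are all hypothetical. Only readLines(), enc2utf8() and strsplit() are real base R calls.

```r
# Hypothetical helper; stringb's text_read() may differ in its details.
read_text_sketch <- function(file, tokenize = NULL, encoding = "UTF-8", ...) {
  # Extra arguments such as n are passed straight through to readLines().
  lines <- readLines(file, encoding = encoding, warn = FALSE, ...)
  # Convert to UTF-8 so the result is encoded consistently.
  lines <- enc2utf8(lines)
  if (is.null(tokenize)) {
    return(lines)
  }
  # Join the lines and split them on the regular expression in one step.
  tokens <- strsplit(paste(lines, collapse = "\n"), tokenize)[[1]]
  # Dropping empty tokens is a choice made for this sketch only.
  tokens[tokens != ""]
}

# Usage with a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("Robinson Crusoe", "met Friday."), tmp)
read_text_sketch(tmp, tokenize = "\\W+", n = 1)
```

Forwarding ... keeps every readLines() option available without listing it in the wrapper's signature, which is one way to get the pass-through behaviour described above.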

library(stringb)
(test_file <- stringb:::test_file("rc_3.txt")) # just a file accompanying the package to test things
## [1] "/home/peter/R/x86_64-pc-linux-gnu-library/3.3/stringb/testfiles/rc_3.txt"
text          <- text_read(test_file, tokenize = "\\W+")
friday_occurs <- text_detect(text, "FRIDAY", ignore.case=TRUE)
 
plot(friday_occurs, type = "n")
abline(v=which(friday_occurs))
title("Friday Appearing in Robinson Crusoe")

text_detect() is another example of a streamlined interface (easier to remember than grepl()) with all base R bells and whistles still there: almost all base R pattern matching functions have the ignore.case option to make pattern matching case insensitive.
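
For comparison, the effect of ignore.case can be seen with plain grepl(), the base R function that the paragraph above names as the counterpart of text_detect(). The vector words is made up for this illustration.

```r
# Made-up input for illustration
words <- c("Friday", "FRIDAY", "Saturday", "friday")

# Case-sensitive match: only the all-caps spelling is found.
grepl("FRIDAY", words)
## [1] FALSE  TRUE FALSE FALSE

# ignore.case = TRUE makes the same pattern match every spelling.
hits <- grepl("FRIDAY", words, ignore.case = TRUE)
which(hits)
## [1] 1 2 4
```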

library(stringb)
(test_file <- stringb:::test_file("rc_3.txt")) # just a file accompanying the package to test things
## [1] "/home/peter/R/x86_64-pc-linux-gnu-library/3.3/stringb/testfiles/rc_3.txt"
text          <- text_read(test_file)
 
plot(text, pattern="Friday", ignore.case=TRUE)

A simple method to plot character vectors with pattern markup was added as well.

News

version 0.1.13 [2016-11-01 ...]

  • BUGFIXES

    • contrary to the DESCRIPTION specification, the package did not support R >= 3.0.0, since the strrep() dependency was only introduced in R 3.2.5
  • FEATURE

  • DEVELOPMENT

version 0.1.12 [2016-10-31 ...]

  • BUGFIXES

  • FEATURE

    • CRAN submission
  • DEVELOPMENT

version 0.1.11 [2016-08-01 ...]

  • BUGFIXES

  • FEATURE

    • text_tokenize_sentences
  • DEVELOPMENT

version 0.1.10 [2016-08-01 ...]

  • BUGFIXES

    • text_locate_all() ignores ... (#10)
    • text_tokenize() (#12)
  • FEATURE

  • DEVELOPMENT

version 0.1.9 [2016-07-29 ...]

  • BUGFIXES

  • FEATURE

    • plot.character()
  • DEVELOPMENT

version 0.1.8 [2016-07-28 ...]

  • BUGFIXES

  • FEATURE

    • text_wrap()
    • text_pad()
    • text_split_n()
  • DEVELOPMENT

    • test : text_c()
    • test : text_collapse()
    • test : text_which()
    • test : text_subset()
    • test : text_filter()
    • test : text_which_value()
    • test : text_grep()
    • test : text_grepv()
    • test : text_grepl()
    • test : text_eval()
    • test : text_to_lower()

version 0.1.7 [2016-07-26 ...]

  • BUGFIXES

  • FEATURE

    • text_extract_group()
    • text_replace_group()
    • text_locate_group()
    • text_replace_locates()
    • text_extract() and text_extract_all() got invert parameter
  • DEVELOPMENT

    • helper : drop_non_group_matches()
    • helper : regmatches2()
    • helper : sequenize()
    • helper : de_sequenize()

version 0.1.6 [2016-07-25 ...]

  • BUGFIXES

  • FEATURE

    • text_sub()
    • text_subset()
    • text_filter()
    • text_c()
  • DEVELOPMENT

version 0.1.5 [2016-07-25 ...]

  • BUGFIXES

    • text_write() : minor fixes
  • FEATURE

    • text_replace()
    • text_replace_all()
    • text_trim()
    • text_delete()
  • DEVELOPMENT

    • putting tests in separate files

version 0.1.4 [2016-07-22 ...]

  • BUGFIXES

  • FEATURE

    • text_collapse()
  • DEVELOPMENT

    • text_collapse() : restructure methods / recursive behaviour for lists

version 0.1.3 [2016-07-22 ...]

  • BUGFIXES

  • FEATURE

    • text_to_lower()
    • text_to_upper()
    • text_to_title_case()
  • DEVELOPMENT

version 0.1.2 [2016-07-22 ...]

  • BUGFIXES

  • FEATURE

    • text_dup()
  • DEVELOPMENT

    • adding code coverage

version 0.1.1 [2016-07-17 ...]

  • BUGFIXES

  • FEATURE

    • text_read() : pass through of options to readLines
    • text_write()
    • text_detect()
    • text_locate()
    • text_locate_all()
    • text_split()
    • text_count()
  • DEVELOPMENT

version 0.1.0 [2016-07-15 ...]

  • BUGFIXES

  • FEATURE

  • DEVELOPMENT

    • forking string functions away from diffrprojects package
    • and cleaning up and restructuring the parent project


install.packages("stringb")

0.1.13 by Peter Meissner, a year ago


https://github.com/petermeissner/stringb


Report a bug at https://github.com/petermeissner/stringb/issues


Browse source code at https://github.com/cran/stringb


Authors: Peter Meissner [aut, cre]



MIT + file LICENSE license


Imports graphics, tools, backports

Suggests testthat, knitr, rmarkdown


Depended on by diffrprojects, rtext.

