Character String Processing Facilities

Allows for fast, correct, consistent, portable, as well as convenient character string/text processing in every locale and any native encoding. Owing to the use of the ICU library, the package provides R users with platform-independent functions known to Java, Perl, Python, PHP, and Ruby programmers. Available features include: pattern searching (e.g., with ICU Java-like regular expressions or the Unicode Collation Algorithm), random string generation, case mapping, string transliteration, concatenation, Unicode normalization, date-time formatting and parsing, etc.


News

               stringi package NEWS and CHANGELOG

===============================================================================

  • [REGEX] #147: regex look-behind assertions may fail to find a multibyte Unicode search pattern [solved in ICU4C 57m1, see http://bugs.icu-project.org/trac/ticket/11554]

  • [BUGFIX] round() and snprintf() are not available in C++98.

  • [BUGFIX] #214: allow a regex pattern like .* to match an empty string.

  • [BUGFIX] #210: stri_replace_all_fixed(c("1", "NULL"), "NULL", NA) now results in c("1", NA).

  • [NEW FEATURE] #199: stri_sub<- now allows for ignoring NA locations (a new omit_na argument added).

  • [NEW FEATURE] #207: stri_sub<- now allows for substring insertions (via length=0).

  • [NEW FUNCTION] #124: stri_subset<- functions added.

  • [NEW FEATURE] #216: stri_detect, stri_subset, stri_subset<- gained a negate argument.

  • [NEW FUNCTION] #175: stri_join_list concatenates all strings in a list of character vectors. Useful with, e.g., stri_extract_all_regex, stri_extract_all_words etc.
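
The negate argument and stri_join_list() from the entries above can be sketched as follows. This is a minimal illustration, assuming a stringi version that already includes these features; the sample vectors x and y are made up for the example.

```r
library(stringi)

x <- c("a1b22", "c3", "no digits")

# negate = TRUE flips the result of the detection.
no_digit <- stri_detect_regex(x, "[0-9]", negate = TRUE)

# Keep only the strings that do NOT contain a digit.
kept <- stri_subset_regex(x, "[0-9]", negate = TRUE)

# Extract all runs of digits (a list of character vectors), then
# concatenate the strings within each list element.
y <- c("a1b22", "c3")
digits <- stri_extract_all_regex(y, "[0-9]+")
joined <- stri_join_list(digits, sep = "+")
```

Here stri_join_list() turns the list returned by stri_extract_all_regex() back into a plain character vector, one string per input element.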


  • [GENERAL] #88: C API is now available for use in, e.g., Rcpp packages, see https://github.com/gagolews/ExampleRcppStringi for an example.

  • [BUGFIX] #183: Floating point exception raised in stri_sub() and stri_sub<-() when to or length was a zero-length numeric vector.

  • [BUGFIX] #180: stri_c() warned incorrectly (recycling rule) when using more than 2 elements.


  • [BACKWARD INCOMPATIBILITY] stri_install_check() and stri_install_icudt() are now deprecated. From now on they are supposed to be used only by the stringi installer.

  • [BUGFIX] #176: a patch for sys/feature_tests.h is no longer included (the original file was copyrighted by Sun Microsystems); fixed the "Compiler or options invalid for pre-UNIX 03 X/Open applications and pre-2001 POSIX applications" error by forcing (conditionally) _XPG6 conformance.

  • [BUGFIX] #174: stri_paste() did not generate any warning when the recycling rule was violated and sep=="".

  • [BUGFIX] #170: icu::setDataDirectory no longer called if our ICU source bundle is not used (this used to cause build problems on openSUSE).

  • [BUILD TIME] #169: ./configure now tries to switch to the "standard" C++ compiler if a C++11 one is not properly configured.

  • [BUILD TIME] configure.win (Biarch: TRUE) now mimics autoconf's AC_SUBST and AC_CONFIG_FILES so that the build process is now more similar across different platforms.

  • [NEW FEATURE] stri_info() now also gives information about which version of ICU4C is in use (system or bundle).


  • [BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.

  • [GENERAL] #69: stringi is now bundled with ICU4C 55.1.

  • [NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.

  • [NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.

  • [NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicodish fashion than nchar(..., "width")

  • [NEW FEATURE] #149: stri_pad() and stri_wrap() are now based by default on code point widths instead of the number of code points. Moreover, by default stri_wrap() no longer removes non-breaking, zero-width, and similar spaces.

  • [NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).

  • [NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.

  • [NEW FUNCTIONS] #137: date-time formatting/parsing:

    • stri_timezone_list() - lists all known time zone identifiers
    • stri_timezone_set(), stri_timezone_get() - manage current default time zone
    • stri_timezone_info() - basic information on a given time zone
    • stri_datetime_symbols() - localizable date-time formatting data
    • stri_datetime_fstr() - convert a strptime-like format string to an ICU date/time format string
    • stri_datetime_format() - convert date/time to string
    • stri_datetime_parse() - convert string to date/time object
    • stri_datetime_create() - construct date-time objects from numeric representations
    • stri_datetime_now() - return current date-time
    • stri_datetime_fields() - get values for date-time fields
    • stri_datetime_add() - add specific number of date-time units to a date-time object

  • [GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations).

  • [GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC's implementation of strchr() and strstr(). This is very fast e.g. on glibc utilizing the SSE2/3/4 instruction set.

  • [BUILD TIME] #141: a local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.

  • [BUILD TIME] #165: the ./configure option --disable-icu-bundle forces the use of system ICU when building the package.

  • [BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g. @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.

  • [BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.

  • [BUGFIX] #132: incorrect behavior in stri_locate_regex() for matches of zero lengths

  • [BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.

  • [BUGFIX] #164: libicu-dev usage used to fail on Ubuntu (LIBS shall be passed after LDFLAGS and the list of .o files).

  • [BUGFIX] #168: Build now fails if icudt is not available.

  • [BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.

  • [BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.

  • [BUGFIX] Force ICU u_init() call on stringi dynlib load.

  • [BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.
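
The date-time functions listed in #137 above might be combined as in the following sketch. It assumes ICU data (icudt) is available and passes tz = "UTC" explicitly so that the result does not depend on the session's default time zone; the dates are arbitrary examples.

```r
library(stringi)

# Build a date-time object from numeric fields.
t0 <- stri_datetime_create(2015, 4, 1, tz = "UTC")

# Add one calendar day.
t1 <- stri_datetime_add(t0, 1, units = "days", tz = "UTC")

# Format using an ICU date/time pattern.
formatted <- stri_datetime_format(t1, "yyyy-MM-dd", tz = "UTC")

# Convert a strptime-like format string to an ICU pattern.
icu_pattern <- stri_datetime_fstr("%Y-%m-%d")

# Parse a string into a date-time object (fields absent from the
# pattern, such as the time of day, are not controlled here).
parsed <- stri_datetime_parse("2015-04-02", "yyyy-MM-dd", tz = "UTC")
```

Note that stringi uses ICU pattern letters (yyyy, MM, dd) rather than strptime-style % codes; stri_datetime_fstr() is the bridge between the two.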


  • [IMPORTANT CHANGE] n_max argument in stri_split_*() has been renamed n.

  • [IMPORTANT CHANGE] simplify=TRUE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.

  • [IMPORTANT CHANGE, NEW FUNCTIONS] #120: stri_extract_words has been renamed stri_extract_all_words, stri_locate_boundaries has been renamed stri_locate_all_boundaries, and stri_locate_words has been renamed stri_locate_all_words. New functions are now available: stri_locate_first_boundaries, stri_locate_last_boundaries, stri_locate_first_words, stri_locate_last_words, stri_extract_first_words, stri_extract_last_words.

  • [IMPORTANT CHANGE] #111: opts_regex, opts_collator, opts_fixed, and opts_brkiter can now be supplied individually via .... In other words, you may now simply call e.g. stri_detect_regex(str, pattern, case_insensitive=TRUE) instead of stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE)).

  • [NEW FEATURE] #110: Fixed pattern search engine's settings can now be supplied via opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). A simple (not suitable for natural language processing) yet very fast case_insensitive pattern matching can be performed now. stri_extract_*_fixed is again available.

  • [NEW FEATURE] #23: stri_extract_all_fixed, stri_count, and stri_locate_all_fixed may now also look for overlapping pattern matches, see ?stri_opts_fixed.

  • [NEW FEATURE] #129: stri_match_*_regex gained a cg_missing argument.

  • [NEW FEATURE] #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.

  • [NEW FEATURE] #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover, Knuth's dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.

  • [NEW FEATURE] #122: stri_subset() gained an omit_na argument.

  • [NEW FEATURE] stri_list2matrix() gained an n_min argument.

  • [NEW FEATURE] #126: stri_split() now is also able to act just like stringr::str_split_fixed().

  • [NEW FEATURE] #119: stri_split_boundaries() now has n, tokens_only, and simplify arguments. Additionally, stri_extract_all_words() is now equipped with a simplify argument.

  • [NEW FEATURE] #116: stri_paste() gained a new argument: ignore_null. Setting it to TRUE makes this function more compatible with paste().

  • [OTHER] #123: useDynLib is used to speed up symbol look-up in the compiled dynamic library.

  • [BUGFIX] #114: stri_paste(): could return result in an incorrect order.

  • [BUGFIX] #94: Run-time errors on Solaris caused by setting -DU_DISABLE_RENAMING=1 (memory allocation errors in, among others, ICU's UnicodeString). This setting also caused some ASan sanity check failures within ICU code.
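
Several of the search-related changes above (#111, #23, #117) can be illustrated with a short sketch. The input strings are invented for the example, and it assumes a stringi version that includes these features.

```r
library(stringi)

# #111: engine options may be passed directly via `...` ...
a <- stri_detect_regex("STRINGI", "stringi", case_insensitive = TRUE)
# ... instead of building an options list explicitly.
b <- stri_detect_regex("STRINGI", "stringi",
                       opts_regex = stri_opts_regex(case_insensitive = TRUE))

# #23: by default matches do not overlap; overlap = TRUE counts
# every starting position at which the pattern occurs.
n_default <- stri_count_fixed("aaaa", "aa")
n_overlap <- stri_count_fixed("aaaa", "aa", overlap = TRUE)

# #117: with omit_no_match = TRUE a string without any match yields
# an empty character vector rather than NA.
none <- stri_extract_all_regex("abc", "[0-9]", omit_no_match = TRUE)
```

In "aaaa" the pattern "aa" occurs at positions 1, 2, and 3, so the overlapping count is 3 while the default, non-overlapping count is 2.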


  • [IMPORTANT CHANGE] #87: %>% overlapped with the pipe operator from the magrittr package; now each operator like %>% has been renamed %s>%.

  • [IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis) may be better controlled via stri_opts_brkiter() (see options type and locale which aim to replace now-removed boundary and locale parameters to stri_locate_boundaries, stri_split_boundaries, stri_trans_totitle, stri_extract_words, stri_locate_words).

  • [NEW FUNCTIONS] #109: stri_count_boundaries() and stri_count_words() count the number of text boundaries in a string.

  • [NEW FUNCTIONS] #41: stri_startswith_*() and stri_endswith_*() determine whether a string starts or ends with a given pattern.

  • [NEW FEATURE] #102: stri_replace_all_*() gained a vectorize_all parameter, which defaults to TRUE for backward compatibility.

  • [NEW FUNCTION] #91: stri_subset_*(), a convenient and more efficient substitute for str[stri_detect_*(str, ...)], added.

  • [NEW FEATURE] #100: stri_split_fixed(), stri_split_charclass(), stri_split_regex(), stri_split_coll() gained a tokens_only parameter, which defaults to FALSE for backward compatibility.

  • [NEW FUNCTION] #105: stri_list2matrix() converts lists of atomic vectors to character matrices, useful in connection with stri_split and stri_extract.

  • [NEW FEATURE] #107: stri_split_*() now allow setting an omit_empty=NA argument.

  • [NEW FEATURE] #106: stri_split() and stri_extract_all() gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).

  • [NEW FUNCTION] #77: stri_rand_lipsum() generates (pseudo)random dummy lorem ipsum text.

  • [NEW FEATURE] #98: stri_trans_totitle() gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when performing case mapping.

  • [NEW FEATURE] stri_wrap() gained a new parameter: normalize.

  • [BUGFIX] #86: stri_*_fixed(), stri_*_coll(), and stri_*_regex() could give incorrect results if one of the search strings was of length 0.

  • [BUGFIX] #99: stri_replace_all() did not use the replacement arg.

  • [BUGFIX] #112: Some of the objects were not PROTECTed from being garbage collected, which might have caused spontaneous SEGFAULTS.

  • [BUGFIX] Some collator's options were not passed correctly to ICU services.

  • [BUGFIX] Causes of memory leaks detected by valgrind --tool=memcheck --leak-check=full have been removed.

  • [DOCUMENTATION] Significant extensions/clean ups in the stringi manual.
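
A few of the functions and parameters introduced above (#41, #91, #102, #106) are sketched below. The sample data are arbitrary, and the behaviour shown is that of a current stringi release, which may differ in details from the version these entries describe.

```r
library(stringi)

x <- c("stringi", "stringr", "base")

# #41: prefix and suffix tests.
starts <- stri_startswith_fixed(x, "str")
ends <- stri_endswith_fixed(x, "i")

# #91: stri_subset_*() as a shortcut for x[stri_detect_*(x, ...)].
sub1 <- stri_subset_fixed(x, "string")
sub2 <- x[stri_detect_fixed(x, "string")]

# #102: with vectorize_all = FALSE every pattern/replacement pair
# is applied in turn to each input string.
replaced <- stri_replace_all_fixed("a b c", c("a", "b"), c("x", "y"),
                                   vectorize_all = FALSE)

# #106: simplify = TRUE returns a character matrix instead of a list.
m <- stri_split_fixed(c("a,b", "c"), ",", simplify = TRUE)
```

With the default vectorize_all = TRUE, the same stri_replace_all_fixed() call would instead recycle the input string against the two patterns and return two results.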


  • icudt-dependent examples are no longer run if icudt is not available.

  • [BUGFIX] Issues with loading of misaligned addresses in stri_*_fixed().

  • [IMPORTANT CHANGE] stri_cmp*() now do not allow for passing opts_collator=NA. From now on, stri_cmp_eq, stri_cmp_neq, and the new operators %===%, %!==%, %stri===%, and %stri!==% are locale-independent operations, which are based on code point comparisons. New functions stri_cmp_equiv and stri_cmp_nequiv (and from now on also %==%, %!=%, %stri==%, and %stri!=%) test for canonical equivalence.

  • [IMPORTANT CHANGE] stri_*_fixed() search functions now perform a locale-independent exact (byte-wise, of course after conversion to UTF-8) pattern search. All the Collator-based, locale-dependent search routines are now available via stri_*_coll(). The reason for this is that ICU USearch currently has very poor performance and in many search tasks it is in fact sufficient to do exact pattern matching.

  • [GENERAL] stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm, which improves the search performance drastically.

  • [IMPORTANT CHANGE] stri_enc_nf*() and stri_enc_isnf*() function families have been renamed to stri_trans_nf*() and stri_trans_isnf*(), respectively. This is because they deal with text transforming, and not with character encoding. Moreover, all such operations may also be performed by ICU's Transliterator (see below).

  • [NEW FUNCTION] stri_trans_general() and stri_trans_list() give access to ICU's Transliterator: may be used to perform very general text transforms.

  • [NEW FUNCTION] stri_split_boundaries() utilizes ICU's BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries indicates positions of these boundaries.

  • [NEW FUNCTION] stri_extract_words() uses ICU's BreakIterator to extract all words from a text. Additionally, stri_locate_words locates start and end positions of words in a text.

  • [NEW FUNCTION] stri_pad(), stri_pad_left(), stri_pad_right(), and stri_pad_both pad a string with a specific code point.

  • [NEW FUNCTION] stri_wrap() breaks paragraphs of text into lines. Two algorithms (greedy and minimal raggedness) are available.

  • [IMPORTANT CHANGE] stri_*_charclass() search functions now rely solely on ICU's UnicodeSet patterns. All previously accepted charclass identifiers became invalid. However, new patterns should now be more familiar to the users (they are regex-like). Moreover, we observe a very nice performance gain.

  • [IMPORTANT CHANGE] stri_sort() now does not include NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector's attributes are preserved.

  • [NEW FUNCTION] stri_unique() extracts unique elements from a character vector.

  • [NEW FUNCTIONS] stri_duplicated() and stri_duplicated_any() determine duplicate elements in a character vector.

  • [NEW FUNCTION] stri_replace_na() replaces NAs in a character vector with a given string, useful for emulating e.g. R's paste() behavior.

  • [NEW FUNCTION] stri_rand_shuffle() generates a random permutation of code points in a string.

  • [NEW FUNCTION] stri_rand_strings() generates random strings.

  • [NEW FUNCTIONS] New functions and binary operators for string comparison: stri_cmp_eq(), stri_cmp_neq(), stri_cmp_lt(), stri_cmp_le(), stri_cmp_gt(), stri_cmp_ge(), %==%, %!=%, %<%, %<=%, %>%, %>=%.

  • [NEW FUNCTION] stri_enc_mark() reads declared encodings of character strings as seen by stringi.

  • [NEW FUNCTION] stri_enc_tonative(str) is an alias to stri_encode(str, NULL, NULL).

  • [NEW FEATURE] stri_order() and stri_sort() now have an additional argument na_last (defaults to TRUE and NA, respectively).

  • [NEW FEATURE] stri_replace_all_charclass(), stri_extract_all_charclass(), and stri_locate_all_charclass() now have a new arg, merge (defaults to FALSE for backward-compatibility). It may be used to e.g. replace sequences of white spaces with a single space.

  • [NEW FEATURE] stri_enc_toutf8() now has a new validate arg (defaults to FALSE for backward-compatibility). It may be used in a (rare) case in which a user wants to fix an invalid UTF-8 byte sequence. stri_length (among others) now detects invalid UTF-8 byte sequences.

  • [NEW FEATURE] All binary operators %???% now also have aliases %stri???%.

  • [GENERAL] Performance improvements in StriContainerUTF8 and StriContainerUTF16 (they affect most other functions).

  • [GENERAL] Significant performance improvements in stri_join(), stri_flatten(), stri_cmp(), stri_trans_to*(), and others.

  • [GENERAL] Added 3rd mirror site for our icudt binary distribution.

  • U_MISSING_RESOURCE_ERROR message in StriException now suggests calling stri_install_check().

  • [BUGFIX] UTF-8 BOMs are now silently removed from input strings.

  • [BUGFIX] no more attempts to re-encode UTF-8 encoded strings if native encoding=UTF-8 in StriContainerUTF8.

  • [BUGFIX] possible memory leaks when throwing errors via Rf_error().

  • [BUGFIX] stri_order() and stri_cmp() could return incorrect results for opts_collator=NA.

  • [BUGFIX] stri_sort() did not guarantee to return strings in UTF-8.
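
The distinction drawn above between code point comparison (stri_cmp_eq) and canonical equivalence (stri_cmp_equiv), along with the merge argument and stri_replace_na(), can be sketched as follows. The strings are illustrative only; the two spellings of "á" are the precomposed letter and "a" followed by a combining acute accent.

```r
library(stringi)

composed <- "\u00e1"     # single code point U+00E1
decomposed <- "a\u0301"  # "a" + combining acute accent U+0301

# Code point comparison: the two strings differ.
same_code_points <- stri_cmp_eq(composed, decomposed)

# Canonical equivalence: the two strings represent the same text.
equivalent <- stri_cmp_equiv(composed, decomposed)

# merge = TRUE replaces each run of matching code points at once,
# e.g. to squeeze consecutive white spaces into a single space.
squeezed <- stri_replace_all_charclass("a  \t b", "\\p{WHITE_SPACE}", " ",
                                       merge = TRUE)

# Replace missing values with a given string.
filled <- stri_replace_na(c("a", NA), "<missing>")
```

Normalizing both strings first, for example with stri_trans_nfc(), is another way to make a code point comparison succeed on canonically equivalent input.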


  • LICENSE tweaks.

  • Initial CRAN release.


  • Fixed bugs detected with ASan and UBSan, e.g. fixed CharClass::gcmask type (enum -> uint32_t) (reported by UBSan).

  • Fixed array over-runs detected with valgrind in string8.h.

  • Fixed uninitialized class fields in StriContainerUTF8 (reported by valgrind).


  • License changed to BSD-3-clause, COPYRIGHTS updated.

  • icudt is not shipped with stringi anymore; it is now downloaded in install.libs.R from one of our servers.

  • New functions: stri_install_check(), stri_install_icudt().


  • System ICU is used on systems which do have one (version >= 50 needed). ICU is autodetected with pkg-config in ./configure. Pass '--disable-pkg-config' to ./configure to force building ICU from sources.

  • icudt52b (custom subset) is now shipped with stringi (for big-endian, ASCII systems).


  • Fixed some Solaris-related issues while preparing stringi for CRAN submission.

  • ICU4C 52.1 sources included (common, i18n, stubdata + icu52dt.dat loaded dynamically). Compilation via Makevars.

  • stringi now does not depend on any external libraries.


  • ICU4C is now statically linked on Windows.

  • First OS X binary build.

  • The package is being intensively tested by our students @ FMIS WUT.


  • Using pkg-config via ./configure to look for ICU4C libs.

  • First Windows binary build.

  • Compilation passed on Oracle Sun Studio compiler collection.

  • By now we have implemented most of the functionality scheduled for milestone 0.1.


  • The stringi project has been established on GitHub.


install.packages("stringi")

1.1.5 by Marek Gagolewski, 19 days ago


http://www.gagolewski.com/software/stringi/ http://site.icu-project.org/ http://www.unicode.org/


Report a bug at http://github.com/gagolews/stringi/issues


Browse source code at https://github.com/cran/stringi


Authors: Marek Gagolewski [aut, cre], Bartek Tartanus [ctb], and other contributors (stringi source code); IBM and other contributors (ICU4C 55.1 source code); Unicode, Inc. (Unicode Character Database)


Documentation:   PDF Manual  


Task views: Natural Language Processing


License: file LICENSE


Imports: tools, utils, stats

System requirements: ICU4C (>= 52, optional)


Imported by CITAN, Ecfun, GetTDData, OpenML, RPresto, RcmdrPlugin.temis, RndTexExams, Rtextrankr, SpaDES, TeXCheckR, VarfromPDB, alakazam, assertive.strings, batchtools, bea.R, biomartr, censys, chinese.misc, countyweather, dplR, drake, easyformatr, epidata, eurostat, evaluator, eyelinker, farff, fiery, flippant, gdns, gfer, haploReconstruct, hrbrthemes, hyphenatr, ie2misc, inpdfr, mlr, mscstexta4r, neuropsychology, optigrab, pathological, placement, plumber, poio, qdapRegex, qlcData, quanteda, randstr, rangeBuilder, rattle, roxygen2, rprime, rprintf, rslp, searchable, sejmRP, sentimentr, shazam, stplanr, stringr, syllable, textclean, textshape, tidyr, tokenizers, wakefield, wand, wikilake, xgboost.

Suggested by CorporaCoCo, RInno, assignPOP, caRpools, genie, iemiscdata, qdap, readr, rebus.base, rebus.unicode, rvest, stylo, swirl, textmining, tm.plugin.alceste.

