Unicode Text Processing

Process and print 'UTF-8' encoded international text (Unicode). Input, validate, normalize, encode, format, and display.

utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling.


utf8 is available on CRAN. To install the latest released version, run the following command in R:

install.packages("utf8")

Development version

To install the latest development version, run the following:

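# clone the repository, including its git submodule, into a temporary directory
tmp <- tempfile()
system2("git", c("clone", "--recursive", shQuote("https://github.com/patperry/r-utf8.git"), shQuote(tmp)))
devtools::install(tmp)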

Note that utf8 uses a git submodule, so you cannot use devtools::install_github.


Validate character data and convert to UTF-8

Use as_utf8 to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:

# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4
# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
#> [1] "façile" "façile" "façile"
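When the true encoding of the invalid entries is known, the check-and-repair steps above could be wrapped in a small helper. The following is a minimal sketch, not part of the package: repair_latin1 is a hypothetical name, and it assumes that every entry failing utf8_valid is really latin-1, which will not hold for arbitrary data.

```r
library(utf8)

# Hypothetical helper (not part of utf8): assumes that any entry failing
# validation is actually latin-1 and re-marks it before converting.
repair_latin1 <- function(x) {
  bad <- which(!utf8_valid(x))    # indices of entries that are not valid UTF-8
  if (length(bad) > 0) {
    Encoding(x[bad]) <- "latin1"  # assumption: the true encoding is latin-1
  }
  as_utf8(x)
}

x <- c("fa\u00E7ile", "fa\xE7ile")
Encoding(x) <- c("UTF-8", "UTF-8")  # second entry is mis-declared
y <- repair_latin1(x)
```

Guessing an encoding silently can corrupt data, so a helper like this is only appropriate when the source of the text is known.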

Normalize data

Use utf8_normalize to convert to Unicode composed normal form (NFC). Optionally, apply compatibility maps (for NFKC normal form) or case folding.

# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
#> [1] "Å" "Å" "Å"
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE
# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
#> [1] "grösse"
# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
               map_compat = TRUE)
#> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."
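One place normalization tends to matter is when comparing or deduplicating strings. The sketch below uses made-up sample keys (not from the package documentation) that spell the same word with different code points and letter case, and folds them to a common form with the map_case argument shown above.

```r
library(utf8)

# Illustrative data: three spellings of the same word
keys <- c("\u00c5ngstr\u00f6m",    # precomposed characters
          "A\u030angstro\u0308m",  # base letters plus combining marks
          "\u212bNGSTR\u00d6M")    # angstrom sign, upper case

# Normalize to NFC and case-fold so that equivalent spellings compare equal
folded <- utf8_normalize(keys, map_case = TRUE)

unique(keys)    # the raw strings are all distinct
unique(folded)  # the folded strings collapse together
```

Case folding is lossy, so it is usually safer to keep the original strings and use the folded form only as a lookup key.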

Print emoji

On some platforms (including macOS), R's print function uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print for an updated print function:

print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606\U0001f607\U0001f608\U0001f609\U0001f60a\U0001f60b\U0001f60c\U0001f60d\U0001f60e\U0001f60f\U0001f610\U0001f611\U0001f612\U0001f613\U0001f614\U0001f615\U0001f616\U0001f617\U0001f618\U0001f619\U0001f61a\U0001f61b\U0001f61c\U0001f61d\U0001f61e\U0001f61f\U0001f620\U0001f621\U0001f622\U0001f623\U0001f624\U0001f625\U0001f626\U0001f627\U0001f628\U0001f629\U0001f62a\U0001f62b\U0001f62c\U0001f62d\U0001f62e\U0001f62f\U0001f630\U0001f631\U0001f632\U0001f633\U0001f634\U0001f635\U0001f636\U0001f637\U0001f638\U0001f639\U0001f63a\U0001f63b\U0001f63c\U0001f63d\U0001f63e\U0001f63f\U0001f640\U0001f641\U0001f642\U0001f643\U0001f644\U0001f645\U0001f646\U0001f647\U0001f648\U0001f649\U0001f64a\U0001f64b\U0001f64c\U0001f64d\U0001f64e\U0001f64f"
utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​…"
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​😬​😭​😮​😯​😰​😱​😲​😳​😴​😵​😶​😷​😸​😹​😺​😻​😼​😽​😾​😿​🙀​🙁​🙂​🙃​🙄​🙅​🙆​🙇​🙈​🙉​🙊​🙋​🙌​🙍​🙎​🙏​"


Cite utf8 with the following BibTeX entry:

@Manual{,
  title = {utf8: Unicode Text Processing},
  author = {Patrick O. Perry},
  year = {2017},
  note = {R package version 1.1.2},
  url = {https://CRAN.R-project.org/package=utf8},
}


The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code.

This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.


utf8 1.1.3


  • Make output_utf8() always return TRUE on Windows, so that characters in the user's native locale don't get escaped by utf8_encode(). The downside of this change is that on Windows, utf8_width() reports the wrong values for characters outside the user's locale when stdout() is redirected by knitr or another process.

  • When truncating long strings via utf8_format(), use an ellipsis that is printable in the user's native locale ("\u2026" or "...").

utf8 1.1.2 (2017-12-14)


  • Fix bug in utf8_format() with non-NULL width argument.

utf8 1.1.1 (2017-11-28)


  • Fix PROTECT bug in as_utf8().

utf8 1.1.0 (2017-11-20)


  • Added output_ansi() and output_utf8() functions to test for output capabilities.


  • Add utf8 argument to utf8_encode(), utf8_format(), utf8_print(), and utf8_width() for precise control over assumed output capabilities; defaults to the result of output_utf8().

  • Add ability to style backslash escapes with the escapes arguments to utf8_encode() and utf8_print(). Switch from "faint" styling to no styling by default.

  • Slightly reword error messages for as_utf8().

  • Fix (spurious) rchk warnings.


  • Fix bug in utf8_width() determining width of non-ASCII strings when LC_CTYPE=C.


  • No longer export the C version of as_utf8() (the R version is still present).
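The capability functions and the utf8 argument introduced in 1.1.0 could be used along these lines. This is a sketch only; the variable names are illustrative and the detected capabilities depend on where the code runs.

```r
library(utf8)

# Report what the package detects for the current output stream
caps <- c(utf8 = output_utf8(), ansi = output_ansi())

x <- "fa\u00E7ile"

# Override the detected capabilities explicitly
enc_ascii <- utf8_encode(x, utf8 = FALSE)  # treat the output as ASCII-only
enc_utf8  <- utf8_encode(x, utf8 = TRUE)   # assume a UTF-8 capable display
```

Passing utf8 explicitly is mainly useful when output is redirected (for example by knitr) and automatic detection may not match the final display.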

utf8 1.0.0 (2017-11-06)


  • Split off functions as_utf8(), utf8_valid(), utf8_normalize(), utf8_encode(), utf8_format(), utf8_print(), and utf8_width() from corpus package.

  • Added special handling for Unicode grapheme clusters in formatting and width measurement functions.

  • Added ANSI styling to escape sequences.

  • Added ability to style row and column names in utf8_print().
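The grapheme-cluster handling in width measurement can be illustrated by comparing utf8_width with base R's nchar. A minimal sketch with illustrative inputs; utf8 = TRUE is passed so the widths do not depend on the detected output capabilities.

```r
library(utf8)

x <- c("\u00c5",       # precomposed A with ring
       "A\u030a",      # A followed by a combining ring
       "\U0001F600")   # an emoji

n <- nchar(x)                    # counts code points
w <- utf8_width(x, utf8 = TRUE)  # counts display columns
```

The two spellings of the same letter differ in code point count but should occupy the same number of columns.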


Version 1.1.3 by Patrick O. Perry


Report a bug at https://github.com/patperry/r-utf8/issues

Browse source code at https://github.com/cran/utf8

Authors: Patrick O. Perry [aut, cph, cre], Unicode, Inc. [cph, dtc] (Unicode Character Database)


License: Apache License (== 2.0) | file LICENSE

Suggests corpus, knitr, rmarkdown, testthat

Imported by corpus, pillar.
