Implements an S3 class for storing 'UTF-8' strings, based on regular character vectors. Also contains routines to portably read and write 'UTF-8' encoded text files, to convert all strings in an object to 'UTF-8', and to create character vectors with various encodings.
Portable tools for UTF-8 character data
The character encoding of a text determines how its letters, digits, and other code points (the atomic components of a text) are translated into a sequence of bytes. A byte sequence may translate into valid text in one character encoding but into nonsense in another.
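To make this concrete, the following sketch uses only base R (not the enc package) to show the same character as bytes in two encodings, and what happens when UTF-8 bytes are misread as latin1. The variable names are illustrative.

```r
# "\u00e4" is the letter a-umlaut; the escape keeps this script pure ASCII.
x <- enc2utf8("\u00e4")

# The same character maps to different byte sequences in different encodings.
utf8_bytes <- charToRaw(x)                                          # c3 a4
latin1_bytes <- charToRaw(iconv(x, from = "UTF-8", to = "latin1"))  # e4

# Misinterpreting the two UTF-8 bytes as latin1 yields two unrelated
# characters instead of one: the classic "mojibake" effect.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

print(utf8_bytes)
print(latin1_bytes)
print(charToRaw(misread))
```

The misread string is valid text, just not the text that was intended, which is why such errors often go unnoticed until the data is displayed.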
For historic reasons, R can store strings in different ways:
On OS X and Linux, the "native" encoding is often UTF-8, but on Windows it is not. To add to the confusion, the encoding is a property of individual strings in a character vector, and not of the entire vector.
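A short base R sketch (independent of the enc package) illustrates that the encoding mark belongs to each string rather than to the vector. It assumes a platform whose iconv supports latin1, which is the usual case.

```r
# Build one vector whose elements carry three different encoding marks.
ascii  <- "abc"
utf8   <- enc2utf8("\u00e4")
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")

x <- c(ascii, utf8, latin1)

# base::Encoding() reports the mark element by element.
# Pure ASCII strings are never marked and show up as "unknown".
marks <- Encoding(x)
print(marks)
#> [1] "unknown" "UTF-8"   "latin1"
```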
When working with text, it is advisable to use UTF-8, because it can encode virtually any text, including languages whose symbols cannot be represented in your system's native encoding. The UTF-8 encoding possesses several nice technical properties, and is by far the predominant encoding on the Web. Standardization on a "universal" encoding facilitates data exchange.
Because of R's special handling of strings, some care must be taken to make sure that you're actually using the UTF-8 encoding. Many functions in R will hide encoding issues from you, and transparently convert to UTF-8 as necessary. However, some functions (such as reading and writing files) will stubbornly prefer the native encoding.
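As an illustration of the file I/O pitfall, here is a minimal base R sketch that forces UTF-8 on both the write and the read side. It does not use the enc package; it only shows the kind of boilerplate that is otherwise needed, and the temporary file path is an assumption of the example.

```r
path <- tempfile(fileext = ".txt")
x <- enc2utf8(c("a", "\u00e4"))

# Open in binary mode so that "\n" is not translated to CRLF on Windows,
# and use useBytes = TRUE so the UTF-8 bytes are written without re-encoding.
con <- file(path, open = "wb")
writeLines(x, con, sep = "\n", useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the non-ASCII strings that are read as UTF-8
# instead of leaving them to be interpreted in the native encoding.
y <- readLines(path, encoding = "UTF-8")

file_bytes <- readBin(path, what = "raw", n = 100L)
print(file_bytes)
print(Encoding(y))
```

Without `useBytes = TRUE` and `encoding = "UTF-8"`, the result depends on the native encoding of the machine running the code.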
The enc package provides helpers for converting all textual components of an object to UTF-8, and for reading and writing files in UTF-8 (with an LF end-of-line terminator by default). It also defines an S3 class for tagging all-UTF-8 character vectors and ensuring that updates maintain the UTF-8 encoding. Examples of other packages that use UTF-8 by default are:
library(enc)
utf8(c("a", "ä"))
as_utf8(1)
#> [1] "1"
a <- utf8("ä")
a[2] <- "ö"
class(a)
#> [1] "utf8"
data.frame(abc = letters[1:3], utf8 = utf8(letters[1:3]))
#>   abc utf8
#> 1   a    a
#> 2   b    b
#> 3   c    c
Install the package from GitHub:
# install.packages("devtools")
devtools::install_github("krlmlr/enc")
Initial release.
- utf8 class with constructor, coercion, combination, formatting, printing, and checked updates.
- to_encoding() performs deep encoding conversion of objects, including names and other attributes. Variants: to_utf8(), to_native(), to_latin1() and to_alien().
- encoding(), returns "ASCII" for pure ASCII strings and behaves identically to base::Encoding() otherwise.
- all_utf8(), returns a logical scalar that indicates if all elements of a character vector are UTF-8 encoded; this includes pure ASCII strings.
- read_lines_enc(), try_read_lines_enc(), and write_lines_enc() for robust reading and writing of text files. Returns/accepts objects of class utf8.
- transform_lines_enc(), with robust handling if only some files could be transformed in transform_lines_enc(). Uses try_read_lines_enc(), therefore only warns if file is missing. Auto-detects and maintains EOL delimiter.
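
The behavior described for encoding() and all_utf8() can be approximated in base R. The sketch below is not the package's implementation. The names is_ascii_sketch(), encoding_sketch() and all_utf8_sketch() are hypothetical stand-ins, and the handling of NA values is left out.

```r
# Hypothetical helper: TRUE for strings that consist only of bytes <= 0x7f.
is_ascii_sketch <- function(x) {
  vapply(
    x,
    function(s) all(charToRaw(s) <= as.raw(0x7f)),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Like base::Encoding(), but reports "ASCII" for pure ASCII strings.
encoding_sketch <- function(x) {
  enc <- Encoding(x)
  enc[is_ascii_sketch(x)] <- "ASCII"
  enc
}

# A single TRUE/FALSE: are all elements either ASCII or marked as UTF-8?
all_utf8_sketch <- function(x) {
  all(encoding_sketch(x) %in% c("ASCII", "UTF-8"))
}

utf8   <- enc2utf8("\u00e4")
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
x <- c("abc", utf8, latin1)

print(encoding_sketch(x))
print(all_utf8_sketch(x[1:2]))
print(all_utf8_sketch(x))
```

Treating ASCII separately is useful because pure ASCII strings are valid in UTF-8, latin1, and most native encodings alike, so they never need conversion.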