A simple client package for the Amazon Web Services ('AWS') Simple Storage Service ('S3') 'REST' 'API' <https://aws.amazon.com/s3/>.
aws.s3 is a simple client package for the Amazon Web Services (AWS) Simple Storage Service (S3) REST API. While other packages currently connect R to S3, they do so incompletely (mapping only some of the API endpoints to R) and most implementations rely on the AWS command-line tools, which users may not have installed on their system.
To use the package, you will need an AWS account and to enter your credentials into R. Your keypair can be generated on the IAM Management Console under the heading Access Keys. Note that you only have access to your secret key once. After it is generated, you need to save it in a secure location. New keypairs can be generated at any time if yours has been lost, stolen, or forgotten. The aws.iam package provides tools for working with IAM, including creating roles, users, groups, and credentials programmatically; it is not needed to use IAM credentials.
A detailed description of how credentials can be specified is provided at: https://github.com/cloudyr/aws.signature/. The easiest way is to simply set environment variables on the command line prior to starting R, or via an Renviron.site or .Renviron file, which are used to set environment variables in R during startup (see ?Startup). Or they can be set within R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1",
           "AWS_SESSION_TOKEN" = "mytoken")
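As an alternative to calling Sys.setenv() in each session, the same variables can be placed in the .Renviron file mentioned above so they are set automatically at startup. A minimal sketch of such a file (the values are placeholders, not working credentials):

AWS_ACCESS_KEY_ID = "mykey"
AWS_SECRET_ACCESS_KEY = "mysecretkey"
AWS_DEFAULT_REGION = "us-east-1"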
To use the package with S3-compatible storage provided by other cloud platforms, set the AWS_S3_ENDPOINT environment variable to the appropriate host name. By default, the package uses the AWS endpoint: s3.amazonaws.com.
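For example, the endpoint can be set the same way as the credentials; a minimal sketch (the host name below is a placeholder, not a real endpoint):

Sys.setenv("AWS_S3_ENDPOINT" = "storage.example.com")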
The package can be used to examine publicly accessible S3 buckets and publicly accessible S3 objects without registering an AWS account. If credentials have been generated in the AWS console and made available in R, you can find your available buckets using:
library("aws.s3")
bucketlist()
If your credentials are incorrect, this function will return an error. Otherwise, it will return a list of information about the buckets you have access to.
To get a listing of all objects in a public bucket, simply call
get_bucket(bucket = '1000genomes')
Amazon maintains a listing of Public Data Sets on S3.
To get a listing of all objects in a private bucket, pass your AWS key and secret in as parameters. (As described above, all functions in aws.s3 will look for your keys as environment variables by default, greatly simplifying the process of making an S3 request.)
# specify keys in-line
get_bucket(
  bucket = 'my_bucket',
  key = YOUR_AWS_ACCESS_KEY,
  secret = YOUR_AWS_SECRET_ACCESS_KEY
)

# specify keys as environment variables
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey")
get_bucket("my_bucket")
S3 can be a bit picky about region specifications. bucketlist() will return buckets from all regions, but all other functions require specifying a region. A default of "us-east-1" is relied upon if none is specified explicitly and the correct region can't be detected automatically. (Note: using an incorrect region is one of the most common - and hardest to figure out - errors when working with S3.)
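If automatic detection fails for a bucket outside us-east-1, the region can be passed explicitly to individual calls; a minimal sketch (the bucket name and region below are placeholders):

get_bucket(bucket = "my_bucket", region = "eu-west-1")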
There are eight main functions that will be useful for working with objects in S3:
- s3read_using() provides a generic interface for reading from S3 objects using a user-defined function
- s3write_using() provides a generic interface for writing to S3 objects using a user-defined function
- get_object() returns a raw vector representation of an S3 object. This might then be parsed in a number of ways, such as rawToChar(), xml2::read_xml(), jsonlite::fromJSON(), and so forth, depending on the file format of the object
- save_object() saves an S3 object to a specified local file
- put_object() stores a local file into an S3 bucket
- s3save() saves one or more in-memory R objects to an .Rdata file in S3 (analogously to save()); s3saveRDS() is an analogue for saveRDS()
- s3load() loads one or more objects into memory from an .Rdata file stored in S3 (analogously to load()); s3readRDS() is an analogue for readRDS()
- s3source() sources an R script directly from S3

They behave as you would probably expect:
# save an in-memory R object into S3
s3save(mtcars, bucket = "my_bucket", object = "mtcars.Rdata")

# `load()` R objects from the file
s3load("mtcars.Rdata", bucket = "my_bucket")

# get file as raw vector
get_object("mtcars.Rdata", bucket = "my_bucket")
# alternative 'S3 URI' syntax:
get_object("s3://my_bucket/mtcars.Rdata")

# save file locally
save_object("mtcars.Rdata", file = "mtcars.Rdata", bucket = "my_bucket")

# put local file into S3
put_object(file = "mtcars.Rdata", object = "mtcars2.Rdata", bucket = "my_bucket")
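The generic read/write functions work similarly; a minimal sketch using base R's write.csv() and read.csv() as the user-supplied functions (the bucket and object names are placeholders):

# write a data frame to S3 as CSV via a user-supplied function
s3write_using(mtcars, FUN = write.csv, object = "mtcars.csv", bucket = "my_bucket")

# read it back, parsing with a user-supplied function
df <- s3read_using(FUN = read.csv, object = "mtcars.csv", bucket = "my_bucket")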
This package is not yet on CRAN. To install the latest development version, you can install it from the cloudyr drat repository:
# latest stable version
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))

# on Windows you may need:
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"),
                 INSTALL_opts = "--no-multiarch")
Or, to pull a potentially unstable version directly from GitHub:
if (!require("remotes")) {
  install.packages("remotes")
}
remotes::install_github("cloudyr/aws.s3")
Recent changes to the package include:

- s3write_using() now attaches the correct file extension to the temporary file being written to (just as s3read_using() already did). (#226, h/t @jon-mago)
- s3sync() gains a direction argument allowing for unidirectional (upload-only or download-only) synchronization. The default remains bi-directional.
- put_encryption(), get_encryption(), and delete_encryption() implement bucket-level encryption so that encryption does not need to be specified for each put_object() call. (#183, h/t Dan Tenenbaum)
- s3sync(). (#211, h/t Nirmal Patel)
- put_bucket() only includes a LocationConstraint body when the region != "us-east-1". (#171, h/t David Griswold)
- setup_s3_url(). (#223, h/t Peter Foley)
- s3write_using(). (#205, h/t Patrick Miller)
- The acl argument was ignored by put_bucket(). This is now fixed. (#172)
- The base_url argument in s3HTTP() now defaults to an environment variable (AWS_S3_ENDPOINT) or the AWS S3 default, in order to facilitate using the package with S3-compatible storage. (#189, #191, #194)
- save_object() now uses httr::write_disk() to avoid having to load a file into memory. (#158, h/t Arturo Saco)
- endsWith() in two places to reduce (implicit) base R dependency. (#147, h/t Huang Pan)
- put_object() and put_bucket() now expose explicit acl arguments. (#137)
- get_acl() and put_acl() are now exported. (#137)
- Added a put_folder() convenience function for creating an empty pseudo-folder.
- put_bucket() now errors if the request is unsuccessful. (#132, h/t Sean Kross)
- setup_s3_url() when region = "".
- bucketlist() gains both an alias, bucket_list_df(), and an argument add_region to add a region column to the output data frame.
- s3sync() function. (#20)
- save_object() now creates a local directory if needed before trying to save. This is useful for object keys containing /.
- s3HTTP().
- s3readRDS() and s3saveRDS().
- s3readRDS(). (#59)
- put_object(). (#80)
- tempfile() instead of rawConnection() for high-level read/write functions. (#128)
- get_bucket(). (#88)
- get_object() now returns a pure raw vector (without attributes). (#94)
- s3sync() relies on get_bucket(max = Inf). (#20)
- s3HTTP() gains a base_url argument to (potentially) support S3-compatible storage on non-AWS servers. (#109)
- s3HTTP() gains a dualstack argument to provide "dual stack" (IPv4 and IPv6) support. (#62)
- get_bucket() when max = Inf. (#127, h/t Liz Macfie)
- s3read_using() and s3write_using() provide a generic interface to reading and writing objects from S3 using a specified function. This provides a simple and extensible interface for the import and export of objects (such as data frames) in formats other than those provided by base R. (#125, #99)
- s3HTTP() gains a url_style argument to control use of "path"-style (new default) versus "virtual"-style URL paths. (#23, #118)
- s3save() gains an envir argument. (#115)
- get_bucket() now automatically handles pagination based upon the specified number of objects to return. (PR #104, h/t Thierry Onkelinx)
- get_bucket_df() now uses an available (but unexported) as.data.frame.s3_bucket() method. The resulting data frame always returns character rather than factor columns.
- s3HTTP(). (#46, #106, h/t John Ramey)
- bucketlist() now returns (in addition to past behavior of printing) a data frame of buckets.
- get_bucket_df() returns a data frame of bucket contents; get_bucket() continues to return a list. (#102, h/t Dean Attali)
- s3HTTP() gains a check_region argument (default is TRUE). If TRUE, attempts are made to verify the bucket's region before performing the operation in order to avoid confusing out-of-region errors. (#46)
- Objects can be specified as object = "s3://bucket_name/object_key". In all cases, the bucket name and object key will be extracted from this string (meaning that a bucket does not need to be explicitly specified). (#100; h/t John Ramey)
- get_bucket() S3 generic and methods.
- =). (#64)
- s3save_image() to save an entire workspace.
- Remotes field.
- s3source() as a convenience function to source an R script directly from S3. (#54)
- s3save(), s3load(), s3saveRDS(), and s3readRDS() no longer write to disk, improving performance. (#51)
- s3saveRDS() and s3readRDS(). (h/t Steven Akins, #50)
- Functions now follow a consistent naming format (e.g., get_object()). Previously available functions that did not conform to this format have been deprecated; they continue to work, but issue a warning. (#28)
- The order of bucket and object names was swapped in most object-related functions, and the bucket name has been added to the object lists returned by getbucket(). This means that bucket can be omitted when object is an object of class "s3_object".