Collection of high-level, robust, machine- and OS-independent tools for making deeply reproducible and reusable content in R. This includes lightweight package management (similar to 'packrat' and 'checkpoint', but more flexible and simpler than both), tools for caching, downloading and verifying or writing checksums, post-processing of common spatial datasets, and accessing GitHub repositories. Some features are still under active development.
A set of tools for R that enhance reproducibility beyond package management.
Built on top of archivist, this package aims to provide high-level, robust, machine- and OS-independent tools for making deeply reproducible and reusable content in R.
This extends beyond the package management utilities of checkpoint by including tools for caching and accessing GitHub repositories.
Install from CRAN:
install.packages("reproducible")
Install from GitHub:
#install.packages("devtools")
library("devtools")
install_github("PredictiveEcology/reproducible", dependencies = TRUE) # stable
Install the development version from GitHub:
#install.packages("devtools")
library("devtools")
install_github("PredictiveEcology/reproducible", ref = "development", dependencies = TRUE) # unstable
Known issues: https://github.com/PredictiveEcology/reproducible/issues
Cache saving to SQLite database, via options("reproducible.futurePlan"), if the future package is installed. This is
If a do.call function is Cached, it was previously labelled in the database as do.call. Now Cache attempts to extract the actual function being called by the do.call. Messaging is similarly changed.
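One way to recover the function behind a do.call is to match the call against the arguments of do.call and read off `what`. The following is a minimal base-R sketch of that idea. The helper name `called_function_name` is hypothetical, and this is not necessarily how the package does it internally.

```r
# Hypothetical helper: report the function a call really invokes,
# looking through do.call() when it is the outer function.
called_function_name <- function(call) {
  stopifnot(is.call(call))
  if (identical(call[[1]], as.name("do.call"))) {
    # Match the arguments against do.call(what, args, ...) so that
    # `what` is found whether it was passed by position or by name.
    matched <- match.call(do.call, call)
    what <- matched[["what"]]
    if (is.character(what)) what else deparse(what)
  } else {
    deparse(call[[1]])
  }
}

called_function_name(quote(do.call(rnorm, list(n = 3))))   # "rnorm"
called_function_name(quote(do.call("mean", list(1:10))))   # "mean"
called_function_name(quote(sum(1, 2)))                     # "sum"
```

Labelling a cache entry with the inner function name keeps entries for different functions distinguishable even when they are all invoked through do.call.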
reproducible.ask, logical, indicating whether clearCache should ask for deletions when in an interactive session
dlFun, to pass a custom function for downloading (e.g., "raster::getData")
prepInputs will automatically use readRDS if the file is a
prepInputs will return a
fun = "base::load", with a message; can still pass an envir to obtain standard behaviour of
clearCache -- new argument
assessDataType, used in postProcess, to identify smallest datatype for Raster* objects, if user does not pass an explicit
git2r update (@stewid, #36).
.prepareRasterBackedFile -- now appends an incremented numeric to a cached copy of a file-backed Raster object, if it already exists. This mirrors the behaviour of the .rda file. Previously, if two Cache events returned the same file name backing a Raster object, even if the content was different, it would allow the same file name. If either cached object was deleted, the other would break because its file backing would be missing.
spades.XXX and should have been
copyFile did not perform correctly under all cases; now better handling of these cases, often sending to file.copy (slower, but more reliable)
extractFromArchive needed a new Checksum function call under some circumstances
extractFromArchive -- when dealing with nested zips, not all args were passed in recursively (#37, @CeresBarros)
prepInputs -- arguments that were same as Cache were not being correctly passed internally to Cache, and if wrapped in Cache, it was not passed into prepInputs. Fixed.
.prepareFileBackedRaster was failing in some cases (specifically if it was inside a do.call) (#40, @CeresBarros).
Cache was failing under some cases of Cache(do.call, ...). Fixed.
Cache -- when arguments to Cache were the same as the arguments in FUN, Cache would "take" them. Now, they are correctly passed to the
preProcess -- writing to checksums may have produced a warning if CHECKSUMS.txt was not present. Now it does not.
convertRasterPaths to assist with renaming moved files.
prepInputs -- new features
alsoExtract now has more options (
"similar") and defaults to extracting all files in an archive (
postProcess altogether if no
rasterToMatch. Previously, this would invoke Cache even if there was nothing to
copyFile correctly handles directory names containing spaces.
makeMemoisable fixed to handle additional edge cases.
prepInputs to aid in data downloading and preparation problems, solved in a reproducible, Cache-aware way.
postProcess which is a wrapper for sequences of several other new functions (
downloadFile can handle Google Drive and ftp/http(s) files
compareNA does comparisons with NA as a possible value e.g., compareNA(c(1,NA), c(2, NA)) returns
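To illustrate what an NA-aware comparison involves, here is a minimal base-R sketch. The function name `compare_na` is hypothetical and stands in for the idea only; it should not be read as the package's own implementation. It treats two NA values in the same position as equal, and an NA paired with a non-NA value as unequal.

```r
# Hypothetical NA-aware equality: NA matches NA, NA never matches a value.
compare_na <- function(v1, v2) {
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  # Any remaining NA means exactly one side was NA: treat as not equal.
  same[is.na(same)] <- FALSE
  same
}

compare_na(c(1, NA), c(2, NA))      # FALSE  TRUE
compare_na(c(1, NA, 3), c(1, 2, 3)) # TRUE  FALSE  TRUE
```

Unlike `==`, the result never contains NA, so it can be used directly in `if`, `all()` or subsetting.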
Cache -- new features:
verbose which can help with debugging
useCache which allows turning caching on and off at a high level (e.g., options("useCache"))
cacheId which allows user to hard code a result from a Cache
Cache function calls, unless explicitly set on the inner functions
userTags added automatically to cache entries so much more powerful searching via
checksums now returns a data.table with the same columns whether write = TRUE or write = FALSE.
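The idea of one function that either writes or verifies checksums, and returns the same columns in both modes, can be sketched with base R and `tools::md5sum`. Everything below is an assumption for illustration: the helper `checksum_files`, its tab-separated file layout and its column names are hypothetical and do not reflect the package's actual file format or return value.

```r
# Hypothetical helper: compute MD5 checksums for `files`. With write = TRUE
# the table is saved to `checksum_file`; otherwise it is compared against
# the saved table. Both modes return the same three columns.
checksum_files <- function(files, checksum_file, write = FALSE) {
  current <- data.frame(
    file = basename(files),
    checksum = unname(tools::md5sum(files)),
    stringsAsFactors = FALSE
  )
  if (write) {
    write.table(current, checksum_file, sep = "\t",
                row.names = FALSE, quote = FALSE)
    current$result <- "OK"
  } else {
    stored <- read.table(checksum_file, header = TRUE, sep = "\t",
                         quote = "", comment.char = "",
                         colClasses = "character")
    expected <- stored$checksum[match(current$file, stored$file)]
    current$result <- ifelse(!is.na(expected) & expected == current$checksum,
                             "OK", "FAIL")
  }
  current
}

# Usage: write once, verify later.
data_file <- tempfile(fileext = ".txt")
writeLines("hello", data_file)
sums_file <- tempfile(fileext = ".txt")
written  <- checksum_files(data_file, sums_file, write = TRUE)
verified <- checksum_files(data_file, sums_file)
```

Returning an identical structure in both modes means downstream code does not need to branch on whether checksums were written or only checked.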
showCache now give messages and require user intervention if a request to clearCache would delete large quantities of data
memoise::memoise now used on 3rd run through an identical Cache call, dramatically speeding up in most cases
asPath has a new argument indicating how deep the path should be considered when included in caching (only relevant when quick = TRUE)
New vignette on using Cache
parallel-safe, meaning there is a tryCatch around every attempt at writing to the SQLite database so it can be used safely on multi-threaded machines
bug fixes, unit tests, more imports for packages e.g.,
updates for R 3.6.0 compact storage of sequence vectors
experimental pipes (
%C%) and assign
several performance enhancements
mergeCache: a new function to merge two different Cache repositories
memoise::memoise is now used on loadFromLocalRepo, meaning that the 3rd time Cache() is run on the same arguments (and the 2nd time in a session), the returned Cache will be from a RAM object via memoise. To stop this behaviour and use only disk-based Caching, set options(reproducible.useMemoise = FALSE).
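The "compute, then disk, then RAM" progression described above can be sketched in base R. This is an illustration of the general pattern only: `make_cached` is a hypothetical helper, it keys on an MD5 of the serialized arguments (an assumption), and it does not use the package's SQLite repository or memoise itself.

```r
# Hypothetical two-tier cache: RAM first, then disk, then compute.
make_cached <- function(fun, cache_dir = tempfile("cache_")) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  ram <- new.env(parent = emptyenv())
  function(...) {
    # Key the call on its serialized arguments (uncompressed, so the
    # bytes depend only on the arguments).
    arg_file <- tempfile()
    on.exit(unlink(arg_file))
    saveRDS(list(...), arg_file, compress = FALSE)
    key <- unname(tools::md5sum(arg_file))
    disk_file <- file.path(cache_dir, paste0(key, ".rds"))

    if (exists(key, envir = ram, inherits = FALSE)) {
      source <- "memory"
      value <- get(key, envir = ram)
    } else if (file.exists(disk_file)) {
      source <- "disk"
      value <- readRDS(disk_file)
      assign(key, value, envir = ram)  # memoise only after a disk hit
    } else {
      source <- "computed"
      value <- fun(...)
      saveRDS(value, disk_file)
    }
    list(value = value, source = source)
  }
}

cached_square <- make_cached(function(x) x^2)
cached_square(4)$source  # "computed"
cached_square(4)$source  # "disk"
cached_square(4)$source  # "memory"
```

Storing in RAM only after the first disk read mirrors the behaviour described above, where the in-memory copy is used from the 3rd run onward; dropping the `assign()` line would correspond to disk-only caching.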
Cache assign -- %<% can be used instead of normal assign, equivalent to lhs <- Cache(rhs).
new option: reproducible.verbose, set to FALSE by default; setting it to TRUE may help in understanding caching behaviour, especially for complex, highly nested code.
all options now described in
All Cache arguments other than FUN and ... will now propagate to internal, nested Cache calls, if they are not specified explicitly in each of the inner Cache calls.
Cached pipe operator %C% -- use to begin a pipe sequence, e.g., Cache() %C% ...
sideEffect can now be a path
digestPathContent default changed from FALSE (was for speed) to TRUE (for content accuracy)
searchFull, which shows the full search path, known alternatively as "scope", or "binding environments". It is where R will search for a function when requested by a user.
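The search path idea can be explored with base R alone by walking the chain of enclosing environments from the global environment up to the empty environment. The helper `search_path_names` below is hypothetical and only illustrates the concept; it is not the package's searchFull.

```r
# Hypothetical helper: names of every environment R consults, in order,
# when looking up a symbol starting from `start`.
search_path_names <- function(start = globalenv()) {
  env <- start
  out <- character()
  repeat {
    out <- c(out, environmentName(env))
    if (identical(env, emptyenv())) break
    env <- parent.env(env)
  }
  out
}

path_names <- search_path_names()
head(path_names, 1)  # "R_GlobalEnv"
tail(path_names, 2)  # "base" "R_EmptyEnv"
```

The entries between the global environment and base are the attached packages, which is why the order of `library()` calls affects which function a bare name resolves to.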
memoise::memoise for several functions (
available.packages) for speed -- this increases memory use in exchange for speed.
require on those 20 packages, but require does not check for dependencies and deal with them if missing: it just errors. This speed should be fast enough for many purposes.
dplyr from Imports
RCurl to Imports
change name of
digestRaster affecting in-memory rasters