Flat text files provide a robust, compressible, and portable way to store tables from databases. This package provides convenient functions for exporting tables from relational database connections into compressed text files and streaming those text files back into a database without requiring the whole table to fit in working memory.
The goal of arkdb is to provide a convenient way to move data from large compressed text files (tsv, csv, etc.) into any DBI-compliant database connection (e.g. MySQL, Postgres, SQLite; see DBI), and to move tables out of such databases into text files. The key feature of arkdb is that data is moved between databases and text files in chunks of a fixed size, allowing the package functions to work with tables that would be much too large to read into memory all at once.
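The chunked approach can be illustrated with a minimal sketch in base R (this is not arkdb's internal implementation, just the general idea; `"my_table.tsv.bz2"` and `process_chunk()` are hypothetical stand-ins):

```r
# Read a fixed number of lines at a time so the full table
# never has to fit in memory. R's file() connection
# transparently decompresses .bz2 when opened for reading.
con <- file("my_table.tsv.bz2", "r")
header <- readLines(con, n = 1)
repeat {
  chunk <- readLines(con, n = 50000)
  if (length(chunk) == 0) break
  process_chunk(chunk)  # e.g. parse and append to a database table
}
close(con)
```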
You can install arkdb from GitHub with:
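For example, using the remotes package (a sketch, assuming the repository is `ropensci/arkdb`):

```r
# install.packages("remotes")
remotes::install_github("ropensci/arkdb")
```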
```r
library(arkdb)

# additional libraries just for this demo
library(dbplyr)
library(dplyr)
library(fs)
```
Consider the nycflights database in SQLite:
```r
tmp <- tempdir() # Or can be your working directory, "."
db <- dbplyr::nycflights13_sqlite(tmp)
#> Caching nycflights db at /var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T//Rtmp3oFbR8/nycflights13.sqlite
#> Creating table: airlines
#> Creating table: airports
#> Creating table: flights
#> Creating table: planes
#> Creating table: weather
```
Create an archive of the database:
```r
dir <- fs::dir_create(fs::path(tmp, "nycflights"))
ark(db, dir, lines = 50000)
#> Exporting airlines in 50000 line chunks:
#> ...Done! (in 0.00859499 secs)
#> Exporting airports in 50000 line chunks:
#> ...Done! (in 0.03927302 secs)
#> Exporting flights in 50000 line chunks:
#> ...Done! (in 27.67571 secs)
#> Exporting planes in 50000 line chunks:
#> ...Done! (in 0.2452919 secs)
#> Exporting weather in 50000 line chunks:
#> ...Done! (in 2.398423 secs)
```
Import a list of compressed tabular files (i.e. `*.csv.bz2`) into a local SQLite database:
```r
files <- fs::dir_ls(dir)
new_db <- src_sqlite(fs::path(tmp, "local.sqlite"), create = TRUE)
unark(files, new_db, lines = 50000)
#> Importing airlines.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 0.04621601 secs)
#> Importing airports.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 0.146152 secs)
#> Importing flights.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 17.44235 secs)
#> Importing planes.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 0.06386805 secs)
#> Importing weather.tsv.bz2 in 50000 line chunks:
#> ...Done! (in 1.224544 secs)

new_db
#> src:  sqlite 3.22.0 [/var/folders/y8/0wn724zs10jd79_srhxvy49r0000gn/T/Rtmp3oFbR8/local.sqlite]
#> tbls: airlines, airports, flights, planes, weather
```
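To check the round trip, you can query the new database with dplyr as usual (a sketch; the `flights` table name comes from the output above):

```r
# Count rows in the re-imported flights table without
# loading the table into memory
tbl(new_db, "flights") %>%
  summarise(n = n())
```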
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
- Fixed a bug in which the `keep-open` method would cut off header names for Postgres connections (due to variation in the behavior of SQL queries with `LIMIT 0`). The issue is now resolved by accessing the header in a more robust, general way.
- `unark()` will strip out non-compliant characters in table names by default.
- `unark()` gains the optional argument `tablenames`, allowing the user to specify the corresponding table names manually, rather than requiring that they correspond with the incoming file names. #18
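For instance (a sketch, reusing `files` and `new_db` from the demo above; the table names are illustrative):

```r
# Override the table names that would otherwise be derived
# from the file names
unark(files, new_db, lines = 50000,
      tablenames = c("airlines", "airports", "flights", "planes", "weather"))
```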
- `unark()` gains the argument `encoding`, allowing users to directly set the encoding of incoming files. Previously this could only be done by setting `options(encoding)`, which will still work as well. See the examples for an illustration.
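For example (a sketch; `"latin1"` is an assumed encoding chosen purely for illustration):

```r
# Declare the encoding of the incoming text files directly,
# instead of via options(encoding = "latin1")
unark(files, new_db, lines = 50000, encoding = "latin1")
```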
- `unark()` will now attempt to guess which streaming parser to use (e.g. `tsv`) based on the file extension pattern, rather than defaulting to a `tsv` parser. (`ark()` still defaults to exporting in the more portable `tsv` format.)