Lightweight Access to the 'Geospatial Data Abstraction Library' ('GDAL')

Provides low-level access to 'GDAL' functionality for R packages. The aim is to minimize the level of interpretation put on the 'GDAL' facilities, to enable direct use of it for a variety of purposes. 'GDAL' is the 'Geospatial Data Abstraction Library' a translator for raster and vector geospatial data formats that presents a single raster abstract data model and single vector abstract data model to the calling application for all supported formats < http://gdal.org/>. Other available packages 'rgdal' and 'sf' also provide access to the 'GDAL' library, but neither can be used for these lower level tasks, and both do many other tasks.


Build_Status AppVeyor build statusCoverage_Status CRAN_Status_Badge

Overview

The vapour package provides access to the basic read functions available in GDALfor both raster and a vector data sources.

The functions are deliberately lower-level than these data models and provide access to the component entities independently.

For vector data vapour provides:

  • read access to feature attributes.
  • read access to raw binary geometry.
  • read access to geometry in text forms (GeoJSON, WKT, GML, KML).
  • read access to the extent, or bounding box, of feature geometries.

All vector/feature read tasks can optionally apply OGRSQL to a layer prior to data extraction.

For raster data vapour provides:

  • read access to the list of available rasters within a collection source (subdatasets).
  • read access to structural metadata for individual raster sources.
  • read access for raw data using GDAL's RasterIO framework and its dynamic image decimation / replication resampling algorithms.

The workflows available are intended to support development of applications in R for these vector and raster data without being constrained to any particular data model.

Installation

The package can be installed from Github.

devtools::install_github("hypertidy/vapour", build_vignettes = TRUE)

You will need development tools for building R packages.

On Linux and MacOS building also requires an available GDAL installation, but on Windows the ROpenSci rwinlib tools are used and the required GDAL will be downloaded and used when building the package. This installation is self-contained and only affects the use of R, it can be used alongside other applications using GDAL.

Purpose

The goal of vapour is to provide a basic GDAL API package for R. The key functions provide vector geometry or attributes and raster data and raster metadata.

The priority is to give low-level access to key functionality rather than comprehensive coverage of the library. The real advantage of vapour is the flexibility of a modular workflow, not the outright efficiency.

A parallel goal is to be freed from the powerful but sometimes limiting high-level data models of GDAL itself, specifically these are simple features and affine-based regular rasters composed of 2D slices. (GDAL will possibly remove these limitations over time but still there will always be value in having modularity in an ecosystem of tools.)

These loftier general needs have come out of smaller more concrete goals, one was access to the "attributes-only" capacity of GDAL as a virtual database engine, and another access to the dense structures provided by transport vector data. GDAL's dynamic resampling of arbitrary raster windows is also very useful for interactive tools on local data, and seems under-utilized in favour of less accessible online image services.

This partly draws on work done in the sf package and in packages rgdal and rgdal2. I'm amazed that something as powerful and general as GDAL is still only available through these lenses, but recentish improvements make things much easier to use and share. Specifically Rcpp means that access to external libs is simplified, easier to learn and easier to get started and make progress. The other part is that cross-platform support is now much better, with more consistency on the libraries available on the CRAN machines and in other contexts.

Warnings

It's possible to give problematic "SELECT" statements via the sql argument. Note that the geometry readers vapour_read_geometry, vapour_read_geometry_text, and vapour_read_extent will strip out the SELECT ... FROM clause and replace it with SELECT * FROM to ensure that the geometry is accessible, though the attributes are ignored. This means we can allow the user or dplyr to create any SELECT statement. The function vapour_read_geometry will return a list of NULLs, in this case.

I haven't tried this against a real database, I'm not sure if we need AsBinary() around EWKB geoms, for example - but at any rate these can be ingested by sf.

Examples

The package documentation page gives an overview of available functions.

help("vapour-package")

See the vignettes and documentation for examples WIP.

The following concrete example illustrates the motivation for vapour, through a timing benchmark for one standard operation: extracting feature geometries from a data set within a user-defined bounding box. The data set is the one used throughout the book Geocomputation in R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow, and can be obtained with the following code.

url <- file.path ("http://www.naturalearthdata.com/http//www.naturalearthdata.com",
                "download/10m/cultural/ne_10m_parks_and_protected_lands.zip")
download.file (url = url, destfile = "USA_parks.zip")
unzip (zipfile = "USA_parks.zip", exdir = "usa_parks")
fname <- "usa_parks/ne_10m_parks_and_protected_lands_area.shp"

That last fname is the file we're interested in, which contains polygons for all United States parks. We now construct a timing benchmark for three ways of extracting the data within a pre-defined bounding box of:

library (magrittr)
bb <- c (-120, 20, -100, 40)
names (bb) <- c ("xmin", "ymin", "xmax", "ymax")
bb_sf <- sf::st_bbox (bb, crs = sf::st_crs (4326)) %>%
    sf::st_as_sfc ()

First, we define a function to do the desired extraction using the sf package, comparing both st_crop and the sf::[ sub-selection operator:

f_sf1 <- function (fname)
{
    usa_parks <- sf::st_read (fname, quiet = TRUE)
    suppressMessages (suppressWarnings (
                    parks2 <- sf::st_crop (usa_parks, bb_sf)
                    ))
}
f_sf2 <- function (fname)
{
    usa_parks <- sf::st_read (fname, quiet = TRUE)
    suppressMessages (suppressWarnings (
                    parks2 <- usa_parks [bb_sf, ]
                    ))
}

Then three approaches using vapour, in each case extracting equivalent data to sf - that is, both geometries and attributes - yet simply leaving them separate here. The three approaches are:

  1. Read geometry as WKB, sub-select after reading, and convert to sfc lists;
  2. Read geometry as WKB pre-selected via an SQL statement, and convert to sfc lists; and
  3. Read geometry as json (text), pre-selected via SQL, and convert with the geojsonsf package
library (vapour)
f_va1 <- function (fname) # read all then sub-select
{
    ext <- do.call (rbind, vapour_read_extent (fname)) # bboxes of each feature
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    g <- vapour_read_geometry (fname) [indx] %>%
        sf::st_as_sfc ()
    a <- lapply (vapour_read_attributes (fname), function (i) i [indx])
}
f_va2 <- function (fname) # read selection only via SQL
{
    ext <- do.call (rbind, vapour_read_extent (fname))
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
    stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
                    " WHERE FID in (", n, ")")
    g <- vapour_read_geometry (fname, sql = stmt) %>%
        sf::st_as_sfc ()
    a <- vapour_read_attributes (fname, sql = stmt)
}
f_va3 <- function (fname) # convert json text via geojsonsf
{
    ext <- do.call (rbind, vapour_read_extent (fname)) # bboxes of each feature
    indx <- which (ext [, 1] > bb [1] & ext [, 2] < bb [3] &
                   ext [, 3] > bb [2] & ext [, 4] < bb [4])
    n <- paste0 (vapour_read_names (fname) [indx], collapse = ",") # GDAL FIDs
    stmt <- paste0 ("SELECT FID FROM ", vapour_layer_names (fname),
                    " WHERE FID in (", n, ")")
    g <- vapour_read_geometry_text (fname, textformat = "json", sql = stmt) 
    g <- lapply (g, function (i) geojsonsf::geojson_sfc (i) [[1]]) %>%
        sf::st_sfc ()
    a <- vapour_read_attributes (fname, sql = stmt)
}

The benchmark timings - in particular the "relative" values - then illustrate the advantages of vapour:

rbenchmark::benchmark (
                       f_sf1 (fname),
                       f_sf2 (fname),
                       f_va1 (fname),
                       f_va2 (fname),
                       f_va3 (fname),
                       replications = 10)
#>           test replications elapsed relative user.self sys.self user.child
#> 1 f_sf1(fname)           10   0.307    3.936     0.300    0.004          0
#> 2 f_sf2(fname)           10   0.198    2.538     0.196    0.004          0
#> 3 f_va1(fname)           10   0.078    1.000     0.072    0.008          0
#> 4 f_va2(fname)           10   0.088    1.128     0.084    0.004          0
#> 5 f_va3(fname)           10   0.205    2.628     0.200    0.008          0
#>   sys.child
#> 1         0
#> 2         0
#> 3         0
#> 4         0
#> 5         0

Reading geometries only, as opposed to the sf reading of all geometries and attributes, affords a speed increase of about 25%, while utilizing the SQL capabilities of ogr_sql offers an increase of around 75%.

Context

Examples of packages that use vapour are in development, RGDALSQL and lazyraster. RGDALSQL aims to leverage the facilities of GDAL to provide data on-demand for many sources as if they were databases. lazyraster uses the level-of-detail facility of GDAL to read just enough resolution from a raster source using traditional window techniques.

Limitations, work-in-progress and other discussion are active here: https://github.com/hypertidy/vapour/issues/4

We've kept a record of a minimal GDAL wrapper package here:

https://github.com/diminutive/gdalmin

Before those I had worked on getting sp and dplyr to at least work together https://github.com/dis-organization/sp_dplyrexpt and recently rgdal was updated to allow tibbles to be used, something that spbabel and spdplyr really needed to avoid friction.

Early exploration of allow non-geometry read with rgdal was tried here: https://github.com/r-gris/gladr

Thanks to Edzer Pebesma and Roger Bivand and Tim Keitt for prior art that I crib and copy from. Jeroen Ooms helped the R community hugely by providing an accessible build process for libraries on Windows. Mark Padgham helped kick me over a huge obstacle in using C++ libraries with R. Simon Wotherspoon and Ben Raymond have endured ravings about wanting this level of control for many years.

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

News

vapour 0.2.0

  • Fixed incorrect use of C-level printf() and unnecessary tests thanks to CRAN feedback.

  • Add 'type' to output of vapour_geom_summary(), the integer code for each geometry type.

  • Restructured conversion to text geometries to properly free up memory, was causing memory leaks (found with valgrind).

  • Use of sql now correctly uses 'GDALDataset::ReleaseResultsSet()' in each applicable function. Query with the 'sql' argument was not being executed in vapour_layer_names() but now is - only has utility for insert and drop queries, so will be rarely used and probably never had any impact before.

  • Raster read gains a new argument native = FALSE to control use of the native window without specifying it. If native = TRUE then the native dimensions are used and read in full.

  • Vector read functions gain new extent argument to apply a spatial filter in conjunction with the sql argument, per discussion (#34). The extent can be of type sp, sf, raster, or generic vector c(xmin, xmax, ymin, ymax). The extent is ignored if the sql argument is not specified, with a warning. Applies to vapour_geom_summary(), vapour_read_attributes(), vapour_read_extent(), vapour_read_geometry(), vapour_read_geometry_text(), and vapour_read_names(). Thanks to Joseph Stachelek for the suggestion.

  • Vector read functions gain new limit_n and skip_n arguments to specify a sequential set of features to be read, added to vapour_geom_summary(),
    vapour_read_attributes(), vapour_read_extent(), vapour_read_geometry(), vapour_read_geometry_text(), and vapour_read_names(). Their effect occurs after that of the sql argument.

  • New function vapour_driver() to determine chosen driver (vector, short name) of the data source.

  • New argument min_max to allow fast default use of vapour_raster_info(), as per (#50).

  • New function vapour_raster_gcp() to return GCP (ground control points) if present.

  • New function vapour_report_attributes() to return a string description of the internal GDAL type for data attributes wish of Grant Williamson.

  • Attribute types 'OFTInteger64' are now returned as doubles, 'Date' and 'DateTime' are returned as strings.

  • New functions vapour_gdal_version() and vapour_all_drivers() to return information about the GDAL library in use.

  • Function vapour_read_names() is now implemented using the library API, rather than via OGRSQL. All trace of the OGRSQL FID workaround and warning has been removed.

  • New function vapour_geom_summary() to return basic identity, validity, and extent of feature geometries.

  • Values for the extent (bounding box) of a feature are now set to NA values for an empty or missing geometry.

  • Fix drastic memory leak in vapour_read_attributes(), thanks to Grant Williamson.

vapour 0.1.0

  • Functions that include a layer index now accept a layer name which is used to find the (0-based) layer index.

  • This is a breaking release, all old functions have been made defunct or removed entirely.

vapour 0.0.1

  • All C++ functions now have explicit R wrappers instead of Rcpp exports.

  • New naming convention uses vapour_ for vector sources, raster_ for raster sources to make function names a little more consistent.

  • New wrapper vapour_sds_names to replace sds_info and raster_sds_info which are now deprecated.

  • New wrapper vapour_read_geometry_what to replace vapour_read_feature_what which i is now deprecated.

  • New function vapour_read_names to return the vector of FID values.

  • The read for raster data now returns numeric or integer appropriately.

  • vapour_raster_info now includes bands as the count of available bands.

  • The IO read now allows a 4-element window to return data at native resolution, by copying the third and fourth values (source dimension) to the fifth and sixth values (output dimension) respectively.

  • Subdatasets are now supported.

  • Added sanity check behaviour to vapour_raster_read to avoid out of bounds read attempts.

  • Resampling options added to raster data read.

  • Upgraded to rwinlib gdal 2.2.3.

  • added function vapour_layer_names to return character layer names, and implicitly provide a layer count

  • feature iteration now avoids "GetFeatureCount" and collects each element in a growing list, thanks to Jim Hester

  • renamed format argument to textformat, this is ignored unless what = "text"

  • Created a single C++ feature read to remove repeated logic, vapour_read_geometry, vapour_read_geometry_text, and vapour_read_extent all call the same function.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("vapour")

0.2.0 by Michael Sumner, 3 months ago


https://github.com/hypertidy/vapour


Report a bug at https://github.com/hypertidy/vapour/issues


Browse source code at https://github.com/cran/vapour


Authors: Michael Sumner [aut, cre] , Simon Wotherspoon [ctb] (figured out the mechanism for the resampling algorithm) , Mark Padgham [ctb] (helped get started :)) , Edzer Pebesma [ctb] (wrote allocate_attribute , copied here from sf) , Roger Bivand [ctb] (wrote configure.ac , copied here from rgdal) , Jim Hester [ctb] (wrote CollectorList.h , copied here from fs package) , Timothy Keitt [ctb] (wrote GetPointsInternal copied here from rgdal2 package) , Jeroen Ooms [ctb] (tweaked build process)


Documentation:   PDF Manual  


GPL-3 license


Imports Rcpp, utils

Suggests covr, dplyr, geojsonsf, testthat, knitr, rbenchmark, rmarkdown, spelling

Linking to Rcpp

System requirements: GDAL (>= 2.0.0), PROJ.4 (>= 4.8.0)


See at CRAN