A programmatic interface to many species occurrence data sources, including the Global Biodiversity Information Facility ('GBIF'), USGS's Biodiversity Information Serving Our Nation ('BISON'), 'iNaturalist', 'eBird', Integrated Digitized Biocollections ('iDigBio'), 'VertNet', Ocean Biogeographic Information System ('OBIS'), and Atlas of Living Australia ('ALA'). Includes functionality for retrieving species occurrence data and combining those data.
spocc = SPecies OCCurrence data
At rOpenSci, we have been writing R packages to interact with many sources of species occurrence data, including GBIF, VertNet, BISON, iNaturalist, the Berkeley ecoengine, and eBird. Other databases are out there as well, which we can pull in. spocc is an R package to query and collect species occurrence data from many sources. The goal is to create a seamless search experience across data sources, as well as unified outputs across data sources.
spocc currently interfaces with nine major biodiversity repositories:
Global Biodiversity Information Facility (GBIF) (via rgbif)
GBIF is a government funded open data repository with several partner organizations with the express goal of providing access to data on Earth's biodiversity. The data are made available by a network of member nodes, coordinating information from various participant organizations and government agencies.
Berkeley Ecoengine (via ecoengine)
The ecoengine is an open API built by the Berkeley Initiative for Global Change Biology. The repository provides access to over 3 million specimens from various Berkeley natural history museums. These data span more than a century and provide access to georeferenced specimens, species checklists, photographs, vegetation surveys and resurveys and a variety of measurements from environmental sensors located at reserves across University of California's natural reserve system.
iNaturalist
iNaturalist provides access to crowd-sourced citizen science data on species observations.
VertNet (via rvertnet)
Similar to rgbif, ecoengine, and rbison (see below), VertNet provides access to more than 80 million vertebrate records spanning a large number of institutions and museums, primarily covering four major disciplines (mammalogy, herpetology, ornithology, and ichthyology). Note that we don't currently support VertNet data in this package, but we should soon.
Biodiversity Information Serving Our Nation (via rbison)
Built by the US Geological Survey's core science analytic team, BISON is a portal that provides access to species occurrence data from several participating institutions.
eBird (via rebird)
eBird is a database developed and maintained by the Cornell Lab of Ornithology and the National Audubon Society. It provides real-time access to checklist data, data on bird abundance and distribution, and community reports from birders.
iDigBio (via ridigbio)
iDigBio facilitates the digitization of biological and paleobiological specimens and their associated data. It houses specimen data and provides access to it via RESTful web services.
OBIS
OBIS (Ocean Biogeographic Information System) allows users to search marine species datasets from all of the world's oceans.
Atlas of Living Australia
ALA (Atlas of Living Australia) contains information on all the known species in Australia, aggregated from a wide range of data providers: museums, herbaria, community groups, government departments, individuals and universities. It contains more than 50 million occurrence records.
The inspiration for this comes from users requesting a more seamless experience across data sources, and from our work on a similar package for taxonomy data (taxize).
BEWARE: In cases where you request data from multiple providers, especially when including GBIF, there could be duplicate records since many providers' data eventually ends up with GBIF. See ?spocc_duplicates, after installation, for more.
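One rough way to spot the duplicates mentioned above is to compare records on name, coordinates, and date once results from several providers have been combined. The sketch below uses a small made-up data frame with the columns that occ2df() is shown to return later in this README; the matching rule is an assumption for illustration, not spocc's own logic, and ?spocc_duplicates remains the authoritative guidance.

```r
# Hypothetical combined table; column names follow the occ2df() output
# (name, longitude, latitude, prov, date, key). The rows are made up.
dat <- data.frame(
  name      = rep("Accipiter striatus", 4),
  longitude = c(-104.1, -104.1, -98.6, -74.1),
  latitude  = c(20.7, 20.7, 33.8, 40.1),
  prov      = c("gbif", "inat", "gbif", "gbif"),
  date      = as.Date(c("2018-01-20", "2018-01-20", "2018-01-19", "2018-03-14")),
  key       = c("1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

# Assumption: records sharing name, coordinates, and date are likely
# duplicates, regardless of which provider returned them.
id_cols <- c("name", "longitude", "latitude", "date")
is_dup  <- duplicated(dat[, id_cols])

# Keep the first occurrence of each record and count what was dropped.
deduped   <- dat[!is_dup, ]
n_dropped <- sum(is_dup)
```

Exact matching on coordinates is strict; providers may round coordinates differently, so real data may need a tolerance or the dedicated tools in other packages.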
See CONTRIBUTING.md
Stable version from CRAN
install.packages("spocc", dependencies = TRUE)
Or the development version from GitHub
install.packages("devtools")
devtools::install_github("ropensci/spocc")
library("spocc")
Get data from GBIF
(out <- occ(query = 'Accipiter striatus', from = 'gbif', limit = 100))
#> Searched: gbif
#> Occurrences - Found: 737,289, Returned: 100
#> Search type: Scientific
#> gbif: Accipiter striatus (100)
Just gbif data
out$gbif
#> Species [Accipiter striatus (100)]
#> First 10 rows of [Accipiter_striatus]
#>
#> # A tibble: 100 x 87
#>   name  longitude latitude prov  issues    key datasetKey publishingOrgKey
#>   <chr>     <dbl>    <dbl> <chr> <chr>   <int> <chr>      <chr>
#> 1 Acci…    -104.      20.7 gbif  cdrou… 1.81e9 50c9509d-… 28eb1a3f-1c15-4…
#> 2 Acci…     -98.6     33.8 gbif  cdrou… 1.81e9 50c9509d-… 28eb1a3f-1c15-4…
#> 3 Acci…     -74.1     40.1 gbif  cdrou… 1.81e9 50c9509d-… 28eb1a3f-1c15-4…
#> 4 Acci…    -122.      38.0 gbif  cdrou… 1.80e9 50c9509d-… 28eb1a3f-1c15-4…
...
Get fine-grained control over each data source by passing parameters on to the underlying package, rgbif in this example.
(out <- occ(query = 'Setophaga caerulescens', from = 'gbif', gbifopts = list(country = 'US')))
#> Searched: gbif
#> Occurrences - Found: 239,219, Returned: 500
#> Search type: Scientific
#> gbif: Setophaga caerulescens (500)
Get just gbif data
out$gbif
#> Species [Setophaga caerulescens (500)]
#> First 10 rows of [Setophaga_caerulescens]
#>
#> # A tibble: 500 x 108
#>    name  longitude latitude prov  issues    key datasetKey publishingOrgKey
#>    <chr>     <dbl>    <dbl> <chr> <chr>   <int> <chr>      <chr>
#>  1 Seto…     -80.3     25.7 gbif  cdrou… 1.81e9 50c9509d-… 28eb1a3f-1c15-4…
#>  2 Seto…     -80.3     25.8 gbif  cdrou… 1.81e9 50c9509d-… 28eb1a3f-1c15-4…
#>  3 Seto…     -81.4     28.6 gbif  cdrou… 1.84e9 50c9509d-… 28eb1a3f-1c15-4…
#>  4 Seto…     -77.3     39.0 gbif  cdrou… 1.84e9 50c9509d-… 28eb1a3f-1c15-4…
#>  5 Seto…     -83.2     41.6 gbif  cdrou… 1.88e9 50c9509d-… 28eb1a3f-1c15-4…
#>  6 Seto…     -74.0     40.8 gbif  cdrou… 1.84e9 50c9509d-… 28eb1a3f-1c15-4…
#>  7 Seto…     -80.8     35.5 gbif  cdrou… 1.85e9 50c9509d-… 28eb1a3f-1c15-4…
#>  8 Seto…     -97.2     26.1 gbif  cdrou… 1.84e9 50c9509d-… 28eb1a3f-1c15-4…
#>  9 Seto…     -80.3     25.8 gbif  cdrou… 1.85e9 50c9509d-… 28eb1a3f-1c15-4…
#> 10 Seto…     -77.1     38.9 gbif  cdrou… 1.84e9 50c9509d-… 28eb1a3f-1c15-4…
#> # ... with 490 more rows, and 100 more variables: networkKeys <list>,
#> #   installationKey <chr>, publishingCountry <chr>, protocol <chr>,
#> #   lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
#> #   basisOfRecord <chr>, taxonKey <int>, kingdomKey <int>,
#> #   phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
#> #   genusKey <int>, acceptedTaxonKey <int>, scientificName <chr>,
#> #   acceptedScientificName <chr>, kingdom <chr>, phylum <chr>,
#> #   order <chr>, family <chr>, genus <chr>, genericName <chr>,
#> #   specificEpithet <chr>, taxonRank <chr>, taxonomicStatus <chr>,
#> #   dateIdentified <chr>, coordinateUncertaintyInMeters <dbl>,
#> #   stateProvince <chr>, year <int>, month <int>, day <int>,
#> #   eventDate <date>, modified <chr>, lastInterpreted <chr>,
#> #   references <chr>, license <chr>, geodeticDatum <chr>, class <chr>,
#> #   countryCode <chr>, country <chr>, rightsHolder <chr>,
#> #   identifier <chr>, verbatimEventDate <chr>, datasetName <chr>,
#> #   verbatimLocality <chr>, gbifID <chr>, collectionCode <chr>,
#> #   occurrenceID <chr>, taxonID <chr>, catalogNumber <chr>,
#> #   recordedBy <chr>, `http://unknown.org/occurrenceDetails` <chr>,
#> #   institutionCode <chr>, rights <chr>, eventTime <chr>,
#> #   occurrenceRemarks <chr>,
#> #   `http://unknown.org/http_//rs.gbif.org/terms/1.0/Multimedia` <chr>,
#> #   identificationID <chr>, informationWithheld <chr>,
#> #   nomenclaturalCode <chr>, locality <chr>, vernacularName <chr>,
#> #   fieldNotes <chr>, verbatimElevation <chr>, behavior <chr>,
#> #   higherClassification <chr>, sex <chr>, lifeStage <chr>,
#> #   establishmentMeans <chr>, infraspecificEpithet <chr>, continent <chr>,
#> #   recordNumber <chr>, higherGeography <chr>, dynamicProperties <chr>,
#> #   endDayOfYear <chr>, georeferenceVerificationStatus <chr>,
#> #   county <chr>, language <chr>, type <chr>, preparations <chr>,
#> #   occurrenceStatus <chr>, startDayOfYear <chr>,
#> #   bibliographicCitation <chr>, accessRights <chr>, institutionID <chr>,
#> #   dataGeneralizations <chr>, organismID <chr>,
#> #   ownerInstitutionCode <chr>, datasetID <chr>, collectionID <chr>,
#> #   habitat <chr>, georeferencedDate <chr>, georeferencedBy <chr>,
#> #   georeferenceProtocol <chr>, otherCatalogNumbers <chr>,
#> #   georeferenceSources <chr>, identificationRemarks <chr>,
#> #   individualCount <int>
Get data from many sources in a single call
ebirdopts <- list(loc = 'CA') # search in Canada only
gbifopts <- list(country = 'US') # search in United States only
out <- occ(query = 'Setophaga caerulescens', from = c('gbif','bison','inat','ebird'), gbifopts = gbifopts, ebirdopts = ebirdopts, limit = 50)
dat <- occ2df(out)
head(dat); tail(dat)
#> # A tibble: 6 x 6
#>   name                   longitude  latitude  prov  date       key
#>   <chr>                  <chr>      <chr>     <chr> <date>     <chr>
#> 1 Setophaga caerulescens -80.347459 25.743763 gbif  2018-01-20 1806338790
#> 2 Setophaga caerulescens -80.342233 25.77536  gbif  2018-01-19 1805421161
#> 3 Setophaga caerulescens -81.355815 28.569623 gbif  2018-03-14 1837766480
#> 4 Setophaga caerulescens -83.192381 41.627135 gbif  2018-04-28 1880571743
#> 5 Setophaga caerulescens -77.254868 39.006651 gbif  2018-04-29 1841263350
#> 6 Setophaga caerulescens -73.965355 40.782865 gbif  2018-04-29 1841260747
#> # A tibble: 6 x 6
#>   name                   longitude   latitude   prov  date       key
#>   <chr>                  <chr>       <chr>      <chr> <date>     <chr>
#> 1 Setophaga caerulescens -63.4497222 44.5938889 ebird 2018-11-08 <NA>
#> 2 Setophaga caerulescens -97.22659   49.8759422 ebird 2018-11-07 <NA>
#> 3 Setophaga caerulescens -97.227492  49.876486  ebird 2018-11-07 <NA>
#> 4 Setophaga caerulescens -79.3765    43.6799722 ebird 2018-11-06 <NA>
#> 5 Setophaga caerulescens -79.6037    43.516773  ebird 2018-11-03 <NA>
#> 6 Setophaga caerulescens -84.3526679 46.5101339 ebird 2018-11-03 <NA>
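In the combined output above, longitude and latitude are character columns, so they usually need converting before any numeric work. The following is a minimal sketch on a hand-built data frame that copies four rows from that output; with a live result you would apply the same steps to the object returned by occ2df().

```r
# Four rows copied from the printed output above; coordinates are character.
dat <- data.frame(
  name      = rep("Setophaga caerulescens", 4),
  longitude = c("-80.347459", "-80.342233", "-63.4497222", "-97.22659"),
  latitude  = c("25.743763", "25.77536", "44.5938889", "49.8759422"),
  prov      = c("gbif", "gbif", "ebird", "ebird"),
  date      = as.Date(c("2018-01-20", "2018-01-19", "2018-11-08", "2018-11-07")),
  key       = c("1806338790", "1805421161", NA, NA),
  stringsAsFactors = FALSE
)

# Convert coordinates to numeric so they can be filtered or plotted.
dat$longitude <- as.numeric(dat$longitude)
dat$latitude  <- as.numeric(dat$latitude)

# Split by provider to inspect each source separately.
by_prov <- split(dat, dat$prov)
n_by_prov <- vapply(by_prov, nrow, integer(1))
```

The prov column is what lets you trace each row back to its source after the providers have been merged into one table.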
All data cleaning functionality is in a new package, scrubr, available on CRAN.
All mapping functionality is now in a separate package, mapr (formerly known as spoccutils), to make spocc easier to maintain. mapr is available on CRAN.
Get citation information for spocc in R with citation(package = 'spocc')
occ() now attempts to collect errors from requests that fail and puts these error messages (character strings) in the $meta$errors spot. We cannot always collect errors, and some data providers do not error well: they do not provide a meaningful error message other than that there was an error. (#189) (#207)
occ() gains new parameter throw_warnings (logical). By default set to TRUE (matches previous behavior) and throws warnings about errors that occur and when no results are found for a query. We now prefix each warning with the data provider so you can match up an error or warning with a data provider and (hopefully) query. If set to FALSE, warnings are suppressed (#189) (#207)
spocc. The AntWeb API has been down for a while, and no response from maintainers (#202) (#203)
rebird on CRAN requires a few changes in parameters used. Importantly, eBird now wants species codes instead of full scientific names, but we internally attempt to handle this, so users still just pass scientific names (#205)
?spocc_duplicates manual file for duplicate records, refer to scrubr and CoordinateCleaner packages (#198)
inspect() manual file, clarify what the function does (#194)
occ() gains a return block with detail about what's returned from the function (#208)
occ() gains new parameter date to do date range based searches across data sources without having to know the vagaries of each data source (#181)
print.occdatind so that empty data.frames don't throw tibble warnings (#184)
stand_dates() due to ALA giving back a timestamp now (#182) (#185)
wicket C++ based package instead. So you no longer need V8, which should make installation easier on some platforms. (#172)
httr replaced with crul for HTTP requests (#174)
curl::curl_options() (#176)
as.*() functions can now pass on curl options to the http client (#177)
foo_ala() - the internal plugin for occ() that handles ALA queries: changed query from full text query using q=foo bar to q=taxon_name="foo bar" - in addition, improved error handling as sometimes the occurrences slot is returned in results but is empty, whereas before it seemed to always be absent if no results (#178)
ala (#98)
obis (#155)
occ2df() more robust to varied inputs - allowing for users that may, on purpose or not, have a subset of the data source slots normally in the occdat class object (#171)
rvertnet, a dependency dealing with data from VertNet, was failing on certain searches. rvertnet was fixed and a new version is on CRAN now. No changes here other than requiring the new version of rvertnet (#168)
datetime to observed_on.
is() to inherits(), and namespace all setNames() calls
rgbif::occ_data() instead of rgbif::occ_search()
rvertnet::searchbyterm() instead of rgbif::vertsearch()
occ() now allows queries that only pass from and one of the data source opts params (e.g., gbifopts) - allows specifying any options passed down to the internal functions used to do data queries without having to use the other params in occ (#163)
tibble for representing data.frames (#164)
encoding="UTF-8" in httr::content() calls to parse raw data from web requests (#160)
ridigbio as it's on CRAN - was using internal functions prior to this (#154)
has_coords also fixed. (#161)
data.table::setDF() instead of data.frame() to set a data.table style table to a data.frame
vertnet as an option to occ_options() to get the options for passing to vertopts in occ()
print.occdatind() - which in last version introduced a bug in this print method - wasn't fatal as only applied to empty slots in the output of a call to occ(), but nonetheless, not good (#159)
data.table for fast list to data.frame
as.vertnet() to coerce various inputs (e.g., result from occ(), occ2df(), or a key itself) to occurrence data objects (#142)
occ() gains two parameters start and page to facilitate paging through results across data sources, instead of having to page individually for each data source. Some sources use the start parameter, while others use the page parameter. See Paging section in ?occ for details on Paging (#140)
wkt_vis() now works with WKT polygons with multiple polygons, e.g., spocc::wkt_vis("POLYGON((-125 38.4, -121.8 38.4, -121.8 40.9, -125 40.9, -125 38.4), (-115 22.4, -111.8 22.4, -111.8 30.9, -115 30.9, -115 22.4))") (#147)
print.occdatind() to print more helpful info when a geometry search is used as opposed to a taxonomy based search (#149)
print.occdatind() to not fail when first element not present; proceeds to next slot with data (#143)
occ() failed when multiple geometry elements passed in along with taxonomic names (#146)
occ2df() for combining outputs to not fail when AntWeb doesn't give back dates (#144) (#145) - thanks @timcdlucas
occ2df() to not fail when date field missing (#141)
spocc (#136) (#124)
occ() function. Each data source is taken care of in a separate package or set of wrapper functions, and the man file now details what API parameters are being queried (#138)
Datetime variable changed to datetime
occurrenceID variable changed to occurrenceid
spoccutils (https://github.com/ropensci/spoccutils) (#132)
occ() gains new parameter has_coords - a global parameter (except for ebird and bison) to return only records with lat/long data. (#128)
type (#134) and rank (#133) parameters dropped from occ()
occ() is printed, we now include a message that total count of records found (not returned) is not completely known if ebird is included, because eBird does not include data on records found on their servers with requests to their API (#111)
as.*() (e.g., as.gbif) for most data sources. These functions take in occurrence keys or sets of keys, and retrieve detailed occurrence record data for each key (#112)
occ2df() now returns more fields. This function collapses all essential fields that are easy to get in all data sources: name, lat, long, prov, date, key. The key field is the occurrence key for each record, which you can use to keep track of individual records, get more data on the record, etc. (#103) (#108)
inspect() - takes output from occ() or individual occurrence keys and gets detailed occurrence data.
jsonlite, V8, utils, and methods. No longer importing: ggmap, maptools, rworldmap, sp, rgeos, RColorBrewer, rgdal, and leafletR. Packages removed mostly due to splitting off some functionality into spoccutils. Related issues: (#131) (#132)
methods, utils (#120)
occ2df() (#106)
wkt_vis() now only has an option to view a WKT shape in the browser.
assertthat, plyr, data.table, and XML (#102)
gistr now to post interactive geojson maps on GitHub gists (#100)
rgbif now must be v0.7.7 or greater (the latest version on CRAN).
occ2sp() removed. The occ_to_sp() function is the working version. (#97)
\donttest to \dontrun in examples as requested by CRAN maintainers (#99)
occ_names() to search only for taxonomic names. The goal here is to use this function if there is some question about what names you want to use to search for occurrences with. (#84). Suggested by @jarioksa
occ_names_options() to quickly get parameter options to pass to occ_names().
summary() method for the occdat S3 object that is output from occ() (#83)
spocc (README, vignette, occ() documentation file, at package startup), we make it clear that there could be duplicate records returned in certain scenarios. And a new documentation page detailing what to watch out for: ?spocc_duplicates. (#77)
latitude, decimalLatitude, Latitude, lat, and decimal_latitude. (#91)
limit parameter in occ() (#78)
limit to each function's options parameter, and it will work. Each data source can have a different parameter internally from limit, but now internally within spocc, we allow you to use limit so you don't have to know what the data source specific parameter is. (#81)
occ_options() gains new parameter where to print either in the console or to open man file in the IDE, or prints to console in command line R.
occ() gains new parameter callopts to pass on curl debugging options to httr::GET() (#35)
wkt_vis() now by default plots a well known text area (WKT) on an interactive mapbox map in your default browser. New parameter which allows you to choose the interactive map or a static ggplot2 map. (#70)
occ() gains new class. In the previous version of this package, a data.frame was printed. Now the data is assigned the object occdatind (short for occdat individual).
occ() now uses a print method for the occdatind class, adopted from dplyr, that prints a brief data.frame, with columns wrapped to fit the width of your console, and additional columns not printed given at bottom with their class type. Note that the print behavior for the resulting object of an occ() call remains the same. (#69) (#74)
whisker as a package import to use in the wkt_vis() function. (#70)
mapggplot() accepted the output of occ(), of class occdat, while the other two functions for mapping, mapleaflet() and mapgist(), accepted a data.frame. Now all three functions accept the output of occ(), an object of class occdat. (#75)
meta slot in each returned object (indexed by object$meta) contains spots for returned and found, to designate number of records returned, and number of records found. (#64)
name. (#71)
occ().
spocc depends on: rgbif. A number of input and output parameter names changed. A new version of rgbif was pushed to CRAN. (#56)
clean_spocc() started (not finished yet) to attempt to clean data. For example, one use case is removing impossible lat/long values (i.e., longitude values greater than absolute 180). Another, not implemented yet, is to remove points that are not in the country or habitat your points are supposed to be in. (#44)
fixnames() to trim species names with optional input parameters to make data easier to use for mapping.
wkt_vis() to visualize a WKT (well-known text) area on a map. Uses ggmap to pull down a Google map so that the visualization has some geographic and natural earth context. We'll soon introduce an interactive version of this function that will bring up a small Shiny app to draw a WKT area, then return those coordinates to your R session. (#44) (#34)
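The "impossible lat/long" use case described for clean_spocc() above can be illustrated in base R. This is only a sketch of the idea on made-up records, not the package's implementation; the latitude bound of 90 is added here alongside the longitude bound of 180 that the entry mentions.

```r
# Made-up records: row "c" has an impossible longitude, row "d" an impossible
# latitude, and row "e" a missing longitude.
dat <- data.frame(
  name      = c("a", "b", "c", "d", "e"),
  longitude = c(-104.1, 179.9, 190.0, 10.0, NA),
  latitude  = c(20.7, -89.5, 10.0, 95.0, 5.0),
  stringsAsFactors = FALSE
)

# Keep rows whose coordinates are present and within valid bounds:
# |longitude| <= 180 and |latitude| <= 90.
valid <- !is.na(dat$longitude) & !is.na(dat$latitude) &
  abs(dat$longitude) <= 180 & abs(dat$latitude) <= 90

cleaned <- dat[valid, ]
```

Since cleaning functionality has moved to scrubr, that package is the place to look for maintained versions of this kind of filter.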
to pull down a Google map so that the visualization has some geographic and natural earth context. We'll soon introduce an interactive version of this function that will bring up a small Shiny app to draw a WKT area, then return those coordinates to your R session. (#34)