Elasticsearch is an open-source, distributed, document-based datastore (https://www.elastic.co/products/elasticsearch). It provides an HTTP API for querying the database and extracting datasets, but that API was not designed for common data science workflows like pulling large batches of records and normalizing those documents into a data frame that can be used as a training dataset for statistical models. `uptasticsearch` provides an interface to Elasticsearch that is explicitly designed to make these data science workflows easy and fun.
This project tackles the issue of getting data out of Elasticsearch and into a tabular format in R.
The core functionality of this package is the `es_search` function. This returns a `data.table` containing the parsed result of any given query. Note that this includes `aggs` queries.
Releases of this package can be installed from CRAN:

```r
install.packages('uptasticsearch')
```
To use the development version of the package, which has the newest changes, you can install directly from GitHub:

```r
devtools::install_github("UptakeOpenSource/uptasticsearch", subdir = "r-pkg")
```
This package is not currently available on PyPI. To build the development version from source, clone this repo, then:

```shell
cd py-pkg
pip install .
```
The examples presented here pertain to a fictional Elasticsearch index holding some information on a movie theater business.
The most common use case for this package is one where you have an Elasticsearch query and want a data frame representation of the many resulting documents. In the example below, we use `uptasticsearch` to look for all survey results in which customers said their satisfaction was "low" or "very low" and mentioned food in their comments.
```r
library(uptasticsearch)

# Build your query in an R string
qbody <- '{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"exists": {"field": "customer_comments"}},
                        {"terms": {"overall_satisfaction": ["very low", "low"]}}
                    ]
                }
            },
            "query": {
                "match_phrase": {"customer_comments": "food"}
            }
        }
    }
}'

# Execute the query, parse into a data.table
commentDT <- es_search(
    es_host = "http://mydb.mycompany.com:9200"
    , es_index = "survey_results"
    , query_body = qbody
    , scroll = "1m"
    , n_cores = 4
)
```
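To build intuition for what `es_search` does with the response, it can help to see the shape of the raw `hits` JSON that Elasticsearch returns and how each document's `_source` flattens into one row of a table. The sketch below is illustrative only: the sample documents and the `flatten_hits` helper are invented here and are not part of the package's actual implementation.

```python
import json

# A miniature, hand-written version of the "hits" section of an
# Elasticsearch response. Field values are invented for illustration.
raw_response = json.loads('''{
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"customer_comments": "the food was cold",
                                     "overall_satisfaction": "low"}},
            {"_id": "2", "_source": {"customer_comments": "great food, long lines",
                                     "overall_satisfaction": "very low"}}
        ]
    }
}''')

def flatten_hits(response):
    """Turn each hit's _source into a flat record, keeping the document _id."""
    records = []
    for hit in response["hits"]["hits"]:
        record = {"_id": hit["_id"]}
        record.update(hit["_source"])
        records.append(record)
    return records

records = flatten_hits(raw_response)
```

Each record in `records` corresponds to one row of the table you get back, with the document `_id` carried along as a column.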
Elasticsearch ships with a rich set of aggregations for creating summarized views of your data. `uptasticsearch` has built-in support for these aggregations. In the example below, we use `uptasticsearch` to create a daily time series of summary statistics like total revenue and average payment amount.
```r
library(uptasticsearch)

# Build your query in an R string
qbody <- '{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"exists": {"field": "pmt_amount"}}
                    ]
                }
            }
        }
    },
    "aggs": {
        "timestamp": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "day"
            },
            "aggs": {
                "revenue": {
                    "extended_stats": {"field": "pmt_amount"}
                }
            }
        }
    },
    "size": 0
}'

# Execute the query, parse result into a data.table
revenueDT <- es_search(
    es_host = "http://mydb.mycompany.com:9200"
    , es_index = "transactions"
    , size = 1000
    , query_body = qbody
    , n_cores = 1
)
```
In the example above, we used the `date_histogram` and `extended_stats` aggregations. `es_search` has built-in support for many other aggregations and combinations of aggregations, with more on the way. Please see the table below for the current status of the package. Note that names of the form "agg1 - agg2" refer to the ability to handle aggregations nested inside other aggregations.
| Agg type | R support? | Python support? |
|---|---|---|
| cardinality | YES | NO |
| date_histogram | YES | NO |
| date_histogram - cardinality | YES | NO |
| date_histogram - extended_stats | YES | NO |
| date_histogram - histogram | YES | NO |
| date_histogram - percentiles | YES | NO |
| date_histogram - significant_terms | YES | NO |
| date_histogram - stats | YES | NO |
| date_histogram - terms | YES | NO |
| extended_stats | YES | NO |
| histogram | YES | NO |
| percentiles | YES | NO |
| significant_terms | YES | NO |
| stats | YES | NO |
| terms | YES | NO |
| terms - cardinality | YES | NO |
| terms - date_histogram | YES | NO |
| terms - date_histogram - cardinality | YES | NO |
| terms - date_histogram - extended_stats | YES | NO |
| terms - date_histogram - histogram | YES | NO |
| terms - date_histogram - percentiles | YES | NO |
| terms - date_histogram - significant_terms | YES | NO |
| terms - date_histogram - stats | YES | NO |
| terms - date_histogram - terms | YES | NO |
| terms - extended_stats | YES | NO |
| terms - histogram | YES | NO |
| terms - percentiles | YES | NO |
| terms - significant_terms | YES | NO |
| terms - stats | YES | NO |
| terms - terms | YES | NO |
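To see why nested aggregations need dedicated parsing support, consider what a `date_histogram` - `extended_stats` response looks like: each date bucket contains a nested stats object, and the flat table needs one row per bucket with the stats spread across columns. The sketch below is a hand-written illustration of that unpacking; the sample response and the `chomp_date_histogram` helper are invented here, not the package's actual code.

```python
# A hand-written miniature of an Elasticsearch "aggregations" response for a
# date_histogram agg named "timestamp" with a nested extended_stats agg named
# "revenue". Values are invented for illustration.
sample_aggs = {
    "aggregations": {
        "timestamp": {
            "buckets": [
                {"key_as_string": "2017-01-01", "doc_count": 2,
                 "revenue": {"count": 2, "min": 5.0, "max": 15.0,
                             "avg": 10.0, "sum": 20.0}},
                {"key_as_string": "2017-01-02", "doc_count": 1,
                 "revenue": {"count": 1, "min": 7.5, "max": 7.5,
                             "avg": 7.5, "sum": 7.5}}
            ]
        }
    }
}

def chomp_date_histogram(response, agg_name, stats_name):
    """One row per date bucket; nested stats become prefixed columns."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        row = {agg_name: bucket["key_as_string"],
               "doc_count": bucket["doc_count"]}
        for stat, value in bucket[stats_name].items():
            row["{}.{}".format(stats_name, stat)] = value
        rows.append(row)
    return rows

rows = chomp_date_histogram(sample_aggs, "timestamp", "revenue")
```

Each element of `rows` is one day of the time series, with columns like `revenue.avg` and `revenue.sum` alongside the bucket key and document count.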
- Removed calls to `closeAllConnections()` in unit tests because they were superfluous and causing problems on certain operating systems in the CRAN check farm.
- Changed `unique(outDT)` to `unique(outDT, by = "_id")`. This was prompted by Rdatatable/data.table#3332 (changes in `data.table` 1.12.0), but it's actually faster and safer anyway!
- Requests now declare a `Content-Type` header. Previous versions of ES tried to guess the `Content-Type` when none was declared.
- `uptasticsearch` will now hit the cluster to try to figure out which version of ES it is running, then use the appropriate scrolling strategy.
- Fixed `get_fields` when your index has no aliases: `get_fields` broke on some legacy versions of Elasticsearch where no aliases had been created. The response on the `_cat/aliases` endpoint has changed from major version to major version. #66 fixed this for all major versions of ES from 1.0 to 6.2.
- Fixed `get_fields` when your index has multiple aliases: previously, `get_fields` would only return one of those. As of #73, mappings for the underlying physical index will now be duplicated once per alias in the table returned by `get_fields`.
- `uptasticsearch` attempts to query the ES host to figure out what major version of Elasticsearch is running there. Implementation errors in that PR led to versions being parsed incorrectly but silently passing tests. This was fixed in #66. NOTE: this only impacted the dev version of the library on GitHub.
- Fixed `ignore_scroll_restriction` not being respected: the value passed to `es_search` for `ignore_scroll_restriction` was not actually respected. This was possible because an internal function had defaults specified, so we never caught the fact that that value wasn't getting passed through. #66 instituted the practice of not specifying defaults on function arguments in internal functions, so similar bugs won't be able to silently get through testing in the future.
- Removed `get_counts`. This function was outside the core mission of the package and exposed us unnecessarily to changes in the Elasticsearch DSL.
- `unpack_nested_data`
- Use `httr::RETRY` instead of one-shot `POST` or `GET` calls.
- `get_fields` returns a data.table with the names and types of all indexed fields across one or more indices.
- `es_search` now accepts an `intermediates_dir` parameter, giving users control over the directory used for temporary I/O at query time.
- `es_search` executes an ES query and gets a data.table.
- `chomp_aggs` converts a raw aggs JSON to data.table.
- `chomp_hits` converts a raw hits JSON to data.table.
- `unpack_nested_data` deals with nested Elasticsearch data not in a tabular format.
- `parse_date_time` parses date-times from Elasticsearch records.
- `get_counts` examines the distribution of distinct values for a field in Elasticsearch.
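To illustrate the idea behind `unpack_nested_data`: a document field holding an array is expanded into long format, one output row per array element, with the document's scalar fields repeated alongside. The sketch below is a rough illustration of that transformation; the sample records and the `unpack_nested` helper are invented here and are not the package's actual implementation.

```python
def unpack_nested(records, nested_field):
    """Expand a list-valued field into long format: one output row per
    element, with all other fields repeated on each row."""
    long_rows = []
    for record in records:
        scalars = {k: v for k, v in record.items() if k != nested_field}
        for element in record.get(nested_field, []):
            row = dict(scalars)
            row[nested_field] = element
            long_rows.append(row)
    return long_rows

# Invented example: concession-stand transactions, each holding a list of items
records = [{"transaction_id": "t1", "items": ["popcorn", "soda"]},
           {"transaction_id": "t2", "items": ["candy"]}]
long_rows = unpack_nested(records, "items")
```

The two-item transaction becomes two rows and the one-item transaction becomes one, giving a flat table suitable for grouping and joining.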