Tidy Anomaly Detection

The 'anomalize' package enables a "tidy" workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it's quite simple to decompose time series, detect anomalies, and create bands separating the "normal" data from the anomalous data at scale (i.e. for multiple time series). Time series decomposition is used to remove trend and seasonal components via the time_decompose() function and methods include seasonal decomposition of time series by Loess ("stl") and seasonal decomposition by piecewise medians ("twitter"). The anomalize() function implements two methods for anomaly detection of residuals including using an inner quartile range ("iqr") and generalized extreme studentized deviation ("gesd"). These methods are based on those used in the 'forecast' package and the Twitter 'AnomalyDetection' package. Refer to the associated functions for specific references for these methods.


Travis buildstatus Coveragestatus CRAN_Status_Badge

anomalize enables a tidy workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it’s quite simple to decompose time series, detect anomalies, and create bands separating the “normal” data from the anomalous data.

Anomalize In 2 Minutes (YouTube)

Anomalize

Check out our entire Software Intro Series on YouTube!

Installation

You can install the development version with devtools or the most recent CRAN version with install.packages():

# devtools::install_github("business-science/anomalize")
install.packages("anomalize")

How It Works

anomalize has three main functions:

  • time_decompose(): Separates the time series into seasonal, trend, and remainder components
  • anomalize(): Applies anomaly detection methods to the remainder component.
  • time_recompose(): Calculates limits that separate the “normal” data from the anomalies!

Getting Started

Load the tidyverse and anomalize packages.

library(tidyverse)
library(anomalize)

Next, let’s get some data. anomalize ships with a data set called tidyverse_cran_downloads that contains the daily CRAN download counts for 15 “tidy” packages from 2017-01-01 to 2018-03-01.

tidyverse_cran_downloads %>%
    ggplot(aes(date, count)) +
    geom_point(color = "#2c3e50", alpha = 0.25) +
    facet_wrap(~ package, scale = "free_y", ncol = 3) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
    labs(title = "Tidyverse Package Daily Download Counts",
         subtitle = "Data from CRAN by way of cranlogs package")

Suppose we want to determine which daily download “counts” are anomalous. It’s as easy as using the three main functions (time_decompose(), anomalize(), and time_recompose()) along with a visualization function, plot_anomalies().

tidyverse_cran_downloads %>%
    # Data Manipulation / Anomaly Detection
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
    labs(title = "Tidyverse Anomalies", subtitle = "STL + IQR Methods") 

If you’re familiar with Twitter’s AnomalyDetection package, you can implement that method by combining time_decompose(method = "twitter") with anomalize(method = "gesd"). Additionally, we’ll adjust the trend = "2 months" to adjust the median spans, which is how Twitter’s decomposition method works.

# Get only lubridate downloads
lubridate_dloads <- tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>% 
    ungroup()
 
# Anomalize!!
lubridate_dloads %>%
    # Twitter + GESD
    time_decompose(count, method = "twitter", trend = "2 months") %>%
    anomalize(remainder, method = "gesd") %>%
    time_recompose() %>%
    # Anomaly Visualziation
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "Twitter + GESD Methods")

Last, we can compare to STL + IQR methods, which use different decomposition and anomaly detection approaches.

lubridate_dloads %>%
    # STL + IQR Anomaly Detection
    time_decompose(count, method = "stl", trend = "2 months") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "STL + IQR Methods")

But Wait, There’s More!

There are a several extra capabilities:

  • time_frequency() and time_trend() for generating frequency and trend spans using date and datetime information, which is more intuitive than selecting numeric values. Also, period = "auto" automatically selects frequency and trend spans based on the time scale of the data.
# Time Frequency
time_frequency(lubridate_dloads, period = "auto")
#> frequency = 7 days
#> [1] 7
# Time Trend
time_trend(lubridate_dloads, period = "auto")
#> trend = 91 days
#> [1] 91
  • plot_anomaly_decomposition() for visualizing the inner workings of how algorithm detects anomalies in the “remainder”.
tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>%
    ungroup() %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    plot_anomaly_decomposition() +
    labs(title = "Decomposition of Anomalized Lubridate Downloads")
  • Vector functions for anomaly detection: iqr() and gesd(). These are great for just using on numeric data. Note that trend and seasonality should already be removed for time series data.
# Data with outliers
set.seed(100)
x <- rnorm(100)
idx_outliers <- sample(100, size = 5)
x[idx_outliers] <- x[idx_outliers] + 10
 
# IQR method
iqr(x, alpha = 0.05, max_anoms = 0.2)
#>   [1] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [12] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [23] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No" 
#>  [34] "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [45] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [56] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [67] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [78] "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [89] "No"  "Yes" "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No" 
#> [100] "No"
  • Anomaly Reports: Using verbose = TRUE, we can return a nice report of useful information related to the outliers.
lubridate_dloads %>%
    time_decompose(count) %>%
    anomalize(remainder, verbose = TRUE)
#> $anomalized_tbl
#> # A time tibble: 425 x 8
#> # Index: date
#>    date       observed season trend remainder remainder_l1 remainder_l2
#>    <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
#>  1 2017-01-01     643. -2078. 2474.      246.       -3323.        3310.
#>  2 2017-01-02    1350.   518. 2491.    -1659.       -3323.        3310.
#>  3 2017-01-03    2940.  1117. 2508.     -685.       -3323.        3310.
#>  4 2017-01-04    4269.  1220. 2524.      525.       -3323.        3310.
#>  5 2017-01-05    3724.   865. 2541.      318.       -3323.        3310.
#>  6 2017-01-06    2326.   356. 2558.     -588.       -3323.        3310.
#>  7 2017-01-07    1107. -1998. 2574.      531.       -3323.        3310.
#>  8 2017-01-08    1058. -2078. 2591.      545.       -3323.        3310.
#>  9 2017-01-09    2494.   518. 2608.     -632.       -3323.        3310.
#> 10 2017-01-10    3237.  1117. 2624.     -504.       -3323.        3310.
#> # ... with 415 more rows, and 1 more variable: anomaly <chr>
#> 
#> $anomaly_details
#> $anomaly_details$outlier
#>   [1] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [12] "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [23] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [34] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [45] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [56] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [67] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [78] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [89] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [100] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No" 
#> [111] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [122] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [133] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [144] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [155] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [166] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [177] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [188] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [199] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [210] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [221] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [232] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [243] "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No" 
#> [254] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [265] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [276] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [287] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [298] "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No" 
#> [309] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "No" 
#> [320] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [331] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "No"  "No" 
#> [342] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [353] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No" 
#> [364] "No"  "No"  "Yes" "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No" 
#> [375] "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [386] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [397] "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "Yes" "Yes" "No" 
#> [408] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [419] "Yes" "Yes" "No"  "No"  "No"  "No"  "No" 
#> 
#> $anomaly_details$outlier_idx
#>  [1] 419 405 370 420 303 406 250 366 318 244 338 317  12 378 339 404 403
#> [18] 361 109
#> 
#> $anomaly_details$outlier_vals
#>  [1] -8518.886 -7779.522 -6293.275 -6218.430  5557.429 -5477.838  4619.824
#>  [8] -4553.173  4240.767 -4136.721  3804.789  3626.129 -3522.194  3494.339
#> [15]  3486.598  3477.376  3385.065 -3381.355  3347.284
#> 
#> $anomaly_details$outlier_direction
#>  [1] "Down" "Down" "Down" "Down" "Up"   "Down" "Up"   "Down" "Up"   "Down"
#> [11] "Up"   "Up"   "Down" "Up"   "Up"   "Up"   "Up"   "Down" "Up"  
#> 
#> $anomaly_details$critical_limits
#> limit_lower limit_upper 
#>   -3323.425    3310.268 
#> 
#> $anomaly_details$outlier_report
#> # A tibble: 85 x 7
#>     rank index  value limit_lower limit_upper outlier direction
#>    <dbl> <dbl>  <dbl>       <dbl>       <dbl> <chr>   <chr>    
#>  1    1.  419. -8519.      -3323.       3310. Yes     Down     
#>  2    2.  405. -7780.      -3323.       3310. Yes     Down     
#>  3    3.  370. -6293.      -3323.       3310. Yes     Down     
#>  4    4.  420. -6218.      -3323.       3310. Yes     Down     
#>  5    5.  303.  5557.      -3323.       3310. Yes     Up       
#>  6    6.  406. -5478.      -3323.       3310. Yes     Down     
#>  7    7.  250.  4620.      -3323.       3310. Yes     Up       
#>  8    8.  366. -4553.      -3323.       3310. Yes     Down     
#>  9    9.  318.  4241.      -3323.       3310. Yes     Up       
#> 10   10.  244. -4137.      -3323.       3310. Yes     Down     
#> # ... with 75 more rows

References

Several other packages were instrumental in developing anomaly detection methods used in anomalize:

  • Twitter’s AnomalyDetection, which implements decomposition using median spans and the Generalized Extreme Studentized Deviation (GESD) test for anomalies.
  • forecast::tsoutliers() function, which implements the IQR method.

News

anomalize 0.1.1

  • Issue #2: Bugfixes for various ggplot2 issues in plot_anomalies(). Solves "Error in FUN(X[[i]], ...) : object '.group' not found".
  • Issue #6: Bugfixes for invalid unary operator error in plot_anomaly_decomposition(). Solves "Error in -x : invalid argument to unary operator".

anomalize 0.1.0

  • Added a NEWS.md file to track changes to the package.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("anomalize")

0.1.1 by Matt Dancho, a year ago


https://github.com/business-science/anomalize


Report a bug at https://github.com/business-science/anomalize/issues


Browse source code at https://github.com/cran/anomalize


Authors: Matt Dancho [aut, cre] , Davis Vaughan [aut]


Documentation:   PDF Manual  


Task views: Time Series Analysis


GPL (>= 3) license


Imports dplyr, glue, timetk, sweep, tibbletime, purrr, rlang, tibble, tidyr, ggplot2

Suggests tidyverse, tidyquant, testthat, covr, knitr, rmarkdown, devtools, roxygen2


See at CRAN