Potential outliers are identified for all combinations of a dataset's variables. O3 plots are described in Unwin (2019).
OutliersO3 is for visualising results of outlier analyses. Overview of Outliers (O3) plots show which cases are identified as potential outliers for different combinations of variables from a dataset.
You can compare sets of outliers identified by up to six different methods. You can also compare results for a single method at up to three different tolerance levels.
install.packages("OutliersO3")
Flury and Riedwyl introduced the famous banknote dataset in their excellent book on multivariate statistics. There are measurements on 100 genuine banknotes and on 100 counterfeit banknotes. Presumably the genuine notes should all be very similar.
The method mvBACON from robustX has been used to identify possible outliers. There are 6 numeric measurements of the notes, so there are 63 possible variable combinations. An O3 plot has one row for each variable combination for which outliers were found and those variables are specified by the relevant columns on the left of the plot. The cases identified as outliers for at least one combination each get a column to the right of the plot.
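The count of 63 follows from taking every non-empty subset of the six variables (2^6 - 1). A minimal base-R sketch of that enumeration is given here; the variable names are assumed to match the columns of the banknote data in mclust, and the object names vars, combos and n_combos are illustrative only.

```r
# Names of the six numeric banknote measurements (assumed column names)
vars <- c("Length", "Left", "Right", "Bottom", "Top", "Diagonal")

# Enumerate all non-empty subsets, one subset size k at a time (k = 1, ..., 6)
combos <- unlist(
  lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)

# 6 + 15 + 20 + 15 + 6 + 1 combinations in total
n_combos <- length(combos)
stopifnot(n_combos == 2^length(vars) - 1)
```

An O3 plot only shows rows for those combinations in which at least one outlier was found, so it usually has far fewer rows than this maximum.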
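library(OutliersO3)
library(dplyr)  # provides %>%, filter() and select()
data(banknote, package="mclust")
# Keep only the genuine notes and drop the Status column
data <- banknote %>% filter(Status=="genuine") %>% select(-Status)
# Run mvBACON at three tolerance levels
pB <- O3prep(data, method="BAC", tols=c(0.05, 0.01, 0.001), boxplotLimits=c(6,10,12))
pX <- O3plotT(pB)
pX$gO3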
The O3 plot shows outliers found by the mvBACON method for three tolerance levels. Two banknotes, X71 and X5, are only identified for a few combinations, mainly at a level of 0.05. A further banknote, X40, is identified more often, mostly at a level of 0.01. Two banknotes, X1 and X70, were identified as outliers at a level of 0.001: X1 for the combination of attributes Length and Right, and X70 for Diagonal on its own. When X1 is identified as an outlier at other levels, the attribute Right is always involved. The supporting parallel coordinate plot suggests why:
pX$gpcp
This plot also suggests that all five cases identified as potential outliers are relatively extreme on at least one of the six attributes.
There are more examples in the package vignettes.
The authors of the package cellWise have renamed their function DetectDeviatingCells to DDC. This required a minor code change in OutliersO3 and amendments to the help files.
(Thanks to anonymous referees and not-so-anonymous JCGS editors for suggesting some of these changes.)
mvBACON and covMcd are now used to find their own 1-d outliers instead of boxplot limits.
Changed the chi-square degrees of freedom used in setting thresholds for covMcd from the number of variables in the dataset to the number of variables in the subspace.
Added legends to O3 plots and no longer include plot titles.
There are now two white columns separating the left and right blocks of O3 plots.
Added the options sortVars (whether variables in an O3 plot are sorted or not) and coltxtsize (text size of column names) to O3plotT and O3plotM.
Simplified entry for the boxplotLimits parameter vector.
Corrected the index entries for the vignettes so that they are all different.
Relaxed the dependence on R from ≥ R 3.4.0 to ≥ R 3.3.0 to ensure that the package would run under r-oldrel-osx-x86_64 and thus pass all CRAN package checks.
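The effect of the change in chi-square degrees of freedom for covMcd noted above can be illustrated with base R. This is only a sketch: it assumes the threshold is a chi-square quantile at 1 - tol applied to squared robust distances, which is a common convention but is not spelled out here, and the names tol, p_subspace and p_dataset are illustrative.

```r
tol <- 0.05       # illustrative tolerance level
p_subspace <- 2   # variables in the combination being checked
p_dataset <- 6    # variables in the full dataset

# Cutoff with degrees of freedom matched to the subspace
cut_subspace <- qchisq(1 - tol, df = p_subspace)

# Cutoff if the dataset dimension were used instead
cut_dataset <- qchisq(1 - tol, df = p_dataset)

# The dataset-based cutoff is larger, so it would flag fewer cases
# in a low-dimensional subspace than the nominal tolerance implies
stopifnot(cut_dataset > cut_subspace)
```

Under that assumption, matching the degrees of freedom to the subspace keeps the nominal tolerance level meaningful for every variable combination.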
There are three major changes to the package.
The main function O3plot has been split into three parts. O3prep sets up a list of lists of outlier indices and distances to be plotted by O3plotM (if more than one outlier method is involved) or by O3plotT (if the plot is for one method and up to three different tolerance levels). Separating preparation and plotting means that users can prepare outlier results with their own methods and code and then draw an O3 plot with this package.
If an O3 plot is to be drawn for more than one method, then the outlier tolerance levels can be set individually for each method. This is now the default as using the same levels for all methods was rarely satisfactory. A vignette has been added to illustrate this.
Argument names have been "R-standardised" and internal functions hidden. (Thanks to Michael Friendly for helpful comments and to Bill Venables for sage advice.)
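The two-step workflow with several methods and individual tolerance levels might look like the following sketch. The per-method argument names (tolHDo, tolBAC, tolMCD) and the method codes are assumptions based on the package help pages, and the tolerance values are illustrative, so check ?O3prep before relying on them. The code only runs the package calls if OutliersO3 and mclust are installed.

```r
# Per-method tolerance levels; values are illustrative only.
# Argument names are assumed from the O3prep help page.
prep_args <- list(
  method = c("HDo", "BAC", "MCD"),
  tolHDo = 0.05,
  tolBAC = 0.01,
  tolMCD = 0.001
)

if (requireNamespace("OutliersO3", quietly = TRUE) &&
    requireNamespace("mclust", quietly = TRUE)) {
  data(banknote, package = "mclust")
  # Genuine notes only, without the Status column
  genuine <- banknote[banknote$Status == "genuine",
                      names(banknote) != "Status"]

  # Step 1: prepare the outlier results for all variable combinations
  pM <- do.call(OutliersO3::O3prep, c(list(data = genuine), prep_args))

  # Step 2: build the plots for a comparison of several methods
  pMx <- OutliersO3::O3plotM(pM)
  o3_plot <- pMx$gO3   # the O3 plot itself
}
```

Keeping the tolerance levels in a list makes it easy to rerun the preparation step with different settings while leaving the plotting step unchanged.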
The output now includes a table of all outliers found by case, variable combination, and method/tolerance level. It also includes a three-dimensional array of all the outlier distances/scores by variable combination by method for all methods that provide them.
Two further plots have been added: an O3 plot for more than two methods in which only case combinations identified by at least two methods are displayed, and a parallel coordinate plot of outlier distances (when provided by methods), either by methods (O3plotM) or by combinations (O3plotT).
The method covMcd{robustbase} has been added as an optional method.
For an O3 plot with one method and more than one outlier level, you can enter the levels in the arguments tols and boxplotLimits in any order. (Thanks to Tae-Rae Kim for this suggestion.)
For an O3 plot with more than one method, you can enter the methods in any order.
If a dataset includes case identifiers, then these can be used as labels for the case columns in the O3 plot through the argument caseNames.
O3 plot rows are now sorted by numbers of outliers within size of variable combinations. (Thanks to Nina Wu for pointing out that this was not working as intended.)
The default colours in an O3 plot for comparing two methods have been changed to be more distinctive.
Colours can be specified using either colours or colors (or col) in O3plotColours.
First version on CRAN.