Provides measures to characterize the complexity of classification
and regression problems based on aspects that quantify the linearity of the
data, the presence of informative features, the sparsity and dimensionality
of the datasets. This package provides bug fixes, generalizations and
implementations of many state-of-the-art measures. The measures are
described in the papers: Lorena et al. (2019)
The Extended Complexity Library (ECoL) is an R implementation of a set of measures to characterize the complexity of classification and regression problems based on aspects that quantify the linearity of the data, the presence of informative features, the sparsity and dimensionality of the datasets. The measures were originally proposed by Ho and Basu [1] and extended by many other works, including the DCoL library [2]. The main difference between the libraries is that ECoL provides bug fixes, generalizations and implementations of many other state-of-the-art measures.
The measures can be divided into two groups: classification and regression measures. The classification measures are based on: (1) feature overlapping measures, (2) neighborhood measures, (3) linearity measures, (4) dimensionality measures, (5) class balance measures and (6) network measures. The regression measures are based on: (3) linearity measures, (4) dimensionality measures, (7) correlation measures and (8) smoothness measures.
Measures of overlapping
Measures of neighborhood information
Measures of linearity
Measures of dimensionality
Measures of class balance
Measures of structural representation
Measures of feature correlation
Measures of smoothness
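To make the first group concrete, the following is a minimal base-R sketch of one classical feature overlapping measure: Fisher's discriminant ratio as defined by Ho and Basu [1] for a two-class problem, that is, the squared difference of the class means divided by the sum of the class variances. The helper name fisher_ratio is hypothetical and not part of ECoL, and ECoL's own F1 measure may normalize the value differently and handle more than two classes, so treat this only as an illustration of the idea.

```r
# Hypothetical helper, NOT part of ECoL: Fisher's discriminant ratio of a
# single numeric feature x for a two-class label vector y.
fisher_ratio <- function(x, y) {
  y <- factor(y)
  stopifnot(nlevels(y) == 2)
  mu <- tapply(x, y, mean)  # per-class mean
  s2 <- tapply(x, y, var)   # per-class variance
  as.numeric((mu[1] - mu[2])^2 / (s2[1] + s2[2]))
}

# Two-class subset of iris (setosa vs. versicolor)
two <- droplevels(iris[iris$Species != "virginica", ])

# One ratio per feature; a larger value means the two classes overlap
# less along that feature
ratios <- sapply(two[, 1:4], fisher_ratio, y = two$Species)
print(round(ratios, 2))
```

A feature with a high ratio separates the two classes well on its own, which is the kind of information the overlapping measures summarize.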
The installation process is similar to other packages available on CRAN:
install.packages("ECoL")
It is possible to install the development version using:
if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("lpfgarcia/ECoL")
library("ECoL")
The simplest way to compute the complexity measures is to use the complexity method. The method can be called with a symbolic description of the model (a formula) or with a data frame. The parameters are the dataset, the type of task and the group of measures to be extracted. For a classification task, type must be set to "class"; for a regression task, to "regr". By default, all measures are extracted. To extract a specific measure, use the function related to its group. A simple example is given next:
## Extract all complexity measures for classification task
complexity(Species ~ ., iris, type="class")

## Extract all complexity measures for regression task
complexity(speed ~ ., cars, type="regr")

## Extract all complexity measures using data frame for classification task
complexity(iris[,1:4], iris[,5], type="class")

## Extract the overlapping measures
complexity(Species ~ ., iris, type="class", groups="overlapping")

## Extract the F1 measure using overlapping function
overlapping(Species ~ ., iris, measures="F1")
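The formula and data frame calling styles describe the same inputs. As a rough sketch of how a formula plus a dataset maps onto a predictor data frame and a response vector, the base-R function model.frame can be used. The helper split_formula is hypothetical and only illustrates the correspondence; it is an assumption, not a description of how ECoL processes formulas internally.

```r
# Hypothetical helper, NOT part of ECoL: split a formula and a data frame
# into the predictors (x) and the response (y).
split_formula <- function(formula, data) {
  mf <- model.frame(formula, data)   # response is the first column
  list(x = mf[, -1, drop = FALSE],   # all remaining columns are predictors
       y = mf[[1]])
}

# Species ~ . selects Species as the response and every other column as input
parts <- split_formula(Species ~ ., iris)
str(parts$x)
str(parts$y)
```

Under this reading, Species ~ . with iris corresponds to passing iris[,1:4] and iris[,5] directly, as in the data frame call.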
To cite ECoL in publications use:
Lorena, A. C., Garcia, L. P. F., Lehmann, J., de Souto, M. C. P., and Ho, T. K. (2018). How Complex is your classification problem? A survey on measuring classification complexity. arXiv:1808.03591
Lorena, A. C., Maciel, A. I., de Miranda, P. B. C., Costa, I. G., and Prudêncio, R. B. C. (2018). Data complexity meta-features for regression problems. Machine Learning, 107(1):209-246.
To report bugs or request features, use the project issues page.
[1] Ho, T., and Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289-300.
[2] Orriols-Puig, A., Maciá, N., and Ho, T. (2010). Documentation for the data complexity library in C++. Technical report, La Salle - Universitat Ramon Llull.
[3] R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.