Statistical Inference and Sure Independence Screening via Ball Statistics

Hypothesis tests and sure independence screening (SIS) procedures based on ball statistics, including ball divergence, ball covariance, and ball correlation, are developed to analyze complex data in metric spaces, e.g., shape, directional, compositional, and symmetric positive definite matrix data. The distribution-free tests based on ball divergence and ball covariance are implemented to detect distribution differences and associations in metric spaces. Furthermore, several generic non-parametric feature selection procedures based on ball correlation, BCor-SIS and all of its variants, are implemented to tackle the challenges of ultra-high-dimensional data.

Ball Statistics

The fundamental problems for data mining and statistical analysis are:

  • Are the distributions of two samples distinct?

  • Are two random variables dependent?

The Ball package provides solutions to these problems. Moreover, a variable screening (or feature screening) procedure is also implemented to tackle ultra-high-dimensional data. The core functions in the Ball package are bd.test, bcov.test, and bcorsis.

These functions, which are based on ball statistics, have several advantages:

  • They are applicable to univariate and multivariate data in Banach spaces.

  • They require no moment assumptions, which means that outliers and heavy-tailed data are no longer a problem.

  • They perform well in many settings without complex parameter tuning.

In particular, for two-sample or K-sample problems, bd.test has been shown to cope well with imbalanced data, and bcov.test and bcorsis work well for detecting relationships between complex responses and/or predictors, such as shape, compositional, and censored data.


CRAN version

To install the Ball R package from CRAN, just run:

install.packages("Ball")


Github version

To install the development version from GitHub, run:

library(devtools)
install_github("Mamba413/Ball/R-package", build_vignettes = TRUE)

Windows users will need to install Rtools first.


Take the iris dataset as an example to illustrate how to use bd.test and bcov.test to deal with the fundamental problems mentioned above.


library(Ball)
virginica <- iris[iris$Species == "virginica", "Sepal.Length"]
versicolor <- iris[iris$Species == "versicolor", "Sepal.Length"]
bd.test(virginica, versicolor)

In this example, bd.test examines the hypothesis that the Sepal.Length distributions of versicolor and virginica are equal.

If this hypothesis does not hold, the p-value of bd.test is expected to fall below 0.05.

In this example, the result is:

    2-Samples Ball Divergence Test

data:  virginica and versicolor 
number of observations = 100, group sizes: 50 50
replicates = 99
bd = 0.32912, p-value = 0.01
alternative hypothesis: distributions of samples are distinct

The R output shows that the p-value is below 0.05. Consequently, we can conclude that the Sepal.Length distributions of versicolor and virginica are distinct.
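The quantities in this output can be made concrete with a from-scratch sketch in base R. This is an illustration under stated assumptions, not the package's implementation: ball_divergence and perm_p_value are hypothetical helper names, the statistic follows the empirical ball divergence as we understand it from the ball divergence literature (for every pair of points from one sample, compare the share of each sample that falls inside the closed ball they define), and the p-value convention (1 + number of permuted statistics at least as large as the observed one) / (replicates + 1) is an assumption that agrees with the output above, where 99 replicates make 0.01 the smallest attainable p-value. The statistic it computes need not match the bd value reported by bd.test.

```r
# Sketch of an empirical ball divergence for two numeric vectors.
# Assumption: this mirrors the textbook definition, not the package internals.
ball_divergence <- function(x, y) {
  # For centres c_i, c_j taken from one sample, compare the share of x and the
  # share of y inside the closed ball centred at c_i with radius |c_i - c_j|.
  part <- function(centres) {
    total <- 0
    for (ci in centres) {
      radius <- abs(centres - ci)
      share_x <- colMeans(outer(abs(x - ci), radius, "<="))
      share_y <- colMeans(outer(abs(y - ci), radius, "<="))
      total <- total + sum((share_x - share_y)^2)
    }
    total / length(centres)^2
  }
  part(x) + part(y)
}

# Generic two-sample permutation p-value (hypothetical helper).
perm_p_value <- function(x, y, statistic, replicates = 99) {
  observed <- statistic(x, y)
  pooled <- c(x, y)
  n_x <- length(x)
  permuted <- replicate(replicates, {
    shuffled <- sample(pooled)  # random relabelling of the pooled sample
    statistic(shuffled[seq_len(n_x)], shuffled[-seq_len(n_x)])
  })
  (1 + sum(permuted >= observed)) / (replicates + 1)
}

set.seed(1)
virginica <- iris[iris$Species == "virginica", "Sepal.Length"]
versicolor <- iris[iris$Species == "versicolor", "Sepal.Length"]
p_value <- perm_p_value(virginica, versicolor, ball_divergence)
print(p_value)
```

Because the sketch only uses distances between observations, the same idea carries over to other metric spaces by replacing the absolute difference with a suitable distance.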


sepal <- iris[, c("Sepal.Width", "Sepal.Length")]
petal <- iris[, c("Petal.Width", "Petal.Length")]
bcov.test(sepal, petal)

In this example, bcov.test investigates whether the width and length of the petal are associated with the width and length of the sepal. If the dependence really exists, the p-value of bcov.test is expected to fall below 0.05.

In this example, the result is:

    Ball Covariance test of independence

data:  sepal and petal
number of observations = 150
replicates = 99, Weighted Ball Covariance = FALSE
bcov = 0.0081472, p-value = 0.01
alternative hypothesis: random variables are dependent

The p-value is below 0.05. Therefore, we conclude that the sepal measurements (width and length) and the petal measurements (width and length) are dependent.
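To give a feel for what such a dependence statistic measures, here is a minimal base R sketch. It is an illustration under stated assumptions, not the package's code: ball_covariance is a hypothetical helper, it uses Euclidean distances and no weighting, and it follows the empirical ball covariance as we understand it from the literature (for each ball defined by a pair of observations, compare the joint share of points inside the ball in both spaces with the product of the two marginal shares). Its value need not match the bcov value printed by bcov.test.

```r
# Sketch of an unweighted empirical ball covariance (assumed form, not package code).
ball_covariance <- function(x, y) {
  dx <- as.matrix(dist(x))  # pairwise Euclidean distances in the x space
  dy <- as.matrix(dist(y))  # pairwise Euclidean distances in the y space
  n <- nrow(dx)
  total <- 0
  for (i in seq_len(n)) {
    # in_x[k, j] is TRUE when observation k lies in the closed ball
    # centred at observation i with radius d(i, j); likewise for in_y.
    in_x <- outer(dx[i, ], dx[i, ], "<=")
    in_y <- outer(dy[i, ], dy[i, ], "<=")
    joint <- colMeans(in_x & in_y)
    total <- total + sum((joint - colMeans(in_x) * colMeans(in_y))^2)
  }
  total / n^2
}

sepal <- iris[, c("Sepal.Width", "Sepal.Length")]
petal <- iris[, c("Petal.Width", "Petal.Length")]
stat <- ball_covariance(sepal, petal)
print(stat)
```

A p-value could then be obtained by recomputing the statistic after randomly permuting the rows of one of the two samples, which breaks any dependence between them while keeping both marginal distributions intact.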


We generate a dataset and demonstrate the usage of the bcorsis function as follows.

## simulate an ultra-high-dimensional dataset:
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
error <- rnorm(n)
y <- 3*x[, 1] + 5*(x[, 3])^2 + error
## BCor-SIS procedure:
res <- bcorsis(y = y, x = x)
head(res[["ix"]], n = 5)

In this example, the result is:

# [1]    3    1 1601   20  429

The bcorsis result shows that the first and the third variables are the two most important among the 3000 explanatory variables, which is consistent with the simulation settings.
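The screening idea itself can be sketched generically: score every predictor against the response with a marginal utility, then keep the top-ranked columns. The base R sketch below is not the bcorsis implementation. screen_features and abs_pearson are hypothetical names, the default number of retained variables floor(n / log(n)) is a common convention in the SIS literature and is assumed here, and absolute Pearson correlation is only a stand-in utility. Pearson correlation would not be a good choice for the simulation above, because the purely quadratic term 5*(x[, 3])^2 is uncorrelated with x[, 3] when x[, 3] is symmetric around zero, which is the kind of signal a non-parametric utility such as ball correlation is meant to pick up.

```r
# Generic marginal screening (sketch): rank columns of x by a utility and keep d.
# Assumption: d defaults to floor(n / log(n)), a common SIS convention.
screen_features <- function(x, y, utility, d = floor(nrow(x) / log(nrow(x)))) {
  scores <- apply(x, 2, utility, y)  # utility(column, y) for each column
  order(scores, decreasing = TRUE)[seq_len(d)]
}

# Stand-in utility; NOT ball correlation.
abs_pearson <- function(x_j, y) abs(cor(x_j, y))

# Toy data with one strong linear signal in column 7.
set.seed(1)
n <- 100
p <- 50
x <- matrix(rnorm(n * p), nrow = n)
y <- 2 * x[, 7] + rnorm(n, sd = 0.1)

kept <- screen_features(x, y, abs_pearson)
head(kept, n = 5)
```

Any function that takes one predictor column and the response and returns a single score can be swapped in for abs_pearson.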

If you find any bugs or experience any crashes, please report them to us. If you have any questions, just ask; we won't bite.




Ball 1.3.7

  • Modify ambiguous arguments of bcov.test, bd.test, bcorsis
  • Modify document

Ball 1.3.6

  • Formula interface for bd.test and bcov.test
  • Optimize the package dependency

Ball 1.3.5

  • Faster implementation of mutual independence test
  • Multi-thread support for the test of mutual independence
  • Modify document

Ball 1.3.0

  • Add a KBD statistic designed to detect distribution differences when some of the group distributions are identical (set kbd.type = "maxsum")
  • OpenMP-based multi-thread support for KBD
  • Optimized OpenMP parallelism

Ball 1.2.0

  • Speed up feature screening.
  • Speed up mutual independence test.
  • Another K-sample test statistic (set kbd.type = "max") is implemented. It is good at detecting distribution differences when some of the group distributions are identical.
  • Add components to the output lists of bcov.test, bd.test, and bcorsis so that users do not need to re-run them to obtain the test results for different statistics.

Ball 1.1.0

  • Bug fix
  • OpenMP-based multi-thread support for bd.test and bcov.test
  • Speed up feature screening for survival data
  • Implement angular metric for compositional data

Ball 1.0.0

  • Initial CRAN version



1.3.8 by Jin Zhu, 16 days ago

Authors: Xueqin Wang, Wenliang Pan, Heping Zhang, Hongtu Zhu, Yuan Tian, Weinan Xiao, Chengfeng Liu, Ruihuang Liu, Jin Zhu

GPL-3 license

Imports utils, gam, survival, mvtnorm

Suggests knitr, rmarkdown, testthat
