This is an extremely fast implementation of a Naive Bayes classifier. This package is
currently the only package that supports a Bernoulli distribution, a Multinomial
distribution, and a Gaussian distribution, making it suitable for binary features,
frequency counts, and numerical features alike. It also supports mixing different event
models in a single classifier. Only numerical variables are allowed; categorical variables
can, however, be transformed into dummy variables and used with the Bernoulli distribution.

This implementation offers a large performance gain over other implementations in R.
Execution times were compared on a data set of tweets, and this package was found to be
around 283 to 34,841 times faster for the Bernoulli event model, 17 to 60 times faster for
the Multinomial model, and between 2.8 and 1,679 times faster for the Gaussian
distribution. See the vignette for more details. The implementation is largely based on the
paper "A comparison of event models for Naive Bayes anti-spam e-mail filtering" by
K.M. Schneider (2003).
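
To illustrate what mixing event models means, the sketch below combines a Bernoulli block and a Gaussian block by adding their per-class log-likelihoods to the log prior, which is what the naive independence assumption implies. It uses only base R. The helper names (`bernoulli_loglik`, `gaussian_loglik`) and the toy parameters are hypothetical and are not the package's internals.

```r
# Hypothetical helpers; illustrative only, not the fastNaiveBayes internals.

# Log-likelihood of binary features x under per-feature probabilities p.
bernoulli_loglik <- function(x, p) {
  sum(x * log(p) + (1 - x) * log(1 - p))
}

# Log-likelihood of numeric features x under per-feature means and sds.
gaussian_loglik <- function(x, mu, sigma) {
  sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Toy per-class parameters (assumed values, not estimated from data).
params <- list(
  High = list(prior = 0.3, p = c(0.9, 0.2), mu = 80,  sigma = 15),
  Low  = list(prior = 0.7, p = c(0.4, 0.6), mu = 160, sigma = 60)
)

# One observation: two binary features and one numeric feature.
x_bin <- c(1, 0)
x_num <- 95

# Under the naive independence assumption the blocks add up in log space.
log_post <- sapply(params, function(cl) {
  log(cl$prior) +
    bernoulli_loglik(x_bin, cl$p) +
    gaussian_loglik(x_num, cl$mu, cl$sigma)
})

# Normalise to posterior probabilities (subtract the max for stability).
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)
print(round(posterior, 3))
```

Working in log space avoids underflow when many features are multiplied together, which is why the blocks are summed rather than multiplied.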

Any issues can be submitted to: https://github.com/mskogholt/fastNaiveBayes/issues.

Install the package with:

install.packages("fastNaiveBayes")

Or install the development version using devtools with:

library(devtools)
devtools::install_github("mskogholt/fastNaiveBayes")

rm(list = ls())
library(fastNaiveBayes)

cars <- mtcars
y <- as.factor(ifelse(cars$mpg > 25, 'High', 'Low'))
x <- cars[, 2:ncol(cars)]

# Detect which event model suits each column
dist <- fastNaiveBayes::fastNaiveBayes.detect_distribution(x, nrows = nrow(x))
print(dist)

# Mixed event models
mod <- fastNaiveBayes.mixed(x, y, laplace = 1)
pred <- predict(mod, newdata = x)
mean(pred != y)  # misclassification rate on the training data

# Bernoulli only
vars <- c(dist$bernoulli, dist$multinomial)
newx <- x[, vars]
for (i in seq_len(ncol(newx))) {
  newx[[i]] <- as.factor(newx[[i]])
}
# Convert the factors into dummy variables
new_mat <- model.matrix(y ~ . -1, cbind(y, newx))
mod <- fastNaiveBayes.bernoulli(new_mat, y, laplace = 1)
pred <- predict(mod, newdata = new_mat)
mean(pred != y)

# Construct a sparse matrix:
mod <- fastNaiveBayes.bernoulli(new_mat, y, laplace = 1, sparse = TRUE)
pred <- predict(mod, newdata = new_mat)
mean(pred != y)

# OR:
new_mat <- Matrix::Matrix(as.matrix(new_mat), sparse = TRUE)
mod <- fastNaiveBayes.bernoulli(new_mat, y, laplace = 1)
pred <- predict(mod, newdata = new_mat)
mean(pred != y)

# Multinomial only
vars <- c(dist$bernoulli, dist$multinomial)
newx <- x[, vars]
mod <- fastNaiveBayes.multinomial(newx, y, laplace = 1)
pred <- predict(mod, newdata = newx)
mean(pred != y)

# Gaussian only
vars <- c('hp', dist$gaussian)
newx <- x[, vars]
mod <- fastNaiveBayes.gaussian(newx, y)
pred <- predict(mod, newdata = newx)
mean(pred != y)

- Added a threshold to all predict functions to ensure a minimum probability
- Added tweets and tweetsDTM datasets as example data and for time comparisons
- Changed Gaussian model to achieve a huge speed-up
- Removed inefficiencies in both the Bernoulli and Multinomial models, which are now much faster.
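
The minimum-probability threshold in the first item can be pictured as a floor applied to the conditional probabilities before logs are taken. The sketch below is a base R illustration under that assumption. `apply_threshold` and the value `0.001` are hypothetical, and the package's own predict functions may apply the threshold differently.

```r
# Hypothetical helper: floor probabilities at a minimum value.
apply_threshold <- function(probs, threshold = 0.001) {
  pmax(probs, threshold)
}

# Conditional probabilities for one observation; one feature has probability 0.
probs <- c(0.20, 0.00, 0.75)

# Without a floor, a single zero drives the class score to -Inf.
raw_score <- sum(log(probs))

# With the floor, the zero is replaced and the score stays finite.
floored_score <- sum(log(apply_threshold(probs)))

print(c(raw = raw_score, floored = floored_score))
```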

- Fixed errors that were thrown with 2x1 matrices

- Removed std_threshold from the Gaussian model; it is no longer necessary since the introduction of the threshold feature above
- Changed comparison to other packages in vignette

- Detect distribution: automatically determines the distribution of each column of a matrix for use with the mixed Naive Bayes model
- A threshold for the standard deviation in the Gaussian event model. This ensures that probabilities are real numbers and not NaN because of a standard deviation of 0.
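
The NaN problem described in the last item is easy to reproduce when the Gaussian density is computed directly from its formula. The sketch below, in base R, shows the failure and how flooring the standard deviation avoids it. `gaussian_density` and the floor value `0.01` are hypothetical and are not taken from the package source.

```r
# Gaussian density written out from its formula (hypothetical helper).
gaussian_density <- function(x, mu, sigma, std_threshold = 0) {
  sigma <- pmax(sigma, std_threshold)  # floor the standard deviation
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

# A feature that is constant within a class has a standard deviation of 0.
values <- c(4, 4, 4)
mu <- mean(values)
sigma <- sd(values)

# Without a floor the formula divides zero by zero.
no_floor <- gaussian_density(4, mu, sigma)

# With a floor on the standard deviation the density is a real number.
with_floor <- gaussian_density(4, mu, sigma, std_threshold = 0.01)

print(c(no_floor = no_floor, with_floor = with_floor))
```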

- Expanded unit tests.
- Changed comparison to other packages in vignette
- Small change to the Bernoulli predict function

- Fixed bug in Gaussian predict function.

- Changed Readme
- Changed description
- Added unit tests and Travis-ci

Initial release of the package