Clustering Big Data using Expectation Maximization Star (EM*) Algorithm

Implements the improved Expectation Maximization (EM*) algorithm and the traditional EM algorithm for clustering big data (Gaussian mixture models for both multivariate and univariate datasets). This version implements the faster alternative, EM*, which expedites convergence via structure-based data segregation. The implementation supports both random and K-means++ based initialization. References: Parichit Sharma, Hasan Kurban, Mehmet Dalkilic (2022); Hasan Kurban, Mark Jenne, Mehmet Dalkilic (2016).

Package Overview

Implements the Expectation Maximization algorithm for clustering multivariate and univariate datasets. The package has been tested with numerical datasets and is not recommended for categorical or ordinal data. It comes bundled with a dataset for demonstration (ionosphere_data.csv). For more help, type ?DCEM in the R console after installing the package.

Data imputation is currently not supported; the user must handle missing data before using the package.


For any Bug Fixes/Feature Update(s)

[Parichit Sharma: [email protected]edu]

For Reporting Issues


GitHub Repository Link

Github Repository

Installation Instructions

Installing from CRAN
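
A minimal sketch of a CRAN install, assuming the package is published under the name DCEM (the name used by the ?DCEM help topic); adjust the name if the CRAN listing differs.

```r
# Assumption: the CRAN package name is "DCEM".
# Install only if the package is not already available, then load it.
if (!requireNamespace("DCEM", quietly = TRUE)) {
  install.packages("DCEM")
}
library(DCEM)
```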


Installing from the Binary Package

install.packages("dcem_1.0.0.tgz", repos = NULL, type = "source")

How to use the package (An Example: working with the default bundled dataset)

  • The dcem package comes bundled with ionosphere_data.csv for demonstration. Help about the dataset can be seen by typing ?ionosphere_data in the R console. Additional details can be seen at the link Ionosphere data

  • To use this dataset, paste the following code into the R console.

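# Read the bundled CSV from the data/ folder under the working directory.
# The file is comma-separated, so sep overrides the read.csv2 default (";").
ionosphere_data = read.csv2(
  file = paste(trimws(getwd()), "/data/", "ionosphere_data.csv", sep = ""),
  sep = ",",
  header = FALSE,
  stringsAsFactors = FALSE
)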
  • Cleaning the data: Before the model can be trained with the dcem_train() function, the data must be cleaned. This simply means removing all redundant columns (for example, a label column). This dataset contains labels in the last column (35th) and only 0's in the 2nd column, so let's remove them.

Paste the below code in the R session to clean the dataset.

ionosphere_data =  trim_data("35,2", ionosphere_data)
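
To make the cleaning step concrete, here is a minimal base R sketch of dropping the 35th and 2nd columns from a data frame. It runs on a small made-up data frame (`toy` is hypothetical, not the bundled dataset) and only illustrates the idea of removing columns; it is not a description of how trim_data() works internally.

```r
# Hypothetical stand-in for the bundled data: 5 rows, 35 columns.
set.seed(1)
toy <- as.data.frame(matrix(rnorm(5 * 35), nrow = 5))
toy[[2]]  <- 0                           # a constant column, like column 2
toy[[35]] <- c("g", "b", "g", "g", "b")  # a label column, like column 35

# Drop columns 35 and 2 by negative indexing; drop = FALSE keeps a data frame.
cleaned <- toy[, -c(35, 2), drop = FALSE]
dim(cleaned)
```

Negative indices remove columns by position, so the order in which 35 and 2 are listed does not matter.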
  • Clustering the data: The dcem_train() function learns the parameters of the Gaussian(s) from the input data. It internally calls the dcem_cluster_mv() or dcem_cluster_uv() function for multivariate and univariate data respectively. These functions assign probabilistic weights to the samples in the dataset.

Paste the below code in the R session to call the dcem_train() function.

dcem_out = dcem_train(data = ionosphere_data, threshold = 0.0001, iteration_count = 50, num_clusters = 2)
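
To illustrate what "probabilistic weights" means here, the following is a minimal base R sketch of a single expectation step for a two-component univariate Gaussian mixture. All values (`x`, `means`, `sds`, `priors`) are made up for illustration, and this is a textbook EM step written under those assumptions, not the package's own implementation.

```r
# Toy samples and hypothetical current parameter estimates.
x      <- c(-2.1, -1.9, 0.1, 1.8, 2.2)
means  <- c(-2, 2)
sds    <- c(1, 1)
priors <- c(0.5, 0.5)

# Prior-weighted density of every sample under every component.
# Result: one row per sample, one column per component.
dens <- sapply(seq_along(means), function(k) {
  priors[k] * dnorm(x, mean = means[k], sd = sds[k])
})

# Normalize each row so that the weights of a sample sum to 1.
weights <- dens / rowSums(dens)
weights
```

Each row of `weights` holds the posterior probability that the sample belongs to each component; samples near -2 lean strongly toward the first component and samples near 2 toward the second.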
  • Accessing the output: The list returned by dcem_train() is stored in the dcem_out object. It contains the parameters associated with the clusters (Gaussians): the posterior probabilities, the means, the covariance matrices (multivariate data) or standard deviations (univariate data), and the priors. Use the following names in the R session to access any or all of the output parameters.
          [1] Posterior probabilities: `dcem_out$prob`: a matrix of posterior probabilities for the
              points in the dataset.
          [2] Mean(s): `dcem_out$mean`
              For multivariate data: a matrix of means for the Gaussians. Each row in the
              matrix corresponds to the mean of one Gaussian.
              For univariate data: a vector of means. Each element of the vector corresponds
              to one Gaussian.
          [3] Covariance matrices or standard deviations
              For multivariate data: `dcem_out$cov`: a list of covariance matrices for the Gaussians.
              For univariate data: `dcem_out$sd`: a vector of standard deviations
              for the Gaussians.
          [4] Priors: `dcem_out$prior`: a vector of priors for the Gaussians.
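
A common next step is to turn posterior probabilities into hard cluster labels. The sketch below does this on a small made-up matrix (`post` is hypothetical) and assumes one row per point and one column per cluster. Check the orientation of `dcem_out$prob` before applying the same idea to real output, and transpose it with t() if clusters are stored in rows.

```r
# Hypothetical posterior matrix: 3 points (rows) x 2 clusters (columns).
post <- matrix(
  c(0.9, 0.1,
    0.2, 0.8,
    0.6, 0.4),
  ncol = 2, byrow = TRUE
)

# For each point, pick the cluster with the highest posterior probability.
labels <- max.col(post, ties.method = "first")
labels

# Count how many points fall into each cluster.
table(labels)
```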

How to access the help (after installing the package)



DCEM 0.0.1

This is the first stable release of the DCEM package.

Major Features

Supports clustering of both multivariate and univariate data for finite Gaussian mixture models.


2.0.5 by Sharma Parichit, 4 days ago

Authors: Sharma Parichit [aut, cre, ctb] , Kurban Hasan [aut, ctb] , Dalkilic Mehmet [aut]

GPL-3 license

Imports mvtnorm, matrixcalc, MASS, Rcpp

Suggests knitr, rmarkdown

Linking to Rcpp
