Apache Ignite Documentation


K-Means Clustering

The Apache Ignite Machine Learning component provides a K-Means clustering algorithm implementation.

Model

K-Means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean; that mean serves as the prototype of the cluster.

The model holds a vector of k centers and one of the distance metrics provided by the ML framework such as Euclidean, Hamming or Manhattan.

It enables predictions for a given vector of features in the following way:

KMeansModel mdl = ...;

double prediction = mdl.predict(observation);
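Conceptually, a model that holds k centers and a distance metric can predict by returning the index of the center closest to the observation. The following is a minimal plain-Java sketch of that idea, not Ignite's implementation: the class NearestCenterSketch and its methods are hypothetical names introduced only for illustration.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch: a model that stores k centers and a pluggable distance metric. */
final class NearestCenterSketch {
    private final double[][] centers;
    private final ToDoubleBiFunction<double[], double[]> distance;

    NearestCenterSketch(double[][] centers, ToDoubleBiFunction<double[], double[]> distance) {
        this.centers = centers;
        this.distance = distance;
    }

    /** Returns the index of the center closest to the observation. */
    int predict(double[] observation) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance.applyAsDouble(centers[i], observation);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    /** Euclidean distance: square root of the sum of squared differences. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan distance: sum of absolute differences. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum;
    }
}
```

Passing the metric as a function mirrors the idea that the distance is a configurable part of the model; swapping NearestCenterSketch::euclidean for NearestCenterSketch::manhattan can change which center is considered nearest.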

Trainer

KMeans is an unsupervised learning algorithm. It solves a clustering task: grouping a set of objects so that objects in the same group (called a cluster) are more similar, in some sense, to each other than to objects in other groups.

KMeans is a parameterized iterative algorithm: on each iteration, it assigns every observation to its nearest centroid and then recomputes each centroid as the mean of the observations in that cluster.
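A single iteration of this kind can be sketched in plain Java as follows. This is an illustrative sketch under simple assumptions (dense double arrays, squared Euclidean distance, empty clusters keep their previous centroid), not the code Ignite runs; KMeansStepSketch and step are hypothetical names.

```java
/** Hypothetical sketch of one K-Means iteration (not Ignite's implementation). */
final class KMeansStepSketch {
    /** Assigns every point to its nearest centroid, then returns the per-cluster means. */
    static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        for (double[] p : points) {
            int nearest = 0;
            double best = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int j = 0; j < dim; j++) {
                    double diff = p[j] - centroids[c][j];
                    d += diff * diff; // squared distance is enough for comparison
                }
                if (d < best) {
                    best = d;
                    nearest = c;
                }
            }
            counts[nearest]++;
            for (int j = 0; j < dim; j++)
                sums[nearest][j] += p[j];
        }

        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                // Assumption of this sketch: an empty cluster keeps its old centroid.
                updated[c] = centroids[c].clone();
                continue;
            }
            updated[c] = new double[dim];
            for (int j = 0; j < dim; j++)
                updated[c][j] = sums[c][j] / counts[c];
        }
        return updated;
    }
}
```

For example, with points (0,0), (0,2), (10,10), (10,12) and starting centroids (1,1) and (9,9), one call to step moves the centroids to (0,1) and (10,11).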

Presently, Ignite supports the following parameters for the KMeans clustering algorithm:

  • k - the number of clusters
  • maxIterations - one stopping criterion (the other is epsilon)
  • epsilon - the convergence delta (the difference between the old and new centroid values)
  • distance - one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan
  • seed - an initialization parameter that helps to reproduce models (the trainer has a random initialization step to get the first centroids)

// Set up the trainer
KMeansTrainer trainer = new KMeansTrainer()
   .withDistance(new EuclideanDistance())
   .withK(AMOUNT_OF_CLUSTERS)
   .withMaxIterations(MAX_ITERATIONS)
   .withEpsilon(PRECISION);

// Build the model
KMeansModel mdl = trainer.fit(
  datasetBuilder,
  featureExtractor,
  labelExtractor
);
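To make the roles of k, maxIterations, epsilon, and seed more concrete, here is a small one-dimensional sketch in plain Java. It is not Ignite's trainer: the class KMeansLoopSketch, the choice of initial centroids (random data points, possibly repeated), and the convergence test (largest centroid shift below epsilon) are all assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical 1-D sketch of how k, maxIterations, epsilon, and seed interact. */
final class KMeansLoopSketch {
    static double[] fit(double[] data, int k, int maxIterations, double epsilon, long seed) {
        // A fixed seed yields the same initial centroids, hence a reproducible model.
        Random rnd = new Random(seed);
        double[] centroids = new double[k];
        for (int c = 0; c < k; c++)
            centroids[c] = data[rnd.nextInt(data.length)];

        // First stopping criterion: never run more than maxIterations iterations.
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[k];
            int[] counts = new int[k];
            for (double x : data) {
                int nearest = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(x - centroids[c]) < Math.abs(x - centroids[nearest]))
                        nearest = c;
                sums[nearest] += x;
                counts[nearest]++;
            }

            double maxShift = 0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0)
                    continue; // empty cluster: leave its centroid unchanged
                double updated = sums[c] / counts[c];
                maxShift = Math.max(maxShift, Math.abs(updated - centroids[c]));
                centroids[c] = updated;
            }

            // Second stopping criterion: centroids moved less than epsilon.
            if (maxShift < epsilon)
                break;
        }
        return centroids;
    }
}
```

Calling fit twice with the same data and the same seed returns identical centroids, while a different seed may start from different points and converge to a different result.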

Example

To see how K-Means clustering can be used in practice, try this example that is available on GitHub and delivered with every Apache Ignite distribution.

The training dataset is a subset of the Iris dataset (the classes with labels 1 and 2, which together form a linearly separable two-class dataset). The Iris dataset can be loaded from the UCI Machine Learning Repository.
