
Distributed Training

State of this feature

This feature depends upon the IGFS plugin. For further details, please see the documentation page.

Overview

Distributed training allows the computational resources of the whole cluster to be used, speeding up the training of deep learning models. TensorFlow is a machine learning framework that natively supports distributed training of neural networks, as well as inference and other computations. The main idea behind distributed neural network training is that the gradient of the loss function (for example, a sum of squared errors) can be calculated on every partition of the data (in terms of horizontal partitioning), and the per-partition gradients can then be summed to obtain the gradient of the loss function over the whole dataset:
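In symbols (a sketch, assuming the total loss L is the sum of the per-partition losses L_i over n partitions, with θ denoting the model parameters), this is simply the linearity of the gradient:

```latex
\nabla_{\theta} L(\theta)
  = \nabla_{\theta} \sum_{i=1}^{n} L_i(\theta)
  = \sum_{i=1}^{n} \nabla_{\theta} L_i(\theta)
```

Each term on the right-hand side can be computed locally on the node that holds partition i.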

Using this property, we can calculate gradients on the nodes where the data is stored, reduce them, and only then update the model parameters. This avoids transferring the training data between nodes and so prevents network bottlenecks.

Apache Ignite uses horizontal partitioning to store data in a distributed cluster. When we create an Apache Ignite cache (or table, in SQL terms), we can specify the number of partitions the data will be split across. For example, if an Apache Ignite cluster consists of 10 machines and we create a cache with 10 partitions, then every machine will maintain approximately one data partition.
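As a rough illustration of the idea (a minimal sketch only: Apache Ignite actually assigns keys to partitions with a rendezvous/HRW affinity function, not the naive modulo hashing used here):

```python
# Simplified model of horizontal partitioning: every key is assigned to one
# of NUM_PARTITIONS partitions, and partitions are spread over cluster nodes.
NUM_PARTITIONS = 10
NUM_NODES = 10

def partition_for(key):
    # Ignite's real affinity function is rendezvous hashing; modulo hashing
    # is used here only to illustrate the mapping.
    return hash(key) % NUM_PARTITIONS

def node_for(partition):
    # With 10 partitions on 10 nodes, each node holds roughly one partition.
    return partition % NUM_NODES

for key in ("row-1", "row-2", "row-3"):
    p = partition_for(key)
    print(f"{key} -> partition {p} -> node {node_for(p)}")
```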

Distributed training in TensorFlow on Apache Ignite is based upon the standalone client mode of distributed multi-worker training. Standalone client mode assumes a cluster of workers with running TensorFlow servers and a client that contains the actual model code. When the client calls tf.estimator.train_and_evaluate, TensorFlow uses the specified distribution strategy to distribute computations across the workers, so that the most computationally intensive part runs on the workers.
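The sketch below shows the general shape of such a client script (TensorFlow 1.x style; model_fn, input_fn, and the choice of ParameterServerStrategy are illustrative assumptions, not part of the Apache Ignite integration itself):

```python
import tensorflow as tf

def input_fn():
    # Toy in-memory dataset; in the real setup each worker would read its
    # local Apache Ignite partition instead.
    features = [[0.0], [1.0], [2.0], [3.0]]
    labels = [0, 1, 1, 0]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

def model_fn(features, labels, mode):
    # A deliberately tiny model: a single dense layer with two outputs.
    logits = tf.layers.dense(features, units=2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# The distribution strategy tells TensorFlow how to spread the most
# computationally intensive part of the job across the workers.
run_config = tf.estimator.RunConfig(
    train_distribute=tf.contrib.distribute.ParameterServerStrategy())

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=input_fn))
```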

Standalone client mode on Apache Ignite

In the case of TensorFlow on Apache Ignite, one of the most important goals is to avoid redundant data transfers and to utilize data partitioning, which is a core concept of Apache Ignite (Apache Ignite provides so-called zero-ETL access to the data). To achieve this goal, TensorFlow workers are started and maintained on the nodes where the data is stored. The following diagram illustrates this idea:

As we can see in the diagram, the MNIST cache in the Apache Ignite cluster is distributed across 8 servers (one partition per server). In addition to maintaining a data partition, each server maintains a TensorFlow worker. Each worker is configured to have access to its local data only (this "stickiness" is achieved by setting environment variables, as sketched below).
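TF_CONFIG is the standard TensorFlow mechanism for describing the cluster to each process, and Apache Ignite sets such variables when it launches each worker. The sketch below shows the shape of that configuration (the IGNITE_* variable names are hypothetical placeholders, not the actual names used by the integration):

```python
import json
import os

# Standard TensorFlow cluster description: every process learns the full
# cluster layout and its own role from the TF_CONFIG environment variable.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief":  ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],  # one worker per data node
    },
    "task": {"type": "worker", "index": 0},
})

# Hypothetical placeholders for the "stickiness" settings: each worker is
# pointed at the partition(s) held by its local Apache Ignite node only.
os.environ["IGNITE_HOST"] = "localhost"   # assumed name, for illustration
os.environ["IGNITE_PARTITION"] = "0"      # assumed name, for illustration
```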

Unlike classic standalone client mode in TensorFlow, in TensorFlow on Apache Ignite the client process is also started inside the Apache Ignite cluster, as a service. This allows Apache Ignite to automatically restart training after any failure or data rebalancing event.

When initialization is completed and the TensorFlow cluster is configured, Apache Ignite doesn't interfere with TensorFlow's work; it restarts the cluster only in case of failures or data rebalancing events. During normal operation, the whole infrastructure can be viewed as shown in the following diagram:
