Apache Ignite Documentation


Critical Failures Handling

Overview

Apache Ignite is a robust and fault-tolerant system. But in the real world, unpredictable issues can arise that render an Ignite node inoperable and, consequently, affect the state of the whole cluster. Such issues can be detected at runtime and handled according to a preconfigured failure handler.

Critical Failures

The following failures are treated as critical:

  • System critical errors (e.g. OutOfMemoryError).
  • Unintentional system worker termination (e.g. due to an unhandled exception).
  • System workers hanging.
  • Cluster nodes segmentation.

A system critical error is an error that leads to the system's inoperability. For example:

  • File I/O errors - usually an IOException thrown by file read/write operations. This can happen when Ignite persistence is enabled (e.g., when no space is left on the device or a device error occurs), and also in in-memory mode, because Ignite uses disk storage for keeping some metadata (e.g., when the file descriptor limit is exceeded or file access is prohibited).
  • Out of memory error - when Durable Memory fails to allocate more space (IgniteOutOfMemoryException).
  • Out of memory error - when a cluster node runs out of Java heap (OutOfMemoryError).

Failure handling

When Ignite detects a critical failure, it handles the failure according to a preconfigured failure handler. The failure handler can be configured as follows:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeFailureHandler"/>
    </property>
</bean>
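The same handler can be set programmatically via IgniteConfiguration.setFailureHandler. A minimal Java sketch (the class name and the node-start call are illustrative, not part of the XML above):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class FailureHandlerExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Stop the node whenever a critical failure is detected.
        cfg.setFailureHandler(new StopNodeFailureHandler());

        try (Ignite ignite = Ignition.start(cfg)) {
            // Node is running with the configured failure handler.
        }
    }
}
```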

Ignite supports the following failure handlers:

  • NoOpFailureHandler - ignores any failure. Useful for tests and debugging.
  • RestartProcessFailureHandler - a specific implementation that can be used only with the ignite.sh|bat startup scripts. The process is terminated via an Ignition.restart(true) call.
  • StopNodeFailureHandler - stops the Ignite node in case of a critical error via an Ignition.stop(true) or Ignition.stop(nodeName, true) call.
  • StopNodeOrHaltFailureHandler - the default handler; it tries to stop the node and, if the node can't be stopped, terminates the JVM process. Parameters:

      • boolean tryStop - if true, the handler first tries to stop the node gracefully. Default: false.
      • long timeout - node stop timeout, in milliseconds. Default: 0.
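For instance, the default handler can be tuned through its two-argument constructor. A hedged Java sketch (the 30-second timeout is an arbitrary illustrative value):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class HaltHandlerExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Try to stop the node gracefully first (tryStop = true);
        // halt the JVM if the node hasn't stopped within 30 seconds.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 30_000));
    }
}
```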

Critical Workers Health Check

Ignite has a number of internal workers that are essential for the cluster to function correctly. If one of them is terminated, the Ignite node can become inoperative.

The following system workers are considered mission critical:

  • Discovery worker - discovery events handling.
  • TCP communication worker - peer-to-peer communication between nodes.
  • Exchange worker - partition map exchange.
  • Workers of the system striped pool.
  • Data Streamer striped pool workers.
  • Timeout worker - timeouts handling.
  • Checkpoint thread - check-pointing in Ignite persistence.
  • WAL workers - write-ahead logging, segments archiving and compression.
  • Expiration worker - TTL based expirations.
  • NIO workers - base networking.

Ignite has an internal mechanism for verifying that critical workers are operational. Each worker is regularly checked to confirm that it is alive and is updating its heartbeat timestamp. If either condition is not met for the configured period of time, the worker is regarded as blocked, and Ignite outputs that information to the log file. The period of inactivity is specified by the IgniteConfiguration.systemWorkerBlockedTimeout property (in milliseconds; the default value equals the failure detection timeout).

Situations when critical threads stop responding are considered critical errors but, by default, they are not processed by the configured failure handler. If you want the failure handler to process them the way it processes other critical errors, set the handler's ignoredFailureTypes property to an empty value, as shown below. The empty value is needed because, by default, this failure type is on the handler's ignore list, and the only way to remove it is to empty the list.

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    
    <property name="systemWorkerBlockedTimeout" value="#{60 * 60 * 1000}"/>
    
    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeFailureHandler">
          
          <!-- uncomment to enable this handler to 
           process critical workers' hung-ups -->
          <!--property name="ignoredFailureTypes">
            <list>
            </list>
          </property-->
         
      </bean>
        	
    </property>
</bean>
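The equivalent configuration can be done in Java via setSystemWorkerBlockedTimeout and the handler's setIgnoredFailureTypes method. A minimal sketch (the class name is illustrative):

```java
import java.util.Collections;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class WorkerLivenessExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Regard a system worker as blocked after 1 hour of inactivity.
        cfg.setSystemWorkerBlockedTimeout(60 * 60 * 1000);

        StopNodeFailureHandler hnd = new StopNodeFailureHandler();

        // Empty the ignore list so that blocked system workers are
        // handled like any other critical failure.
        hnd.setIgnoredFailureTypes(Collections.emptySet());

        cfg.setFailureHandler(hnd);
    }
}
```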
 

Critical workers' liveness check can also be configured via the FailureHandlingMxBean JMX MBean.
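For example, the liveness check can be toggled at runtime through the platform MBean server. A hedged sketch: the wildcard ObjectName query and the "LivenessCheckEnabled" attribute name are assumptions about how the MBean is registered in a given deployment, so verify them against your node's actual JMX tree:

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class LivenessCheckJmx {
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Look up the FailureHandlingMxBean registered by the local node.
        // The ObjectName pattern below is an assumption; the exact domain
        // and key properties vary with the Ignite instance name.
        for (ObjectName name : srv.queryNames(
                new ObjectName("org.apache:*,name=FailureHandlingMxBeanImpl"), null)) {
            // Disable the liveness check, e.g. before attaching a debugger.
            srv.setAttribute(name, new Attribute("LivenessCheckEnabled", false));
        }
    }
}
```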
