Updated 12 April 2016

ODM Advanced – Decision Server Insights is designed for high availability. When properly configured, DSI can endure the loss of any single hardware node or software server in the topology, and continue to operate normally.

Most of the time it’s possible to recover from error situations without taking the cluster down and interrupting processing. This document discusses which portions of a DSI topology need to be restarted to recover from various error situations. The goal is to restart the smallest portion of the topology needed to recover from a problem, and no more. In some cases it may be necessary to stop the runtime cluster, and start it again with a preload from persisted data.

Sequentially shutting down and starting the servers in a DSI topology is a good method to upgrade each server while continuing to process events. However, such a “ripple restart” is not an effective way to recover from error situations, and it may actually introduce additional problems.


Topology HA

Background

Other than the backing database, a DSI topology consists of four types of servers. Each server should be a separate Liberty instance. (The requirements for high availability are listed with each server type below.)

  • Catalog servers (2 or 3 servers)
  • Inbound Connectivity servers (1 more server than required to handle the load)
  • Outbound Connectivity servers (1 more server than required to handle the load)
  • Runtime servers (1 more server than required to handle the load, minimum 3 servers, preferably at least 4 servers)

To achieve high availability, runtime servers must be distributed across at least 3 nodes (preferably 4), and all other server types must be distributed across at least 2 nodes. Also, catalog servers and runtime servers must not be hosted on the same node. (“Server” refers to an instance of Liberty, and “node” refers to a computer on which servers run. See Configuring a DSI Topology for Testing or Production for more information.)

The set of runtime servers constitutes the DSI “cluster”. Connectivity servers may be co-located on the same node with either a runtime server or a catalog server. For simplicity, and to preserve PVUs for event processing, it’s best to co-locate each catalog server instance on the same node as the connectivity server instances.
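
For illustration, the sketch below shows one way the Liberty server instances for a small HA layout might be created. It assumes the DSI-provided Liberty server templates (cisCatalog, cisContainer, cisInbound, cisOutbound) are available, and the node groupings and server names are placeholders for your own conventions, so treat the details as assumptions and check the Knowledge Center for the exact template names.

    # Nodes A and B: one catalog server plus connectivity servers each (placeholder names).
    server create catalog1 --template=cisCatalog
    server create inbound1 --template=cisInbound
    server create outbound1 --template=cisOutbound

    # Nodes C, D, E and preferably F: one runtime server each.
    server create runtime1 --template=cisContainer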

WebSphere eXtreme Scale (WXS) is used to distribute entity and other data across the cluster in a partitioned grid. Many partitions may be placed in each runtime server, and each partition is replicated both synchronously and asynchronously to nodes hosting other runtime servers. Replication enables fail-over after server failures, and recovery with a minimum of disruption to event processing. The WXS developmentMode setting must be set to false for HA.

DSI recovers from many errors automatically and without manual intervention. If an event being processed encounters an error, processing will be rolled back and retried. If there is an underlying problem in the event data, or if the processing of the event is not successful after multiple retries, the event is logged and abandoned. Processing of other events is unaffected.

General

If there are recurring errors on just one server, the general guidance is to restart just that server. In this case, rebooting the node hosting that server is preferable to just restarting Liberty: it prevents the problem from cascading to another server, and it can clear operating system problems, e.g. in network communication. You could shut down the Liberty server before rebooting, but when a server is already having problems, it may not shut down cleanly. If you have multiple servers per node, restarting all the servers on that node is OK, provided that you have followed the HA guidelines for co-locating servers.
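
As a sketch of that approach, assuming SSH access to the node, that the Liberty bin directory is on the PATH, and that the host and server names (dsi-node3, runtime1) are placeholders:

    # Attempt a clean stop first; if the server hangs, the reboot will clear it anyway.
    ssh dsi-node3 'server stop runtime1'
    ssh dsi-node3 'sudo reboot'
    # Once the node is back up, start the server again.
    ssh dsi-node3 'server start runtime1'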

For situations where the error is not isolated to a single server, the proper action depends on the situation.

Catalog Server Problems

At any point in time one catalog server is the master. Only the master catalog server is required for DSI to run. If the master goes down, another one becomes the master. Catalog servers are quite stable. They usually do not exhibit problems, unless there are network problems.

If there is only one catalog server functioning, and it’s having problems, you can still restart it. The cluster can survive for a short time with no catalog servers at all. While all catalog servers are down, new events cannot be queued for the runtime.

Sequentially restarting all catalog servers should not be required, but if you do that, wait at least 10 minutes between each. Except when shutting down the entire topology, you should not restart catalog servers at the same time that you are restarting runtime servers.
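
If you do restart the catalog servers sequentially, the pacing could look like the hypothetical sketch below, which assumes one catalog server per host and uses placeholder host and server names.

    for host in dsi-cat1 dsi-cat2; do
        ssh "$host" 'server stop catalog1 && server start catalog1'
        sleep 600    # wait at least 10 minutes before restarting the next catalog server
    done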

Connectivity Server Problems

Because connectivity servers are loosely coupled, individual servers do not normally exhibit problems that affect them all. If multiple connectivity servers are exhibiting the same problem, you should check that your endpoints and the runtime cluster are available and accessible.

If you think that there is a common problem across the connectivity servers, you can restart them sequentially. (For inbound connectivity this assumes that your load balancing for HTTP or JMS is properly configured.) Be sure to wait for each server to come fully online, before restarting the next.
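
A sequential restart of, for example, the outbound connectivity servers could look like the sketch below; host and server names are placeholders, and confirming that each server is fully online is a manual check of its log (see the note on readiness at the end of this section).

    for host in dsi-conn1 dsi-conn2; do
        ssh "$host" 'server stop outbound1 && server start outbound1'
        read -p "Confirm that outbound1 on $host is fully online, then press Enter... "
    done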

If only one inbound or outbound connectivity server is operating, you may restart it without restarting the rest of the topology.

  • During the time that there are no outbound connectivity servers operating, emitted events will be queued in the DSI runtime cluster. Those events will be sent to the appropriate endpoints, when the outbound connectivity servers are running again.
  • During the time that there are no inbound connectivity servers operating, new events will no longer reach the DSI runtime cluster. The effect on events that occur during the downtime depends on the type of connection. When a durable JMS connection is used, the events will be queued in the JMS provider, up to the storage capacity of the provider. For events sent through HTTP connectivity, the original source of events must continue to retry during the time that inbound connectivity is down. This will likely stall the process that is sending the events.

An outbound connectivity server is up when it logs the endpoints in its configuration. An inbound connectivity server is up when it begins processing messages.
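
For example, one way to watch for those log entries is to tail the server’s messages.log; the host name, server name, and installation path below are assumptions for a default Liberty layout.

    # Log path assumes a default Liberty user directory under /opt/ibm/wlp; adjust for your installation.
    ssh dsi-conn1 'tail -f /opt/ibm/wlp/usr/servers/outbound1/logs/messages.log'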

Runtime Cluster Problems

Unlike the other types of DSI servers, the runtime servers are tightly coupled. Because they jointly form a memory grid for solution data, problems can sometimes cascade across the cluster.

As with the other types of servers, if a problem is only happening on a single server, just reboot the node containing that server.

If simultaneous problems are occurring on multiple runtime servers, especially grid problems, it’s likely that you will need to restart the entire cluster. A sequential restart should be avoided, as it is very unlikely to address a multiple-runtime-server problem; it will just prolong the downtime. Note that if persistence is not enabled, all application data will be lost when the cluster is restarted.

Since a cluster restart will cause a period of downtime, you want to be sure that it’s really necessary. If there are other problems affecting multiple runtime servers, you’ll want to be sure that event processing is really impaired, before shutting down the topology. (Use the Insights Monitor to see if servers are processing events.) However, if there are problems with the grid, e.g. with partition placement, you will most likely need to shut down the entire cluster.
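
If you suspect a placement problem, one way to inspect it is the WebSphere eXtreme Scale xscmd utility; this is a general WXS tool rather than anything DSI-specific, and the catalog endpoint and grid name below are placeholders, so check the Knowledge Center for the values that apply to your installation.

    # Show where partitions and their replicas are currently placed (placeholder endpoint and grid name).
    xscmd -c showPlacement -cep dsi-cat1:2809 -g MyGrid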

If the cluster is still operating despite the errors, you can continue to run it until a convenient time to restart the cluster. If one server is exhibiting most of the problem, you can shut down the node containing that server immediately and leave it down until a convenient time to restart the entire cluster.

When you need to shut down the runtime cluster, you should consider shutting down and restarting the whole topology. It will take a little longer, but it’s the more conservative approach, in case a problem in the runtime cluster has caused problems in other servers.

If the source of events or the event queue is able to durably buffer events, events should not be lost while the runtime cluster is restarting, but the queue or event source may not have the capacity to buffer events for very long.

Topology Shut Down

To shut down the entire topology, suspend partition movement using the serverManager suspendBalancing command, then shut down the servers in the order below; a sketch of the full sequence follows the list. Make sure that all the servers of one kind are completely shut down before going on to the next type. Since some problems may interfere with a server shutting down cleanly, it may be necessary to shut down the OS as well, which can also help if the problem originated in the OS, e.g. a network issue.

  1. Inbound Connectivity (server stop; kill JVM, if necessary)
  2. Outbound Connectivity (server stop; kill JVM, if necessary)
  3. Runtime (serverManager shutdown, one at a time, if possible, then optionally shut down OS)
  4. Catalog (server stop, if possible, then optionally shut down OS)
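
The following is a minimal sketch of that sequence. It assumes SSH access, one server of each type per host, placeholder host and server names, and that the Liberty and DSI tool directories are on the PATH; the exact arguments to serverManager (host, port, credentials) depend on your installation, so consult the Knowledge Center for the full syntax.

    serverManager suspendBalancing              # suspend partition movement first
    for host in dsi-conn1 dsi-conn2; do         # 1. inbound connectivity
        ssh "$host" 'server stop inbound1'
    done
    for host in dsi-conn1 dsi-conn2; do         # 2. outbound connectivity
        ssh "$host" 'server stop outbound1'
    done
    serverManager shutdown                      # 3. runtime servers, one at a time if possible
    for host in dsi-cat1 dsi-cat2; do           # 4. catalog servers
        ssh "$host" 'server stop catalog1'
    done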

After all of the servers have been shut down, start them up following the instructions below.

Topology Start Up

To prevent too much partition placement during startup, the WXS numInitialContainers setting should be set somewhere between 3 and the number of runtime servers in your topology. Note that when the runtime servers are started they will go into preload. At this point you can use “dataloadManager autoload” to load the data from the database into the grid, as documented in the Knowledge Center. Wait for the “Solution is ready” or “System preload has completed” message before starting the connectivity servers. A sketch of the full start-up sequence follows the list below.

  1. Catalog (boot OS, if necessary, then server start)
  2. Runtime (boot OS, if necessary, then server start)
  3. Outbound Connectivity (boot OS, if necessary, then server start)
  4. Inbound Connectivity (boot OS, if necessary, then server start)
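
A matching sketch of the start-up sequence, with the same placeholder names and the same caveat that the dataloadManager arguments depend on your installation:

    for host in dsi-cat1 dsi-cat2; do               # 1. catalog servers
        ssh "$host" 'server start catalog1'
    done
    for host in dsi-node3 dsi-node4 dsi-node5; do   # 2. runtime servers (they enter preload)
        ssh "$host" 'server start runtime1'
    done
    dataloadManager autoload                        # load persisted data from the database into the grid
    # Wait for the "Solution is ready" / "System preload has completed" messages,
    # then bring up connectivity.
    for host in dsi-conn1 dsi-conn2; do             # 3. outbound connectivity
        ssh "$host" 'server start outbound1'
    done
    for host in dsi-conn1 dsi-conn2; do             # 4. inbound connectivity
        ssh "$host" 'server start inbound1'
    done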

Summary

  • If you are having problems on a single server or node, restart just that server or node.
  • If possible, reboot the whole node to clear OS or other errors that would not be cleared by restarting just the server.
  • Sequentially restarting all inbound connectivity, outbound connectivity, or catalog servers is possible, but rarely needed.
  • Sequentially restarting runtime servers is not an effective method to recover from errors – instead shut down everything, start everything, and preload.
  • Don’t sequentially restart all types of servers in your topology, except for upgrade and maintenance.
