One of the themes of the V9.1.x continuous delivery (CD) releases has been delivering improvements in queue manager startup, recovery and fail-over times. V9.1.1 saw the introduction of parallel queue loading, and improvements to recovery log processing that particularly benefited multi-instance queue manager (MIQM) installations, where the NFS file system has a higher latency than locally hosted recovery logs. You can see some examples of those improvements in my previous blog post: https://developer.ibm.com/messaging/2019/03/12/queue-manager-restart-times-in-v9-1-1/

In V9.1.2 we focused on the processing of client disconnections, in particular the time it takes to disconnect significant numbers of clients when switching queue managers or processing the fail-over of a queue manager. Results for three of the tests are shown below for MIQM and RDQM topologies. Note that these results show the cumulative benefits of performance improvements delivered in V9.1.1 & V9.1.2.

Test     Description
S1000    Switch-over with 1000 applications running at a set rate of 1 Put/Get per sec (total 1000 round trips/sec).
S500U    Switch-over with 500 applications running unrated* (MIQM total rate ~50,000 Put/Gets per sec; RDQM total rate ~85,000 Put/Gets per sec).
FN500U   Fail-over (network) with 500 applications running unrated (MIQM total rate ~50,000 Put/Gets per sec; RDQM total rate ~85,000 Put/Gets per sec). nfsv4leasetime & nfsv4gracetime reduced to 10 secs.

*Unrated applications execute Put/Get requests as fast as possible (with no pause between one Put/Get completing and the next starting), rather than at a set rate (e.g. 1 per second).

All messages were 2KB, transactional & persistent.

These tests, run in the Hursley lab environment, show a dramatic drop in the time taken to switch over from a primary/active queue manager to a secondary/passive queue manager. Both RDQM and MIQM scenarios benefit from the work done in MQ V9.1.2, with MIQM queue manager restart time dropping to under a second (since MIQM has a passive queue manager ready to accept new connections as soon as the active queue manager has disconnected all its clients and ended).

The network fail-over case above depends on tuning some parameters. For both the MIQM and RDQM scenarios the HBINT channel attribute was set to 10 seconds. This means that a client will time out of an API call after 20 seconds if no heartbeat signal is received from the queue manager during that time.
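For reference, a change like this is usually made with an MQSC command against the server-connection channel the clients use. In the sketch below the queue manager name (QM1) and channel name (APP.SVRCONN) are placeholders, not the names used in these tests:

    # Placeholder names: QM1 (queue manager), APP.SVRCONN (client SVRCONN channel)
    echo "ALTER CHANNEL(APP.SVRCONN) CHLTYPE(SVRCONN) HBINT(10)" | runmqsc QM1
    # Channel instances already running keep their old heartbeat interval; the new
    # value applies to channel instances started after the change.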

The NFS fail-over test depends on tuning down the NFS lease and grace times from the default of 90 seconds to 10 seconds. If the network connection is lost, the active queue manager will not release the NFS lock, so the standby queue manager will not get control of it until the NFS lease time has expired. NB: Setting the NFS lease and grace times will affect all applications using the NFS server, not just MQ, so this may not be appropriate in your environment.
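As an illustration only, on a Linux NFS server these values are typically exposed through the nfsd proc interface and generally have to be set before the NFS service starts serving; the sketch below assumes the NFS server can be restarted:

    # Sketch: reduce the NFSv4 lease and grace periods from the 90 second default to 10 seconds.
    # These are server-wide settings and affect every NFS client, not just MQ.
    cat /proc/fs/nfsd/nfsv4leasetime        # show the current lease time
    echo 10 > /proc/fs/nfsd/nfsv4leasetime
    echo 10 > /proc/fs/nfsd/nfsv4gracetime
    # Restart the NFS server for the new values to take effect.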

Test          Description
FN500U        Fail-over (network) with 500 applications running unrated (MIQM total rate ~50,000 Put/Gets per sec). NFS lease time reduced to 10 secs.
FN500U_NFS90  Fail-over (network) with 500 applications running unrated (MIQM total rate ~50,000 Put/Gets per sec). nfsv4leasetime & nfsv4gracetime left at the default (90 secs).

6 comments on "Improved Switch/Fail-over times in MQ V9.1.2"

  1. “The NFS fail-over tests is dependent on tuning down the NFS lease time. If the network connection is lost, the active queue manager will not release teh NFS lock, so the standby queue manager”

    Is there any downside to tuning down the NFS lease time? If not, then should this be documented as the IBM default?

    • The main consideration is that the NFS lease and grace time values are a system wide setting, so will affect all clients using NFS on that machine. This is noted in the RDQM performance report, I’ve now re-iterated that here. An acceptable value will depend on how the NFS server is being used generally, as well as by MQ. I set this low to illustrate the potential times achievable.

  2. release teh NFS lock.. spelling mistake

    For 9.1.2, what is 0.692 – is this the average time or the maximum time? If they were putting at one a second, I would expect to see a maximum time of close to 1 second, just after a Put/Get had happened.

    • Thanks, corrected the typo.

      The 0.692 value is the average (the range was 0.643 to 0.727). This is the time it takes from the MIQM switch command, or the iptables command (used to start dropping packets on a link), to the time the secondary/standby queue manager is available for clients to reconnect (asserted by the AMQ8024I 'IBM MQ channel initiator started' message). This is not bounded by the rate the clients are running at, except that a busier queue manager may have more reconciliation to perform.
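      For anyone wanting to reproduce something similar, here is a rough sketch of the kind of commands involved; the queue manager name (QM1) and NFS server address (192.0.2.10) are placeholders, not the values used in the lab:

          # Controlled switch-over: end the active MIQM instance and transfer to the standby
          endmqm -s QM1
          # Simulated network failure: start dropping packets to the NFS server
          iptables -A OUTPUT -d 192.0.2.10 -j DROP
          # Timing stops when AMQ8024I appears in the standby queue manager's error log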

  3. What does unrated mean?

    • An unrated application performs Put/Gets as fast as it can (i.e. there is no wait time between the completion of a Put/Get and the initiation of the next one). I’ve updated the text above accordingly.
