One of the themes of the V9.1.x continuous delivery (CD) releases has been improving queue manager startup, recovery and fail-over times. V9.1.1 introduced parallel queue loading and improvements to recovery log processing. The latter particularly benefited multi-instance queue manager (MIQM) installations, where the NFS file system has a higher latency than locally hosted recovery logs. You can see some examples of those improvements in my previous blog post: https://developer.ibm.com/messaging/2019/03/12/queue-manager-restart-times-in-v9-1-1/
In V9.1.2 we focused on the processing of client disconnections, in particular, the time it takes to disconnect significant numbers of clients, when switching queue managers or processing the fail-over of a queue manager. Results for three of the tests are shown below for MIQM and RDQM topologies. Note that these results show the cumulative benefits of performance improvements that were delivered in V9.1.1 & V9.1.2.
|S1000||Switch-over with 1000 applications running at a set rate of 1 Put/Get per sec (total 1000 round trips/sec)|
|S500U||Switch-over with 500 applications running unrated* (MIQM total rate ~50,000 Put/Gets per sec. RDQM total rate ~85,000 Put/Gets per sec).|
|FN500U||Fail-over (network) with 500 applications running unrated (MIQM total rate ~50,000 Put/Gets per sec. RDQM total rate ~85,000 Put/Gets per sec). nfsv4leasetime & nfsv4gracetime reduced to 10 seconds.|
*Unrated applications execute Put/Get requests as fast as they can, rather than at a set rate (e.g. 1 per second).
All messages were 2KB, transactional & persistent.
These tests, run in the Hursley lab environment, show a dramatic drop in the time taken to switch over from a primary/active queue manager to a secondary/passive queue manager. Both RDQM and MIQM scenarios benefit from the work done in MQ V9.1.2, with MIQM queue manager restart time dropping to under a second (MIQM has a passive queue manager ready to accept new connections once the active queue manager has disconnected all its clients and ended).
The network fail-over case above depends on tuning some parameters. For both MIQM and RDQM scenarios the HBINT channel attribute was set to 10 seconds. This means that a client will time out of an API call after 20 seconds if there is no heartbeat signal from the queue manager during that time.
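As a rough illustration of the relationship between HBINT and the client time-out, here is a minimal Python sketch. It assumes the time-out is simply twice the heartbeat interval, which matches the 10 second / 20 second figures quoted above; the function name `client_timeout_seconds` is made up for this example and is not part of any MQ API, and the rule may not hold for larger HBINT values.

```python
def client_timeout_seconds(hbint_seconds: int) -> int:
    """Estimate how long a client API call waits before timing out.

    Assumption (not an MQ API): the time-out is twice the HBINT value,
    consistent with HBINT=10 giving a 20 second time-out in the tests above.
    """
    if hbint_seconds <= 0:
        # An HBINT of zero or less is outside the scope of this simple estimate.
        raise ValueError("HBINT must be positive for this estimate")
    return 2 * hbint_seconds


# The setting used in the fail-over tests described above.
print(client_timeout_seconds(10))  # prints 20
```

A lower HBINT means clients notice a lost queue manager sooner, at the cost of more frequent heartbeat traffic on idle channels.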
The NFS fail-over test depends on tuning down the NFS lease and grace times from the default of 90 seconds to 10 seconds. If the network connection is lost, the active queue manager will not release the NFS lock, so the standby queue manager will not get control of it until the NFS lease time has expired. NB: Changing the NFS lease and grace times affects all applications using the NFS server, not just MQ, so this may not be appropriate in your environment.
|FN500U||Fail-over (network) with 500 applications running unrated (MIQM total rate ~50,000 Put/Gets per sec). NFS leasetime reduced to 10 seconds.|
|FN500U_NFS90||Fail-over (network) with 500 applications running unrated (MIQM total rate ~50,000 Put/Gets per sec). nfsv4leasetime & nfsv4gracetime left at the default (90 seconds).|
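Before comparing your own fail-over times with the two cases above, it can help to check what lease and grace times your NFS server is actually using. The sketch below assumes a Linux kernel NFS server, which exposes these values as `nfsv4leasetime` and `nfsv4gracetime` under `/proc/fs/nfsd`; other NFS servers configure them differently. The helper name `read_nfsd_seconds` is made up for this example.

```python
from pathlib import Path
from typing import Optional

# Assumption: a Linux kernel NFS server with the nfsd filesystem mounted here.
NFSD_DIR = Path("/proc/fs/nfsd")


def read_nfsd_seconds(name: str, base: Path = NFSD_DIR) -> Optional[int]:
    """Return the value (in seconds) stored in base/name, or None if unavailable."""
    try:
        return int((base / name).read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a plain integer.
        return None


for setting in ("nfsv4leasetime", "nfsv4gracetime"):
    value = read_nfsd_seconds(setting)
    if value is None:
        print(f"{setting}: not available on this host")
    else:
        print(f"{setting}: {value} seconds")
```

This only reads the current values. Changing them is a server-wide decision, for the reasons given in the NB above.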