One of the themes of the V9.1.x continuous delivery (CD) releases has been delivering improvements in queue manager startup, recovery and fail-over times. V9.1.1 saw the introduction of parallel queue loading, and improvements to recovery log processing that particularly benefited multi-instance queue manager (MIQM) installations, where the NFS file system has a higher latency than locally hosted recovery logs. You can see some examples of those improvements in my previous blog post: https://developer.ibm.com/messaging/2019/03/12/queue-manager-restart-times-in-v9-1-1/
In V9.1.2 we focused on the processing of client disconnections, in particular the time it takes to disconnect significant numbers of clients when switching queue managers or processing the fail-over of a queue manager. Results for three of the tests are shown below for MIQM and RDQM topologies. Note that these results show the cumulative benefits of the performance improvements delivered in V9.1.1 and V9.1.2.
|S1000||Switch-over with 1000 applications running at a set rate of 1 Put/Get per sec (total 1000 round trips/sec)|
|S500U||Switch-over with 500 applications running un-rated (MIQM total rate ~50,000 Put/Gets per sec. RDQM total rate ~85,000 Put/Gets per sec).|
|FN500U||Fail-over (network) with 500 applications running un-rated (MIQM total rate ~50,000 Put/Gets per sec. RDQM total rate ~85,000 Put/Gets per sec). NFS lease time reduced to 10 secs.|
These tests, run in the Hursley lab environment, show a dramatic drop in the time taken to switch over from a primary/active queue manager to a secondary/passive queue manager. Both RDQM and MIQM scenarios benefit from the work done in MQ V9.1.2, with MIQM queue manager restart time dropping to under a second. This is because MIQM has a passive queue manager ready to accept new connections once the active queue manager has disconnected all its clients and ended.
The network fail-over case above depends on tuning some parameters. For both MIQM and RDQM scenarios the HBINT channel attribute was set to 10 seconds. This means that clients will time out of an API call after 20 seconds if there is no heartbeat signal from the queue manager during that time.
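As a rough illustration of the arithmetic above, the sketch below estimates how long a client waits before giving up on an API call for a given HBINT value. The function name `client_timeout_seconds` and the multiplier of 2 are assumptions inferred from the figures quoted in this post (HBINT of 10 seconds giving a 20 second time-out); this is not a documented MQ formula, so treat the output as an estimate only.

```python
def client_timeout_seconds(hbint_seconds: int, multiplier: int = 2) -> int:
    """Estimate how long a client waits in an API call with no heartbeat.

    ASSUMPTION: the multiplier of 2 is inferred from the figures in this
    post (HBINT = 10s -> time-out after 20s). It is not an official formula.
    """
    if hbint_seconds <= 0:
        raise ValueError("hbint_seconds must be a positive number of seconds")
    return hbint_seconds * multiplier


# Illustrative HBINT values only; 10 is the setting used in the tests above.
for hbint in (5, 10, 30):
    print(f"HBINT={hbint}s -> estimated client time-out {client_timeout_seconds(hbint)}s")
```

The point of the sketch is simply that the client-side detection time scales with HBINT, so a lower heartbeat interval shortens the time before clients notice a network fail-over and attempt to reconnect.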
The NFS fail-over test depends on tuning down the NFS lease time. If the network connection is lost, the active queue manager will not release the NFS lock, so the standby queue manager
|FN500U||Fail-over (network) with 500 applications running un-rated (MIQM total rate ~50,000 Put/Gets per sec). NFS lease time reduced to 10 secs.|
|FN500U_NFS90||Fail-over (network) with 500 applications running un-rated (MIQM total rate ~50,000 Put/Gets per sec). NFS lease time left at the default (90 secs).|