How long will a queue manager restart take
- in the best case?
- in the worst case?
and what can I do to improve it?
It’s nice in general administration and planning to have an idea of how long your queue managers will take to start up – for example if you intend to shut one down for a brief maintenance window. However, perhaps with increasing frequency, this is a critical question as part of High Availability configurations – be that the MQ Appliance HA feature, Multi-instance queue managers, or many other (generally platform specific) options.
There are a few ways to answer this question:
- The simple but inaccurate answer – “a few seconds to a minute”.
- The accurate but unhelpful answer – “It depends”.
- The full answer – (“on what does it depend”) – the purpose of this article.
To get us going on that full answer, we need to agree some terminology (most of this is probably familiar to experienced MQ users, but just in case):
Log – almost all persistent messages which flow through MQ will at some point be written to ‘the log(s)’. This is not the set of human readable files you might be used to looking at to find information or error messages about the queue manager, but a rolling record of everything the queue manager needs to know to restart in the event of failure.
Queue file – the queue manager will regularly ‘checkpoint’, writing most messages on queues into neatly organised files. This means it doesn’t need to walk through all the logs at startup time, but can just load these queues directly back into memory with all message data required. Notably, on a clean shutdown a checkpoint will always be taken and all messages should be in these files.
In-flight transaction – any messaging activity started but not yet committed. The most obvious case is an application which has started a unit of work and sent or received some messages, but this could also be, for example, an in-doubt channel which has not yet been resolved. In-flight transactions will always need some degree of ‘replay’ or ‘rollback’ according to the content of the log files when we start the queue manager.
We need to agree on one more thing as well: what do we mean by ‘started’? We could usefully mean we have reached one of three milestones:
a) The queue manager has resolved all in-flight transactions and begun loading queues. At this point, queues will begin to be available for use by applications (some may take longer than others though, depending for example on queue depths).
b) All queues are loaded and all applications which were doing work can (theoretically) continue with their processing.
c) All applications which were connected to the queue manager are reconnected.
In general, the time spent reaching state (b) will be a relatively brief window – as even for deep queues, with modern disk systems, memory and processors it is unlikely to take more than a few seconds to load what is required.
The time spent reaching state (c) can be longer if there are very large numbers of clients. This will be somewhat dependent on infrastructure in the network and the local system network ‘stack’, as well as handling of concurrent connections in the queue manager. In the extreme case of tens of thousands of reconnecting clients, if the clients attempt to reconnect quickly (and assuming there is not a significant amount of work to do in the client application itself), it usually takes in the order of seconds or low numbers of minutes before all clients are (re)connected. Optimising this (for example, making sure channels are restarted in an order which allows the most critical applications to reconnect first) may be very important to the overall architecture, but is slightly outside the direct ‘queue manager restart’ question.
The most interesting/variable phase is therefore (a) – resolving in-flight transactions – essentially ‘log replay’. This is the focus of the rest of this article.
What will affect replay/rollback time?
Any kind of computational performance of course varies with the hardware used (see the note on disk I/O at the end of this section). We won’t therefore talk in absolutes in this section, but will try to make clear what the variables are and how you can predict and monitor the expected restart time of a particular environment.
Amount of data in the log.
This is really looking at the ‘symptom’ rather than the ‘cause’, but may be helpful when considering the upper bound on queue manager start time. When you configured your queue manager, you decided on the size and number of log files (strictly, primary and secondary log files). In the worst case, every single one of these logs could be full of transaction data which we need to replay. However, in reality, because of the regular checkpoints which are taken, only some subset of these logs will actually be required at restart time.
Some enhancements in this area in MQ 9.0.2 onward make it easier to understand what is happening here under the covers. MQ will usually attempt to checkpoint at least frequently enough to contain the ‘active’ log in the primary log files. The LOGINUSE, RECSZ and LOGUTIL attributes of QMSTATUS can be used to see if this is working, and how much transaction data is really being kept in the logs at a point in time or on an estimated ‘rolling’ basis. If you have large log files which are mostly in use, or are using a lot of secondary logs, this indicates a lot of work will be needed at startup time.
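As a rough illustration of watching these values, the sketch below pulls the three attributes out of the text printed by a DISPLAY QMSTATUS command. It assumes the usual ATTRIBUTE(value) layout of MQSC output; the sample text, the function name parse_log_status and the 80 percent threshold are illustrative assumptions rather than values captured from a real system.

```python
import re

# Log attributes discussed above, reported by DISPLAY QMSTATUS (MQ 9.0.2 onward).
LOG_ATTRS = ("LOGINUSE", "LOGUTIL", "RECSZ")

def parse_log_status(mqsc_output):
    """Extract numeric log attributes from ATTRIBUTE(value) pairs in MQSC output."""
    found = {}
    for name, value in re.findall(r"\b([A-Z]+)\((\d+)\)", mqsc_output):
        if name in LOG_ATTRS:
            found[name] = int(value)
    return found

# Illustrative sample only: not real output, and QM1 is a hypothetical name.
sample = """
   QMNAME(QM1)                             STATUS(RUNNING)
   LOGINUSE(62)                            LOGUTIL(85)
   RECSZ(412)
"""

status = parse_log_status(sample)
print(status)

# Hypothetical alert threshold; choose one that suits your own environment.
if status.get("LOGUTIL", 0) > 80:
    print("log utilisation is high: expect more replay work at restart")
```

Sampling these values regularly, rather than once, gives a better picture of how much replay work a failure at an arbitrary moment would leave behind.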
If you are trying to establish the ‘worst case’ startup time for your hardware configuration, you could therefore create a scenario which generates as many ‘active’ log pages as possible (lots of messages in flight under uncommitted transactions), and then simulate a failure and measure restart time. Conversely, if you want to ensure there can never be more than a certain amount of data to replay at startup, you can create your queue managers with smaller or fewer logs (this of course has implications for the size and number of transactions which can ever be simultaneously in flight on the queue manager).
‘Active’ Queue depths
We said earlier that loading a deep queue into memory (from a nicely organised queue file) is not necessarily very expensive. However, if you have a combination of a very deep queue and many outstanding log records to play against that queue, rebuilding the queue from the logs may take much longer.
As stated previously, checkpointing will be scheduled regularly and automatically. This can be hindered by, for example, applications holding locks on objects for extended periods (see below) or I/O issues in writing to queue files. Problems with checkpoint frequency will be visible both in increased log utilisation and as warning messages in the error logs.
If applications are designed to frequently ‘index into’ queues (also known as the using-messaging-as-a-database anti-pattern), then as well as normal performance being significantly affected, both checkpoint frequency and log replay time may suffer.
Long running or ‘large’ transactions
At checkpoint time, the ‘active logs’ will shrink until they only go back in time to (approximately) the point where the oldest active transaction began. Clearly, that means that long running transactions, or transactions with a lot of outstanding work, will mean larger log files to replay.
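To make that relationship concrete, here is a deliberately simplified toy model, not a description of MQ internals. It treats the log as a stream of byte offsets and assumes that replay must begin at whichever is older: the last checkpoint or the start of the oldest transaction still open. All names and numbers are hypothetical.

```python
def replay_start(last_checkpoint, open_txn_starts):
    """Earliest log offset still needed at restart in this simplified model."""
    return min([last_checkpoint, *open_txn_starts])

def replay_bytes(log_head, last_checkpoint, open_txn_starts):
    """Amount of log between the replay start point and the current head."""
    return log_head - replay_start(last_checkpoint, open_txn_starts)

# Only short transactions open: replay is bounded by the last checkpoint.
short_only = replay_bytes(1_000_000, 900_000, [950_000, 980_000])
print(short_only)  # 100000

# One transaction has been open since offset 50,000: it pins the active log.
one_long = replay_bytes(1_000_000, 900_000, [50_000, 980_000])
print(one_long)  # 950000
```

In this model a single forgotten unit of work is enough to turn a small replay into one covering nearly the whole log, however frequently checkpoints are taken.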
It’s worth noting that in normal operation, MQ spends very little time reading from disk, and almost all the time writing to disk – it is carefully designed in fact to maintain a steady stream of writes which will have good performance characteristics on typical hardware. At restart time however this is completely reversed, and performance will depend heavily on the read characteristics of your disk I/O subsystem. You may wish to use platform specific tools to measure/optimise these characteristics for your environment.
Estimating queue manager restart time
As mentioned above, the theoretical worst case will closely relate to the maximum configured log file space – as a rule of thumb, consider the time it would simply take to read all the data from those log files into memory on your system.
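That rule of thumb can be worked through with a little arithmetic. The sketch below assumes log file sizes are given in 4 KB pages (as the queue manager's LogFilePages setting is), and that every primary and secondary file is full. The file counts, page count and the 100 MB/s sequential read rate are example values only; substitute measurements from your own system.

```python
PAGE_BYTES = 4096  # log file sizes are specified in 4 KB pages

def max_log_bytes(primary_files, secondary_files, log_file_pages):
    """Upper bound on log data if every primary and secondary file is full."""
    return (primary_files + secondary_files) * log_file_pages * PAGE_BYTES

def worst_case_read_seconds(total_bytes, read_bytes_per_second):
    """Time to read the whole log once at a given sequential read rate."""
    return total_bytes / read_bytes_per_second

# Example configuration (hypothetical): 3 primary, 2 secondary, 4096 pages each.
total = max_log_bytes(primary_files=3, secondary_files=2, log_file_pages=4096)
print(total)  # 83886080 bytes, i.e. 80 MiB

# Assumed read rate of 100 MB/s; measure your own disk subsystem instead.
seconds = worst_case_read_seconds(total, 100_000_000)
print(round(seconds, 2))  # 0.84
```

This only bounds the time to read the data. Applying the log records to queues costs extra, so treat the result as a floor on the worst case rather than a prediction.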
In reality, there is no substitute for testing in your environment. By creating ‘badly behaved’ applications you can see for yourself what effect a large active log file has on your system – this will give you an indication as to whether in your environment worst case restart times are in the milliseconds or hours!
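For such a test, a small timing wrapper may be all that is needed. The sketch below times any command from launch to exit; the intended use is wrapping the strmqm command for a test queue manager after a simulated failure (QM1 is a hypothetical name). How the moment strmqm returns maps onto milestones (a) to (c) above is something to confirm against your own observations, and a harmless stand-in command is used here so the example runs anywhere.

```python
import subprocess
import sys
import time

def time_command(argv):
    """Run a command to completion and return (elapsed_seconds, exit_code)."""
    started = time.monotonic()
    completed = subprocess.run(argv, check=False)
    return time.monotonic() - started, completed.returncode

# Stand-in command so the sketch is runnable without MQ installed.
# On a test system you might use ["strmqm", "QM1"] instead.
elapsed, rc = time_command([sys.executable, "-c", "pass"])
print(f"exit code {rc} after {elapsed:.2f}s")
```

Repeating the measurement several times with different amounts of uncommitted work in flight shows how restart time scales with the size of the active log in your environment.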
On modern hardware and with ‘reasonably’ sized logs, it would be unusual for the restart process to take more than a minute. If you do see this, it may indicate an underlying issue (such as a problem in I/O performance).
Minimising queue manager restart time
So with all of that in mind, what are the best actions to take to minimise start/restart times for a queue manager?
- Configure your primary and secondary logs appropriately – see this page in the MQ documentation for guidance. It may be tempting to make the queue manager logs extremely large, but this implicitly allows for longer restart times in the worst case. From 9.0.2 onward, monitor LOGUTIL to see the actual usage.
- Constrain applications to (agreed) ‘SLAs’ – for example, configure MAXDEPTH on queues and MAXUMSGS on the queue manager appropriately to prevent extreme build up of messages or overly large transactions.
- Monitor queue depths for early warning when you may be heading for trouble.
- Similarly, DISPLAY CONN can identify applications with long running transactions.
- Whenever possible, shut down the queue manager cleanly to allow checkpointing (at least an ‘immediate’ shutdown, even if applications are to be terminated instantly).
- Advise application developers to follow best practices (avoiding long running transactions and ‘MQ as a database’ for example).
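The DISPLAY CONN check in the list above could be automated along the following lines. This sketch assumes you have already collected, for each connection, the UOWLOGDA and UOWLOGTI attributes (the date and time the unit of work first wrote to the log), and that they use the yyyy-mm-dd and hh.mm.ss formats. The connection identifiers, the dictionary layout and the five minute limit are hypothetical.

```python
from datetime import datetime, timedelta

def uow_age(uowlogda, uowlogti, now):
    """Age of a unit of work, measured from its first write to the log."""
    first_write = datetime.strptime(f"{uowlogda} {uowlogti}", "%Y-%m-%d %H.%M.%S")
    return now - first_write

def long_running(conns, now, limit=timedelta(minutes=5)):
    """Return (connection id, age) for units of work older than limit, oldest first."""
    aged = [
        (c["CONN"], uow_age(c["UOWLOGDA"], c["UOWLOGTI"], now))
        for c in conns
        if c.get("UOWLOGDA")  # blank when the connection has no logged unit of work
    ]
    over_limit = [item for item in aged if item[1] > limit]
    return sorted(over_limit, key=lambda item: item[1], reverse=True)

# Hypothetical data standing in for parsed DISPLAY CONN output.
now = datetime(2024, 1, 15, 12, 0, 0)
conns = [
    {"CONN": "AAAA0001", "UOWLOGDA": "2024-01-15", "UOWLOGTI": "11.59.30"},
    {"CONN": "AAAA0002", "UOWLOGDA": "2024-01-15", "UOWLOGTI": "09.00.00"},
    {"CONN": "AAAA0003", "UOWLOGDA": "", "UOWLOGTI": ""},
]

offenders = long_running(conns, now)
for conn_id, age in offenders:
    print(conn_id, age)  # AAAA0002 3:00:00
```

A sensible limit depends on the application. A batch job may legitimately hold a unit of work open for minutes, whereas an online request handler doing the same probably indicates a problem.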