Introduction

A queue manager keeps a record of transactions and persistent messages in its recovery log. It is very important that you keep this recovery log intact. Losing log data that is needed for restart recovery means the queue manager cannot restart.

A “cold start” is a mechanism by which it may be possible to restart a queue manager whose recovery log integrity has been compromised because some or all of its recovery log is missing. A cold start does NOT maintain queue manager data integrity, and unpredictable results can occur.

Some background

By default, the recovery log is written to…

  • /var/mqm/log on UNIX and Linux platforms
  • C:\ProgramData\IBM\MQ\log on Windows
  • journal receivers on IBM i (iSeries)

The recovery log is very different from the error log, although both are casually referred to as “the log”. The error log is where MQ outputs error, warning and informational messages, so it can be safely deleted (if you no longer want those messages) and the queue manager will still restart. However, the recovery log contains the queue manager’s critical data and is fundamental to the integrity of the queue manager’s persistent data.

The queue manager uses write-ahead logging, which means it writes to the recovery log using a forced write. In contrast, it either writes messages to queue files lazily, or transiently stores the data in a memory buffer. Using a forced write when writing to the recovery log guarantees that messages are persisted to disk when the MQPUT returns, or when the MQCMIT returns if putting inside a transaction. The persistent message data is, at best, written lazily to the queue files, so it may still be in a buffer after your application has finished. If the queue manager ends abruptly and is then restarted, the message data in the queue files may be inconsistent or incomplete and so cannot be relied on. Nevertheless, the data in the recovery log is reliable, and it is replayed as needed when the queue manager is restarted to restore the persistent data to the correct logical state.
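You can see the effect of write-ahead logging with the MQ sample programs. This is a minimal sketch, assuming a running queue manager QM1 and a local queue Q1 defined with DEFPSIST(YES) so that the samples put persistent messages (the queue and queue manager names are illustrative):

```shell
# Put a persistent message; the forced write to the recovery log
# completes before amqsput's MQPUT returns.
echo "hello, recovery log" | amqsput Q1 QM1

# End the queue manager immediately, then restart it. On restart the
# logger replays the recovery log as needed.
endmqm -i QM1
strmqm QM1

# The persistent message is still available after the restart.
amqsget Q1 QM1
```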

Whilst running, the queue manager will checkpoint from time to time. Any message data that was written lazily to queue files at the start of a checkpoint will have been persisted to disk by the end of the checkpoint. When the queue manager subsequently restarts, it only has to replay the recovery log written since the start of the last checkpoint, because any data written before that will be in the queue files. If there were transactions in-flight that started before the last checkpoint, the queue manager will need access to the recovery log from the beginning of the earliest in-flight transaction. Checkpoints limit the amount of recovery log that the queue manager needs to replay on restart. The log extents written since the last checkpoint (or the beginning of the earliest in-flight transaction) are known as the log extents needed for restart recovery.
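On MQ v9.0.2 or later you can ask a running queue manager which log extents it still needs. A hedged sketch (the attribute names below are as I recall them; check the DISPLAY QMSTATUS documentation for your MQ version):

```shell
# Show queue manager status, including recovery log information.
echo "DISPLAY QMSTATUS ALL" | runmqsc QM1

# Fields of interest for a linear queue manager:
#   CURRLOG  - the log extent currently being written
#   RECLOG   - oldest extent needed for restart recovery
#   MEDIALOG - oldest extent needed for media recovery
# Extents older than both RECLOG and MEDIALOG are no longer needed
# by the queue manager.
```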

What state your queue manager may be in once it’s coldstarted

If you lose some or all of the log extents needed for restart recovery, the queue manager will be unable to replay the recovery log and so it will fail to restart. Customers who are unfortunate enough to find themselves in this unhappy position sometimes require that their queue manager restarts in some form, at the expense of maintaining data integrity. Restarting a queue manager whose recovery log is corrupt in some manner is known as coldstarting a queue manager. On coldstart, the queue manager creates an empty recovery log and relies on the data in the queue files and other object files in their existing state. Because the data in the queue files may be inconsistent, messages may be lost, duplicated, corrupted or inconsistent.

The queue manager stores the configuration of its other persistent objects in the recovery log as well as in object files, and other internal state data is recorded in the recovery log too. So on coldstart, internal state data is reset and all this other configuration data may be inaccurate.

The effects of coldstart are unpredictable and wide-ranging so it is best to avoid coldstart unless absolutely necessary. After coldstarting, the information in the queue and object files may be so inconsistent that the queue manager will not restart at all. Even if it does restart, there is no simple way of discovering what message data or configuration can be relied on and what cannot. After a coldstart, queues might be damaged and so become completely unusable. Even if you can get from or put to a particular queue, the messages on it might be corrupt, missing or duplicated. Transactions and channels might be stuck in-doubt. Even if you are lucky and your queue manager coldstarts successfully and the queues look intact, the unpredictable effects of the coldstart might not be realised until much later.

IBM does not support customers running on a queue manager that was previously coldstarted.

What to do if you need to coldstart

IBM strongly discourages customers from coldstarting. If you are in the unfortunate position of definitely needing to coldstart a queue manager, please contact IBM MQ Support (https://www.ibm.com/mysupport/s/topic/0TO5000000024cJGAQ/ibm-mq?language=en_US) and they will guide you through the process. The coldstart process used to be much more complicated for a linear queue manager than for a circular one. In v9.1.3 the process was much simplified: it no longer involves copying or renaming log extents, and it supersedes the previous, more complicated process. In v9.1.3, IBM Support will give you a key which you pass to strmqm to coldstart. Unfortunately, the v9.1.3 coldstart command still carries the same risks of losing data integrity, and you are still not fully supported once you have coldstarted.

Eliminating future cold starts: a request

The strmqm command requires a key to coldstart because we would like customers to contact IBM Support if they need to coldstart, as we are keen to understand how you got into this situation. Clearly coldstart is something that is best avoided and MQ has gone to considerable effort to make sure that customers will not need to. Please tell us how you got into this situation so we can discover whether there is anything else that MQ could do to avoid customers having to coldstart.

Precautions to avoid coldstart

The default logging method when creating a queue manager is circular logging. With circular logging you allow the queue manager a particular number of primary and secondary log extents of a given size. Create your log filesystem big enough to contain all the primary and secondary log extents, and you can forget about them – you should never need to administer them. Alternatively, you can use linear logging instead. Linear logging gives you the added ability to recover queues and other objects, in the unlikely event that they become damaged. But by default, linear logging requires you to delete log extents that are no longer needed for restart or media recovery; this is referred to as manual log management. When administering log extents in this way, it is possible to inadvertently delete too many log extents and so end up having to coldstart. To mitigate this risk, MQ added automatic log management in v9.0.2, which I wrote about in my previous blog “Logger enhancements for MQ v9.0.2 and v9.1” https://developer.ibm.com/messaging/2018/08/28/logger-enhancements-mq-v9-0-2/
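Here are hedged examples of the crtmqm logging options discussed above; the extent counts, sizes and object names are illustrative, not recommendations:

```shell
# Circular logging (the default): 10 primary + 5 secondary extents,
# each of 16384 x 4KB log file pages (64MB per extent).
crtmqm -lc -lp 10 -ls 5 -lf 16384 QM1

# Linear logging with automatic log management (MQ v9.0.2 or later):
# the queue manager reuses or deletes extents it no longer needs,
# so you never have to delete them by hand.
crtmqm -lla QM2

# With linear logging, a damaged object can be recreated from its
# media image in the recovery log, for example:
rcrmqobj -m QM2 -t ql DAMAGED.QUEUE
```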

Best practice is to put your recovery log in a separate filesystem which contains only the recovery log. Customers who put their recovery log in the same filesystem as the rest of their queue manager sometimes find that filesystem accidentally filling up, perhaps due to large queue files. If the filesystem holding the queue files fills, you may not be able to put to those queues, but the queue manager continues running. If the filesystem containing the recovery log fills, the queue manager ends abruptly and will not restart until you free up some space. Be careful not to delete log extents needed for restart recovery, otherwise you may find yourself needing to coldstart. Either make the queue manager’s log directory a separate filesystem, or specify a different log filesystem using the -ld command line option on crtmqm. How large to make your log filesystem is documented at https://www.ibm.com/support/knowledgecenter/SSFKSJ_9.0.0/com.ibm.mq.con.doc/q018470_.htm .

When using the IBM MQ Appliance or an RDQM queue manager, the recovery log and queue files cannot be put into separate filesystems/volumes. This increases the risk of large queue files filling the filesystem that contains the recovery log, so extra care should be taken when using RDQM and the IBM MQ Appliance.

Sometimes customers also find they need to coldstart because the disk that contains their recovery log failed. Best practice is to put the recovery log on a replicated disk and so mitigate the risk of a disk crash.
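For example, here is a sketch of creating a queue manager with its recovery log on a dedicated (ideally replicated) filesystem. The mount point /mqlog is an assumption for illustration:

```shell
# Create the queue manager with its recovery log under /mqlog,
# a filesystem mounted separately from /var/mqm.
crtmqm -ld /mqlog QM1

# The log now lives under /mqlog (per-queue-manager subdirectory),
# separate from the queue files under /var/mqm/qmgrs/QM1, so a
# runaway queue file cannot fill the log filesystem.
df -h /mqlog /var/mqm
```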

Keep a note of which queue managers have been previously coldstarted, even if they were coldstarted a long time ago and have been stopped, restarted and/or migrated in the meantime. When reporting problems to IBM, it is helpful to mention if the queue manager was previously coldstarted. Moving your messages and configuration to a new replacement queue manager will avoid the possibility of ongoing problems with a queue manager that has been previously coldstarted and will get you back to being supported.
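Moving to a replacement queue manager can be sketched roughly as below. This is a hedged outline, not a full migration procedure; the queue manager and queue names are illustrative, and the dmpmqmsg option letters vary by version, so check the documentation for your release:

```shell
# 1. Dump the old queue manager's configuration as MQSC commands.
dmpmqcfg -m OLDQM -a > oldqm.mqsc

# 2. Replay that configuration on the new queue manager.
runmqsc NEWQM < oldqm.mqsc

# 3. Unload the messages from each queue on the old queue manager
#    and load them onto the corresponding queue on the new one.
dmpmqmsg -m OLDQM -i APP.QUEUE -f /tmp/app.queue.msgs
dmpmqmsg -m NEWQM -o APP.QUEUE -f /tmp/app.queue.msgs
```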

Conclusion

In conclusion, keep your MQ recovery log safe – either by using a circular log or automatic log management, written to a replicated disk. Do not coldstart unless absolutely necessary as doing so loses data integrity and unpredictable results can occur. If you think you need to coldstart, contact IBM Support and they will guide you through the new process for v9.1.3.

4 comments on “About coldstart – or why MQ recovery logs must be kept safe”

  1. how about RDQM? I do not know a way to separate Log and Q content to different volumes anymore 🙁

    • MarkWhitlock August 15, 2019

      Thank you for the question. Using an RDQM queue manager you cannot separate the recovery log and queue files into separate filesystems/volumes. Having the recovery log and queue files in the same filesystem/volume is an architectural limitation and a trade off when using RDQM. This does increase the risk of large queue files filling the filesystem that contains the recovery log, so extra care should be taken when using RDQM. I will update the blog to make this clear.

  2. The link “please contact IBM MQ Support https://www.ibm.com/support/home/product/P439881V74305Y86/IBM%20MQ” doesn’t work any more.
    LMD.
