How can client applications find their queue manager following failover?
When a queue manager restarts or relocates – through any means, such as the appliance HA or DR functionality – applications need to reconnect to the ‘new instance’ of the queue manager. The reconnect may be made explicitly in application code, triggered by human intervention, or handled by the auto-reconnect features of the MQ client library.
Whichever mechanism is used, the client will need to ‘find’ the new running queue manager. The simplest mechanism for achieving this on the MQ Appliance in High Availability groups is probably ‘floating IP’ – assigning an address to the queue manager which will redirect to the new location if it moves on restart.
However, in a disaster recovery configuration there is deliberately no automatic failover to the DR instance, and no support for floating IP addresses (which would potentially be much more complicated anyway in a cross-site deployment). Here we consider a couple of approaches to managing the client connectivity to a DR instance of a queue manager.
There are many approaches to Disaster Recovery. Some deployments attempt to mirror the live configuration exactly, which implies that the DR site must stay ‘cold’ and never connect with the live running instance, to avoid confusion between live and recovery data. Other deployments maintain a ‘warm’ or even ‘hot’ DR site, which may even interact with the live applications but therefore cannot precisely mirror the live servers. Broadly speaking, the two options described here reflect these two approaches, but there are many others (including compromises between the two), and philosophies on testing DR capabilities differ in particular. A full discussion is out of scope for this article, but hopefully this high-level overview of the two will be a good jumping-off point for designing the right fit for your environment.
Note 1) The discussion is much the same whether or not HA and floating IP are in use at the live site, although floating IP does mean that an IP address is configured per queue manager, which is not necessarily the case otherwise. If two appliances are available at the disaster recovery site, you may wish to restore an HA configuration following failover, in which case floating IP addresses might again be used for the individual queue managers.
Note 2) Bear in mind that ‘locating’ the queue manager is only part of the process of reconnecting. While auto-reconnection in the MQ client libraries can hide much complexity of a failover for HA, this will not necessarily be appropriate for disaster recovery scenarios with other external dependencies, and always comes with certain caveats. See for example https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_9.0.0/com.ibm.mq.con.doc/q018380_.htm which is written primarily regarding multi-instance queue managers, but applies equally to appliance HA. If data has been lost following a failover, always a possibility with asynchronous replication, then application recovery logic may be further complicated.
Note 3) Another major variable is whether applications are expected to ‘persist’ across a DR failover, or DR instances of the applications will be started. Often the answer is a mixture of the two as in the illustration.
Option 1 – ‘non identical, but warm’ Disaster Recovery site
What does it look like?
- Network addressing is different between live and DR appliances (likely different subnets etc.)
- Ideally, a DNS name is configured for each queue manager IP address.
- Client applications are configured to connect to the live queue manager IP address (or, preferably, DNS name). These might be the same (‘global’) clients, or might be ‘DR instances’.
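Because the two sites use different address ranges in this option, a quick way to see where a queue manager's DNS name currently points is to resolve it and compare the result against each site's subnet. The following is a minimal sketch using only the Python standard library; the subnet ranges and the names `SITE_SUBNETS`, `site_for_address` and `site_for_hostname` are illustrative assumptions, not part of any MQ tooling.

```python
import ipaddress
import socket

# Hypothetical site subnets; substitute the ranges used in your own network.
SITE_SUBNETS = {
    "live": ipaddress.ip_network("10.1.0.0/16"),
    "dr": ipaddress.ip_network("10.2.0.0/16"),
}


def site_for_address(address, subnets=SITE_SUBNETS):
    """Return the name of the site whose subnet contains address, or None."""
    ip = ipaddress.ip_address(address)
    for site, network in subnets.items():
        if ip in network:
            return site
    return None


def site_for_hostname(hostname, port=1414, subnets=SITE_SUBNETS):
    """Resolve hostname and map each resolved address to a site name (or None)."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    return {address: site_for_address(address, subnets) for address in addresses}
```

Calling `site_for_hostname` with a queue manager's DNS name before and after a DNS change gives a simple confirmation that the resolver on that machine is returning the expected site's address. It reflects only the local resolver's view, so other clients may still hold cached entries.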
What happens following a disaster:
- ‘DR instances’ of applications are preconfigured with the DR addresses.
- Global applications must either be aware of the DR address (through a CCDT or a comma-separated connection name list) or connect using a DNS name.
- If DNS is not configured to fail over to a DR ‘version’ automatically, the DNS entry will need updating (and cached entries may need flushing through).
Advantages:
- Minimal action needed at failover time
- Fewer configuration changes means fewer potential error cases
- Testing can be performed at any time by separating DR from live, and resynchronising when the test is complete.
Disadvantages:
- DR site configuration not identical to Production.
- Application deployments may need to differ between sites.
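To illustrate how a ‘global’ client can be made aware of both the live and DR addresses, the sketch below parses a comma-separated connection name list of the form `host(port),host(port)` and reports the first endpoint that accepts a TCP connection. The MQ client library performs this kind of ordered fallback itself when given such a list, so this is only a diagnostic illustration and not a replacement for it. The function names and the default port of 1414 are assumptions, and a successful TCP connection shows only that something is listening, not that the queue manager is available.

```python
import re
import socket

# One list entry: a host name or address, optionally followed by "(port)".
_ENTRY = re.compile(r"^\s*([^\s()]+)\s*(?:\((\d+)\))?\s*$")


def parse_conname_list(conname, default_port=1414):
    """Split 'host1(1414),host2(1414)' into [(host, port), ...] in order."""
    endpoints = []
    for entry in conname.split(","):
        match = _ENTRY.match(entry)
        if not match:
            raise ValueError(f"unrecognised connection name entry: {entry!r}")
        host, port = match.group(1), match.group(2)
        endpoints.append((host, int(port) if port else default_port))
    return endpoints


def first_reachable(endpoints, timeout=3.0):
    """Return the first (host, port) that accepts a TCP connection, else None."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # unreachable or refused: try the next entry in the list
    return None
```

For example, `first_reachable(parse_conname_list("qm1-live.example.com(1414),qm1-dr.example.com(1414)"))` (hypothetical host names) would return the live endpoint while it is up and the DR endpoint once only that one is listening.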
Option 2 – ‘identical, but cold’ Disaster Recovery site.
What does it look like?
- Network configuration/labelling the same at both sites
- For each queue manager address, a ‘secondary IP’ is defined on the DR box on the QM interface.
- THIS INTERFACE IS DISABLED in normal operation (at least at the network fabric, and probably on the appliance itself)
- i.e. the MQ traffic network at the DR site is preconfigured but ‘dark’
What happens following a disaster:
- Network Interface(s) are enabled on DR appliance
- All infrastructure reconfigured to enable routing to this subnet
- Clients (whether global or DR instances) reconnect to same IP address for each queue manager as previously used at live site
- DNS may be used or not to provide additional point of flexibility
Advantages:
- Configuration ‘more similar’ at DR and live sites
- Application code/connection details identical.
Disadvantages:
- More reconfiguration (human interaction required)
- More manual intervention could mean more opportunities for error.
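Since this option depends on manual network changes before clients can reconnect, it can help to confirm that the queue manager's address is answering again before restarting or releasing applications. The following is a minimal sketch of such a check, polling a TCP port until it accepts a connection or a deadline passes. The function name, default port and timings are assumptions to adjust for your environment, and the check says nothing about whether the queue manager itself is healthy.

```python
import socket
import time


def wait_for_listener(host, port=1414, deadline_seconds=300.0, interval_seconds=5.0):
    """Poll host:port until it accepts a TCP connection.

    Returns the elapsed time in seconds on success, or None if the
    deadline passes without a successful connection.
    """
    start = time.monotonic()
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_seconds):
                return time.monotonic() - start
        except OSError:
            pass  # interface or routing not ready yet
        if time.monotonic() - start >= deadline_seconds:
            return None
        time.sleep(interval_seconds)
```

Run against the address the clients already use, this gives the operator a simple signal that the interface and routing changes have taken effect from that machine's point of view.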
Clearly this article can only scratch the surface of Disaster Recovery planning, but a key point is that it is vital to consider in advance which elements of the overall architecture are ‘shared’ between the live and DR environments, and which are not – as well as whether the same will apply in test scenarios. Understanding when a discrete view of your messaging infrastructure is desirable for separation, and when you would prefer things look as precisely identical as possible will help guide the mechanism you choose for managing failover of these connections.