Warning: long append, but full of good information…
I was helping a customer with “an MQ performance problem” where application queues were filling up. I asked for an architecture diagram and other information. They did not have this documented, and Joe spent many minutes drawing a complex picture on the whiteboard: blue boxes for virtual machines, green boxes for MQ, red lines for data flow, etc. I took a photo with my mobile phone for reference.
I was disappointed that Joe, who was responsible for the business application, knew so little about it. I felt like tying him to a chair, like in the old movies, shining a light in his face and saying “tell me all you know”… but I took the easier solution of expressing my expectations of a business architect over lunch.
I expect each major business application to have a document (a presentation is fine) containing enough information for people outside the application to understand it:
- What the application does – a one paragraph description
- How important it is to the business
- What depends on it, and what it depends on
- What the HA requirements are, and what the numbers mean. “No more than 10 seconds downtime a day” – is this for all clients? What happens if 10% of clients get a 100 second outage – is this the same? Is a 5 minute total lack of service during a switch to the backup system acceptable?
- (And how often do you measure this switchover time to ensure it is achievable?)
- The cost and business impact of an outage – this is useful when deciding which fire to fight.
- A picture showing the logical data flow
- Where does data come from – how many clients – what security is used for the clients eg SSL?
- Workload profile – is it one message per client (connect, MQPUT, MQGET, MQDISC), or do clients connect and stay connected all day?
- Into IIB … and here is the high level IIB flow description
- Where the single points of failure are – and what is done to reduce the impact
- Where are the performance bottlenecks (eg usually in the database requests)
- Has this picture been shared with the MQ infrastructure team? (get their names put on it)
- At what points in the flow is monitoring done to identify bottlenecks
- What monitoring is used – eg queue high event to spot queue depths increasing – and what actions are taken for each monitoring point.
- What dead letter queue and error queues are used.
- What applications process messages on these queues – eg DLQ handler
- Does the user id which processes this queue have only the minimum authority, or can it put messages to other queues and so get round authority checks? For example, someone fakes the DLQ header to say “queue full” for the “payments queue”, and DLQ processing then tries to re-put the message to that queue.
- What monitoring is in place for these queues – eg any message should produce an event
- If we want to shut down this MQ and this IIB to apply maintenance – does the business application keep running, or is there an outage?
- If I want to add a new MQ, IIB etc to provide more scalability and HA capabilities – is it easy to do? Or do the clients have hard coded IP addresses?
- Message volumes and sizes
- A profile over the day and over the week showing volumes by hour
- Projection of how volumes will change over the next year
- For every message coming in – how many messages are used internally? Include audit messages, hops between message flows, and the reply messages
- What are the typical and maximum message sizes expected?
- Have these figures been discussed with the MQ infrastructure team?
- What security is used – for example controlling which groups of people (groups, not individual user ids) can put to or get from the queues
- Does any application need to set message context?
- Documentation as to why key decisions were made. For example: we used a separate Virtual Machine (VM) for MQ, a separate VM for IIB, and a separate VM for BPM, because… or we used a single VM for MQ+IIB+BPM to eliminate moving messages around, and to create a cookie cutter deployment. We did not do … because… Include a date and the people involved. You need this so that if you want to change things significantly you can go and discuss the changes with the people who originally made the decision. A decision that was good 10 years ago may no longer be best practice.
- Why were non persistent messages used? Because applications can resend if they do not get a response within 5 seconds, and applications are designed to handle possible duplicate requests – ahh, good answer
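The “for every message coming in, how many are used internally” item in the list above is simple arithmetic, and it is worth writing down. Here is a minimal sketch in Python. Every number in it (requests per hour, audit messages, hops) is a made-up placeholder, not a figure from this customer; replace them with your own measurements.

```python
# All values are hypothetical placeholders; substitute measured figures.
INCOMING_PER_HOUR = 100_000   # assumed peak-hour client requests
AUDIT_PER_REQUEST = 2         # assumed audit messages written per request
HOPS_PER_REQUEST = 3          # assumed hops between message flows
REPLIES_PER_REQUEST = 1       # assumed one reply per request


def messages_per_request(audit: int, hops: int, replies: int) -> int:
    """Total MQ messages generated by one incoming request, including itself."""
    return 1 + audit + hops + replies


per_request = messages_per_request(
    AUDIT_PER_REQUEST, HOPS_PER_REQUEST, REPLIES_PER_REQUEST
)
total_per_hour = INCOMING_PER_HOUR * per_request
total_per_second = total_per_hour / 3600

print(f"messages per incoming request: {per_request}")
print(f"messages per hour at peak:     {total_per_hour}")
print(f"messages per second at peak:   {total_per_second:.1f}")
```

This is the number to discuss with the MQ infrastructure team: the load on the queue manager is the total, not just the count of client requests.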
And the reason for the “MQ performance problem” was that at peak time each database request to the remote database took over half a second – and there were 12 requests per transaction – so 6 seconds per transaction!
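The arithmetic can be sketched in a few lines of Python. The half second and the 12 requests come from the case above; the arrival rate and the number of threads getting messages are hypothetical, chosen only to show why a queue fills up when transactions take this long.

```python
# Figures from the case above.
DB_REQUEST_SECONDS = 0.5        # "over half a second" per remote database request
REQUESTS_PER_TRANSACTION = 12

# Hypothetical figures, for illustration only.
ARRIVALS_PER_SECOND = 10        # assumed peak message arrival rate
CONSUMER_THREADS = 20           # assumed number of threads getting messages


def transaction_seconds(request_seconds: float, requests: int) -> float:
    """Time one transaction spends on sequential database requests."""
    return request_seconds * requests


def queue_growth_per_second(arrivals: float, threads: int,
                            service_seconds: float) -> float:
    """Net messages added to the queue each second (0 if consumers keep up)."""
    drain_rate = threads / service_seconds
    return max(0.0, arrivals - drain_rate)


service = transaction_seconds(DB_REQUEST_SECONDS, REQUESTS_PER_TRANSACTION)
growth = queue_growth_per_second(ARRIVALS_PER_SECOND, CONSUMER_THREADS, service)

print(f"seconds per transaction: {service}")
print(f"queue growth per second: {growth:.2f}")
```

With these assumed numbers, 20 threads at 6 seconds each drain only about 3.3 messages a second, so the queue depth climbs no matter how well MQ itself is tuned.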
The next topic in this theme is What’s the difference between a C programmer and an application programmer?