Warning: long append, but full of good information…
I was helping a customer with “an MQ performance problem” where application queues were filling up. I asked for an architecture diagram and other information. They did not have this documented, and Joe spent many minutes drawing a complex picture on the whiteboard: blue boxes for virtual machines, green boxes for MQ, red lines for data flow, etc. I took a photo with my mobile phone for reference.
I was disappointed that Joe, who was responsible for the business application, knew so little about it. I felt like tying him to a chair, like in the old movies, shining a light in his face and saying “tell me all you know”… but I took the easier solution of explaining, over lunch, what I expect from a business architect.
I expect each major business application to have a document (a presentation is fine) containing enough information for people outside the application to understand it:
- What the application does – a one paragraph description
- How important it is to the business
- What depends on it, and what it depends on
- What the HA requirements are, and what the numbers mean. “No more than 10 seconds downtime a day” – is this for all clients? What happens if 10% of clients get a 100 second outage – is this the same? Is a 5 minute total lack of service during a switch to the backup system acceptable?
- (And how often do you measure this switchover time to ensure it is achievable?)
- The cost and business impact of an outage – this is useful when deciding which fire to fight.
- A picture showing the logical data flow
- Where does data come from – how many clients – what security is used for the clients eg SSL?
- Workload profile – is it one message per client (connect, MQPUT, MQGET, MQDISC), or do clients connect and stay connected all day?
- Into IIB … and here is the high level IIB flow description
- Where the single points of failure are – and what is done to reduce the impact
- Where the performance bottlenecks are (eg usually in the database requests)
- Has this picture been shared with the MQ infrastructure team? (get their names put on it)
- At what points in the flow is monitoring done to identify bottlenecks
- What monitoring is used – eg queue high event to spot queue depths increasing – and what actions are taken for each monitoring point.
- What dead letter queue and error queues are used.
- What applications process messages on these queues – eg DLQ handler
- Does the user id which processes this queue have only the minimum authority, or can it put messages to other queues and so get round authority checks? For example, someone has faked the DLQ header to say “queue full” for the “payments queue”, so DLQ processing tries to re-put the message to that queue.
- What monitoring is in place for these queues – eg any message should produce an event
- If we want to shut down the MQ and this IIB to put maintenance on, does the business application keep running or is there an outage?
- If I want to add a new MQ, IIB etc to provide more scalability and HA capabilities, is it easy to do? Or do the clients have hard coded IP addresses?
- Message volumes and sizes
- A profile over the day and over the week showing volumes by hour
- Projected change in volumes over the next year
- For every message coming in, how many are used internally? Include audit messages, hops between message flows, and the reply messages.
- What are the typical and maximum message sizes expected?
- Have these figures been discussed with the MQ infrastructure team?
- What security is used – for example controlling which groups of people (groups, not individual user ids) can put to or get from the queues
- Does any application need to set message context?
- Documentation as to why key decisions were made. For example: we used a separate Virtual Machine (VM) for MQ, a separate VM for IIB, and a separate VM for BPM, because… or we used a single VM for MQ+IIB+BPM to eliminate moving messages around, and to create a cookie cutter deployment. We did not do … because… Include a date and the people involved. You need this so that if you want to change things significantly you can go and discuss the changes with the people who originally made the decision. A decision that was good 10 years ago may no longer be best practice.
- Why were non-persistent messages used? Because applications can resend if they do not get a response within 5 seconds, and applications are designed to handle possible duplicate requests – ahh, good answer.
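The question in the HA bullet above (is 10% of clients losing 100 seconds the same as everyone losing 10 seconds?) is easy to put numbers on. This is only a small sketch; the function names are made up for illustration, and it assumes all clients count equally. It shows that the two scenarios have the same average downtime per client, but very different worst cases, which is why the requirement needs to say which one it means.

```python
def average_outage_seconds(percent_of_clients_affected: int, outage_seconds: int) -> float:
    """Average outage per client, spread across the whole client population."""
    return percent_of_clients_affected * outage_seconds / 100


def worst_outage_seconds(percent_of_clients_affected: int, outage_seconds: int) -> int:
    """Longest outage that any single client sees."""
    return outage_seconds if percent_of_clients_affected > 0 else 0


# Scenario A: every client is down for 10 seconds.
# Scenario B: 10% of clients are down for 100 seconds.
scenarios = {
    "all clients, 10 s": (100, 10),
    "10% of clients, 100 s": (10, 100),
}

for name, (percent, seconds) in scenarios.items():
    average = average_outage_seconds(percent, seconds)
    worst = worst_outage_seconds(percent, seconds)
    print(f"{name}: average {average} s per client, worst case {worst} s")
```

Both scenarios average out to 10 seconds per client, but in the second one some clients are down for 100 seconds.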
And the reason for the “MQ performance problem”? At peak time each database request to the remote database took over half a second, and there were 12 requests per transaction, so at least 6 seconds per transaction!
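To see why slow database requests show up as queues filling up, here is a rough back-of-envelope sketch. The 12 requests and half a second come from the story above; the arrival rate, the number of worker threads and the one minute window are invented purely for illustration. It also assumes the queue starts empty and each worker handles one transaction at a time.

```python
def transaction_seconds(requests_per_transaction: int, seconds_per_request: float) -> float:
    """Time for one transaction when its database requests run one after another."""
    return requests_per_transaction * seconds_per_request


def queue_backlog(arrivals_per_second: float, worker_threads: int,
                  service_seconds: float, duration_seconds: float) -> float:
    """Messages left on the queue after duration_seconds, starting from empty."""
    arrived = arrivals_per_second * duration_seconds
    processed = worker_threads * duration_seconds / service_seconds
    return max(0.0, arrived - processed)


# From the post: 12 database requests, each taking (at least) 0.5 seconds.
service = transaction_seconds(12, 0.5)

# Hypothetical workload: 5 messages a second, 10 worker threads, one minute at peak.
backlog = queue_backlog(arrivals_per_second=5, worker_threads=10,
                        service_seconds=service, duration_seconds=60)

print(f"{service:.1f} s per transaction, backlog after one minute: {backlog:.0f} messages")
```

With these made-up numbers the workers can only clear 100 of the 300 messages that arrive in the minute, so the queue depth keeps climbing even though MQ itself is doing nothing wrong.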
The next topic in this theme is What’s the difference between a C programmer and an application programmer?