Measures of performance typically include product speed in terms of processing rate and response times, and resource usage in terms of the CPU and memory consumed. To assess the performance of an application, metrics are used to compare the actual performance with the required or expected performance, using measures such as the number of messages per second, elapsed time, CPU utilisation, or CPU cost per message. A combination of system level and component level monitoring tools is required to successfully monitor your systems and determine the root cause of any issues that are observed.
- System level tools: Monitoring at the operating system level to observe system resource usage (CPU, memory, I/O) and heaviest resource users.
- Component level tools: Monitoring behaviour within a particular component (IBM Integration Bus, MQ, etc.)
This article describes typical use cases and associated configuration with respect to the accounting and statistics monitoring capabilities within IBM Integration Bus.
What accounting and statistics data should I collect and why?
The message flow statistics and accounting data collection is highly configurable, enabling customisation dependent on monitoring requirements. There are two collection options:
- Snapshot data: Data is collected for an interval of approximately 20 seconds, at which point the recorded statistics are written to the output destination and the interval is restarted.
- Archive data: Data is collected for an interval of approximately 60 minutes by default, but this interval is configurable in the range of 1 through 43200 minutes.
In addition to the collection option you can also control what level of data is output depending on your specific requirements:
- Integration Servers: Enable data collection from a single or all integration servers
- Message Flows: Specify a particular message flow or enable for all message flows
- Nodes: Choose whether to exclude node related data, or include basic or advanced data
- Threads: Specify whether or not to include thread-specific data
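The options above map onto the arguments of the `mqsichangeflowstats` command. The following is a minimal sketch, assuming the flag names as documented for IBM Integration Bus v10 (`-s`/`-a`, `-e`/`-g`, `-f`/`-j`, `-c`, `-n`, `-t`); the helper `build_flowstats_args` is hypothetical and only assembles an argument list, so verify the flags against the command reference for your product version before running anything. Flows deployed inside applications or libraries may need further options that are not covered here.

```python
from typing import List, Optional

NODE_LEVELS = {"none", "basic", "advanced"}
THREAD_LEVELS = {"none", "basic"}


def build_flowstats_args(
    node_name: str,
    collection: str = "snapshot",              # "snapshot" or "archive"
    integration_server: Optional[str] = None,  # None means all integration servers
    message_flow: Optional[str] = None,        # None means all message flows
    node_level: str = "basic",
    thread_level: str = "none",
    active: bool = True,
) -> List[str]:
    """Assemble (but do not run) an mqsichangeflowstats argument list."""
    if collection not in ("snapshot", "archive"):
        raise ValueError("collection must be 'snapshot' or 'archive'")
    if node_level not in NODE_LEVELS:
        raise ValueError("node_level must be none, basic or advanced")
    if thread_level not in THREAD_LEVELS:
        raise ValueError("thread_level must be none or basic")

    args = ["mqsichangeflowstats", node_name]
    args.append("-s" if collection == "snapshot" else "-a")
    # A named integration server, or all of them.
    args += ["-e", integration_server] if integration_server else ["-g"]
    # A named message flow, or all of them.
    args += ["-f", message_flow] if message_flow else ["-j"]
    args += ["-c", "active" if active else "inactive"]
    args += ["-n", node_level, "-t", thread_level]
    return args


# Example: snapshot data with advanced node data and thread data for a single
# flow. "IBNODE", "default" and "OrderFlow" are placeholder names.
example = build_flowstats_args(
    "IBNODE",
    integration_server="default",
    message_flow="OrderFlow",
    node_level="advanced",
    thread_level="basic",
)
print(" ".join(example))
```

Restricting the arguments to one integration server and one message flow is what keeps both the monitoring overhead and the volume of generated data down.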
This performance data can be used in a number of different use cases ranging from problem determination to capacity planning:
Accounting and statistics data can be used to help find the cause of slow response times or high CPU usage in a production environment. The WebUI in IBM Integration Bus v10 can be used to start, stop and view snapshot data across all message flows in an integration server or for just an individual message flow. If an issue is identified in a production environment this can be used to take a closer look and identify the root cause. The data is typically of little value once the problem has been identified, so there is no need to archive it, although a summary of it might appear in an incident report.
Although the overhead associated with monitoring is low (typically ~3%), enabling snapshot data for just the problem message flow or container helps to limit any impact and reduce the amount of data being generated. If you are using IBM Tivoli Composite Application Manager (ITCAM) or IBM Tivoli OMEGAMON XE for Messaging for z/OS, they additionally provide a mechanism for you to enable and modify statistics data collection dynamically from a remote location as a “take action” command.
More detailed information on using accounting and statistics for problem determination in combination with system level tools can be found here:
Message flow profiling in development
Optimising performance whilst projects are in development helps to lower hardware requirements, maximise throughput, reduce response times and ensure your SLAs are met. It is best to design your message flows to be as efficient as possible from day one as resolving issues at a later stage costs much more, and working around bad design can make the problem worse.
The WebUI in IBM Integration Bus v10 can be used to start, stop and view snapshot data across all message flows in an integration server or for just an individual message flow. There would typically be a number of rounds of profiling and enhancement to identify hotspots and make changes before re-evaluating to identify the next hotspot or bottleneck.
The data is typically of little value subsequently once the improvements have been made, so there is no need to archive.
It is crucial to ensure that the required production capacity is available to your integrations for expected workloads and anticipated changes in those workloads. System level tools provide a view of the hardware resource usage of production systems; this should be used alongside accounting and statistics data, which provides a breakdown of usage for your integrations. With this data it is possible to predict resource requirements for increased workloads. Comparing message flow CPU metrics enables estimation of system CPU usage per message flow. This can be used to estimate additional system resource requirements if the workload for one in a set of integrations is expected to increase by a given factor.
This data needs to be collected alongside system resource metrics over a period of weeks, months and possibly years to understand the workload requirements and characteristics. Because it offers a longer, configurable collection interval, archive data is the recommended approach, with at least message flow level data. The collection interval should be short enough to identify spikes in workload and resource usage, but long enough that too much data is not generated. Typically a collection interval of between 5 and 15 minutes is used, with data compressed or aggregated over time to reduce storage requirements; products such as ITCAM or IBM Tivoli OMEGAMON XE for Messaging for z/OS provide features to enable this.
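The per-flow estimate described above can be sketched as follows. This is an illustration only: the flow names, CPU costs and message rates are invented, and the calculation assumes that CPU cost per message stays constant as the workload grows, which should be confirmed by measurement.

```python
from typing import Dict


def cpu_share_per_flow(
    cpu_ms_per_msg: Dict[str, float],
    msgs_per_sec: Dict[str, float],
    cores: int,
) -> Dict[str, float]:
    """Estimate the fraction of total CPU capacity consumed by each flow."""
    capacity_ms_per_sec = cores * 1000.0  # CPU milliseconds available each second
    return {
        flow: cpu_ms_per_msg[flow] * msgs_per_sec[flow] / capacity_ms_per_sec
        for flow in cpu_ms_per_msg
    }


def projected_total(shares: Dict[str, float], flow: str, factor: float) -> float:
    """Total CPU fraction if the workload of one flow grows by `factor`."""
    return sum(
        share * (factor if name == flow else 1.0)
        for name, share in shares.items()
    )


# Hypothetical measurements taken from message flow level statistics.
cpu_ms = {"orders": 2.0, "billing": 5.0}
rates = {"orders": 100.0, "billing": 20.0}

shares = cpu_share_per_flow(cpu_ms, rates, cores=4)
current = sum(shares.values())
tripled_orders = projected_total(shares, "orders", 3.0)
print(f"current: {current:.1%}, with 3x orders: {tripled_orders:.1%}")
```

Comparing the projected figure with what system level tools report for the whole machine shows how much headroom remains for the other integrations.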
Situation alert monitoring in production
Using monitoring tools (such as ITCAM and IBM Tivoli OMEGAMON XE for Messaging for z/OS) that utilise accounting and statistics data you can set up situation monitoring based on thresholds so that an alert is automatically generated if any metrics exceed their given threshold. This can be utilised to provide alerts when a message flow exceeds a given CPU or elapsed time expectation, or has a non-zero processing error, MQ error or back-out count for example. Node level data can also be utilised in a similar way, to alert if a request node invoking a 3rd party service takes longer than expected to respond for instance. And if you are collecting the most detailed data you can even be automatically alerted if a failure terminal is used – indicating that a message has flowed down a failure path of a message flow.
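As a rough sketch of threshold-based situation monitoring, the check below compares one interval of statistics with a set of upper limits. The metric names (`avg_elapsed_ms`, `mq_error_count` and so on) are hypothetical stand-ins rather than the field names published by IBM Integration Bus, and the monitoring products named above implement this in their own way.

```python
from typing import Dict, List


def check_thresholds(stats: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return one alert message for every metric that exceeds its limit."""
    alerts = []
    for metric, limit in limits.items():
        value = stats.get(metric)
        # Metrics missing from this interval are skipped rather than alerted on.
        if value is not None and value > limit:
            alerts.append(f"{metric}={value} exceeds threshold {limit}")
    return alerts


# Hypothetical statistics for one message flow over one collection interval.
interval_stats = {
    "avg_elapsed_ms": 120.0,
    "avg_cpu_ms": 3.0,
    "backout_count": 0,
    "mq_error_count": 2,
}
# A limit of 0 on a counter alerts on any non-zero value.
limits = {
    "avg_elapsed_ms": 100.0,
    "avg_cpu_ms": 5.0,
    "backout_count": 0,
    "mq_error_count": 0,
}

alerts = check_thresholds(interval_stats, limits)
for alert in alerts:
    print(alert)
```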
Additional products such as IBM Operations Analytics – Predictive Insights look to identify relationships between accounting and statistics metrics and anticipated bounds based on historical monitoring. Alerts can then be generated when any metrics exceed expected bounds or when related metrics diverge. A big advantage here is that the alerting is learnt through analysis of historical data so you do not need to manually set thresholds.
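To illustrate the idea of bounds that are learnt rather than set by hand, the sketch below derives a band from historical values as the mean plus or minus a multiple of the standard deviation. This is a deliberately simple stand-in chosen for illustration; it is not a description of how IBM Operations Analytics – Predictive Insights works, and the sample values are invented.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def learn_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive (low, high) bounds as mean +/- k standard deviations."""
    centre = mean(history)
    spread = pstdev(history)
    return centre - k * spread, centre + k * spread


def outside_bounds(value: float, bounds: Tuple[float, float]) -> bool:
    """True when a new observation falls outside the learnt bounds."""
    low, high = bounds
    return value < low or value > high


# Hypothetical average elapsed times (ms) from earlier collection intervals.
history = [10.0, 12.0, 11.0, 9.0, 13.0]
bounds = learn_bounds(history)

print(outside_bounds(12.0, bounds))  # within the historical range
print(outside_bounds(20.0, bounds))  # well above the historical range
```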
The archive data collection option is recommended for situation alert monitoring in production. The collection interval needs to be short enough that alerts are generated as quickly as possible, but long enough that too much data is not generated. Typically a collection interval of between 1 and 5 minutes is used depending on requirements. Data is usually stored for several weeks but is often subsequently compressed / aggregated and can be used to identify trends and additionally guide capacity planning.
You can use accounting and statistics data to record the load that particular applications, partners or other users put on the system. This allows you to record the relative use of different users and perhaps charge them accordingly. For example, you could levy a nominal charge on every message processed by an integration node or specific message flow.
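A per-message charge of this kind can be sketched as a simple aggregation. The record layout, the user names and the charge are all hypothetical; in practice the grouping key would come from however you identify users in the collected data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def charges_by_user(
    records: Iterable[Tuple[str, int]],
    charge_per_message: float,
) -> Dict[str, float]:
    """Sum message counts per user and apply a flat charge per message."""
    totals: Dict[str, int] = defaultdict(int)
    for user, message_count in records:
        totals[user] += message_count
    return {user: count * charge_per_message for user, count in totals.items()}


# Hypothetical (user, messages processed) pairs, one per collection interval.
records = [("PartnerA", 1200), ("PartnerB", 300), ("PartnerA", 800)]

charges = charges_by_user(records, charge_per_message=0.01)
for user, amount in sorted(charges.items()):
    print(f"{user}: {amount:.2f}")
```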
The archive data collection is recommended for this use case and the collection interval should be set appropriate to requirements – message flow level data is typically adequate. The accounting origin can be utilised to track usage of particular users, more information can be found here:
There are several factors that should be considered when configuring the collection option and level of data that will dictate the amount of data generated. Accounting and Statistics should be configured appropriately for your monitoring requirements and available resources. Data compression / aggregation should be employed to reduce associated resource usage as the data ages.
The following table provides an indication of the amount of data that may be generated per message flow, per node and per thread. The XML element column shows an estimate of the size of the XML element that contributes to the published message, and the attribute values column shows an estimate of the size of the data within that element that will likely be stored in a data warehouse (as tag and attribute names will likely be used as table and column identifiers rather than stored each time):
|Data Level|XML Element (approx. bytes)|Attribute Values (approx. bytes)|
|---|---|---|
|Node (Advanced)|300 + (#terminals x 100)|50 + (#terminals x 20)|
NB. The basic node level data is primarily driven by the length of the names of the nodes. If a node is within a sub-flow then the attribute value for the node name is given a prefix of the sub-flow name as well. The advanced node level data is equivalent to the basic level with additional entries for each terminal on the node.
As an example, if you have a single message flow with 50 nodes and 5 additional instances, and enable message flow level data, node level data (advanced) and thread level data then, if we assume an average of 3 terminals per node, you would expect to generate approximately 33K of data at each publication interval – 1K message flow level, 15K node level data (basic), 15K terminal level data and 2K thread level data. Of this approximately 6K would be expected to be stored in the data warehouse.
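The node level part of this example can be reproduced from the advanced node formulas in the table. The sketch below applies them to 50 nodes with 3 terminals each; the function name is hypothetical and only the node level contribution is calculated.

```python
from typing import Tuple


def advanced_node_sizes(terminals: int) -> Tuple[int, int]:
    """Approximate (XML element, attribute values) bytes for one node."""
    xml_bytes = 300 + terminals * 100
    attribute_bytes = 50 + terminals * 20
    return xml_bytes, attribute_bytes


nodes = 50
xml_per_node, attr_per_node = advanced_node_sizes(terminals=3)

total_xml = nodes * xml_per_node    # basic node data plus terminal data
total_attr = nodes * attr_per_node  # portion likely to reach the warehouse

print(total_xml, total_attr)
```

The 30,000 bytes of XML correspond to the 15K of basic node data plus the 15K of terminal data quoted above, and the 5,500 bytes of attribute values account for most of the approximately 6K stored in the data warehouse.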
With a 1 minute collection interval this message flow would generate ~8.6MB of warehoused data per day (approximately 6K at each of 1,440 intervals); if there were 50 similar flows then ~430MB of data would be generated daily. Changing the node level data from advanced to basic so that the terminal level data is not collected would reduce this from ~430MB to ~210MB, and subsequently increasing the collection interval to 2 minutes would reduce the data generated to ~105MB per day.
The IBM Tivoli OMEGAMON XE for Messaging for z/OS documentation also provides details of historical table record sizes that are used to warehouse the accounting and statistics data as well as resource statistics data.
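The daily figures can be checked with a short calculation. The sketch assumes approximately 6,000 bytes of warehoused data per interval with advanced node data, and approximately 3,000 bytes with basic node data (6,000 less 150 terminals at 20 bytes each, from the table); it uses 1MB = 1,000,000 bytes, so the results land slightly above the rounded figures quoted in the text.

```python
def daily_bytes(bytes_per_interval: float, interval_minutes: float, flows: int = 1) -> float:
    """Bytes generated per day for `flows` similar message flows."""
    intervals_per_day = 24 * 60 / interval_minutes
    return bytes_per_interval * intervals_per_day * flows


MB = 1_000_000

one_flow = daily_bytes(6000, interval_minutes=1)
fifty_flows = daily_bytes(6000, interval_minutes=1, flows=50)
basic_nodes = daily_bytes(3000, interval_minutes=1, flows=50)
two_minutes = daily_bytes(3000, interval_minutes=2, flows=50)

for label, value in [
    ("one flow, advanced, 1 min", one_flow),
    ("50 flows, advanced, 1 min", fifty_flows),
    ("50 flows, basic, 1 min", basic_nodes),
    ("50 flows, basic, 2 min", two_minutes),
]:
    print(f"{label}: {value / MB:.1f} MB/day")
```

Halving the per-interval data and doubling the interval each halve the daily volume, which is why these two settings are the main levers for controlling storage.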
Consideration should be taken for the number of message flows, nodes within the flows and number of message flow threads in addition to the collection interval. To reduce the amount of data being generated consider increasing the collection interval or changing what data is collected.
IBM Integration Bus provides a cohesive and comprehensive performance data collection facility that allows message flow performance to be profiled for both short and long time periods. This performance data can be used in a number of different use cases ranging from problem determination to capacity planning.
Enabling archive statistics and accounting data collection in production systems enables historic monitoring of application behaviour over time, identification of message flows that may require attention and a mechanism to enable a charge-back model. Tune the level of data collection and interval to your requirements taking consideration of system complexity and the amount of data generated. If finer granularity collection intervals are required then other monitoring features of IBM Integration Bus may be more appropriate.
Snapshot data collection enables finer grained problem determination and performance optimisation. Dynamically switching snapshot data collection on in a production environment allows for a closer look at specific message flows that require attention.