The purpose of this document is to describe the information you need to gather in order to ensure a smooth deployment of a large IBM Operations Analytics – Log Analysis installation. This “best practice” information is based on a recent large consolidation project involving thousands of different log data sources.

It covers the information needed for each log type and log format, the decisions that have to be made about the logs (loading, indexing, tuning, data retention, data caches, etc.), and how to organize the overall log ingestion.

Documenting log information

When deciding which logs are going to be integrated, you must gather information about the following items:

For each log file type:

  • File path and naming convention for the file path (useful for logs that have the same name but are located in different directories).

For example, if you have several WebSphere instances on the same server, the profile name might allow you to differentiate the SystemOut.log of each profile. You then need to ensure that this information is passed to Log Analysis to uniquely identify the log across all ingested logs (both within the same server and across all servers). In this case, we might have to complement the log field with, for example, the hostname, the WebSphere profile name (or part of the subdirectory path), and the log name to uniquely identify it.
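For illustration, one common approach (a minimal sketch, not the only option) is to have the IBM Log File Agent forward the hostname and the full file path with every record, so that the combination of hostname, subdirectory path (which contains the profile name), and log name uniquely identifies the source. The paths below are hypothetical and the exact .conf/.fmt keywords should be verified against the LFA documentation:

    # Hypothetical LFA .conf excerpt: monitor SystemOut.log across all profiles
    LogSources=/opt/IBM/WebSphere/AppServer/profiles/*/logs/*/SystemOut.log

    // Hypothetical LFA .fmt excerpt: pass hostname and full path with each record
    REGEX AllRecords
    (.*)
    hostname LABEL
    -file FILENAME
    logpath PRINTF("%s",file)
    text $1
    END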

  • File format:
    • How many different log file formats are going to be ingested?
    • Provide each log file format with an explanation and samples.
    • Description of the file headers.
    • Single-line or multi-line records?
    • Is there a field delimiter?
    • New-line separator (LF, CRLF, none, etc.).
    • Date format and timestamp format.
    • Any special characters.
    • Type of each data field (TEXT, INTEGER, DATE, LONG, ...).
  • Does the log use log rotation? How often? On very busy logs, customers might put in place a log rotation mechanism that compresses logs after a while (see the sketch below). This information is useful, for example, to know the maximum time a collecting agent can be stopped before the log it was reading is compressed. Fortunately, most IBM Log File Agent parameters are updated dynamically without having to restart the agent, with the exception of only a few (such as the disk cache buffer parameter BufEvtMaxSize, which requires restarting the agent and cleaning the disk cache before it can be changed).
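As a point of reference, log rotation on the source host is often driven by a standard logrotate configuration similar to the sketch below (the path and values are hypothetical); the rotation frequency and the compress/delaycompress options determine how long a stopped agent has before the file it was reading gets compressed:

    /opt/app/logs/*.log {
        daily
        rotate 7
        compress
        delaycompress      # keep the most recent rotated file uncompressed for one extra cycle
        missingok
    }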
  • Load volumes and characteristics
    • Provide the anticipated daily volume (average and min/max) and the characteristics of the load over the day for the log.

Figure: average log volume over the day

    • Provide the maximum load volume (for example at peak hour, or during error handling such as stack traces). As shown in the accompanying graph, the load usually follows a Gaussian-like curve during the day. The maximum load is useful for tuning the parameters of the agent collecting the data and of the whole ingestion pipeline.
    • How long do you expect to store log information locally in the collecting agent's cache if a failure occurs between the log agent and the ingestion pipeline (for example, a network failure)? Most log monitoring agents allow you to specify a local disk cache so that data is not lost in case of a network failure. For example, the IBM Log File Agent has parameters to specify how much data can be cached on disk in case of a network failure (BufferEvents=YES and BufEvtMaxSize=xxxx); see the sketch below.
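A minimal sketch of the corresponding IBM Log File Agent .conf settings (values are examples only and must be derived from your own peak volume and outage tolerance):

    # Buffer events on disk while the ingestion pipeline is unreachable
    BufferEvents=YES
    # Maximum size of the disk cache file, in KB (example value; changing it requires
    # restarting the agent and cleaning the existing cache, as noted above)
    BufEvtMaxSize=102400
    # Hypothetical location of the cache file
    BufEvtPath=/opt/IBM/LFA/cache/myapp-systemout.cache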
  • Indexing data
    • Choosing what data to index is an important task, as indexing has a performance impact during ingestion and when rendering results through facet counting. As a best practice, index around 25 % of the fields if possible.
    • For each field, determine whether it has to be indexed. IOA-LA supports various index options (a simplified sketch follows this list):
      • Retrievable: determines whether the contents of this field are stored in the index for retrieval.
      • Sortable: enables or disables sorting and range queries on this field.
      • Filterable: enables or disables facet counting and filtering on this field.
      • Searchable: controls whether the field is enabled for searching/matching against it.
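For illustration, these options are declared per field in the index configuration of an Insight Pack; the simplified, hypothetical sketch below shows a single field (check the exact attribute names against the Insight Pack tooling you use):

    "severity": {
        "dataType": "TEXT",
        "retrievable": true,
        "sortable": false,
        "filterable": true,
        "searchable": true
    }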
  • Log shipping approach:

Determine which agent will be used to ship logs from the source host to IOA-LA: IBM Log File Agent (LFA), rsyslog, Logstash, scripts, FTP, the HTTP REST API, OPSEC (Check Point), scp, Beats, a Windows Event Log centralization mechanism, and so on.
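For example, when rsyslog is selected for operating system logs, forwarding to an intermediate collection layer can be as simple as the sketch below (host and port are hypothetical):

    # /etc/rsyslog.d/forward.conf (hypothetical)
    *.*  @@collector.example.com:5514    # @@ = forward over TCP; a single @ would use UDP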

  • Record size and documents (PDF, Word) embedded in logs
    • What is the maximum record size anticipated? Determine the largest record in the log file: with Java stack traces or customer traces, a log record can be large. Both for the IBM Log File Agent and the IOA-LA Generic Receiver, it is important to set the correct value to ensure the largest record can be ingested. For the IBM Log File Agent, this is specified by the EventMaxSize parameter in the configuration file (see the sketch after this list).
    • Do you expect some records to contain documents such as PDF or Word files? If so, what are the tags that allow such content to be removed from the log record (for example, an XML tag within a log record)? In a recent project, in the development environment or when the application was running in debug mode, developers occasionally generated PDF documents to understand what the software application had produced. In such cases, it is a good idea to remove those documents from the log record by using appropriate tags that delimit the embedded document, so that that part can be stripped from the record.
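A sketch of the record size setting on the IBM Log File Agent side (the value is an example only; derive it from the largest stack trace or trace record you expect, and check the unit and default in the LFA documentation):

    # Maximum size of a single log record the agent will forward (example value)
    EventMaxSize=65536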
  • Documenting and monitoring the log ingestion rate

With thousands of logs ingested, it becomes important to be able to monitor log ingestion and to know when a log has stopped ingesting for whatever reason. For each kind of log, it is important to specify how often log entries should be expected. IOA-LA has a mechanism to alert when log ingestion is missing.

  • Log and record filtering / retention
    • How long do log records have to be kept? (IOA-LA usually keeps 30 days of data in the Solr backend and longer-term data in Hadoop.)
    • Many web servers are monitored through heartbeat URLs/health checks. This potentially generates a high number of log entries that are not really useful. It may be worth filtering out that data before it is ingested (see the Logstash sketch below).
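A minimal Logstash filter sketch for dropping such heartbeat entries before they are sent to IOA-LA (the URL pattern is hypothetical):

    filter {
      # Drop health-check requests so they are never indexed
      if [message] =~ /GET \/healthcheck/ {
        drop { }
      }
    }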
  • Dashboard requirements

For each log type, specify what kind of information is to be presented in a dashboard. A typical dashboard usually includes error trend volume (errors, warnings, ...), load volume over time, and performance information such as response time or usage.

  • Aliasing

Raw log field information is sometimes not meaningful for all the teams that are going to use IOA-LA. For example, in a recent project, some teams were using the hostname while other teams were instead using a “business hostname”. IOA-LA allows you to define alias mappings in a JSON file (aliasSource.json).

  • Alerting
    • Define, for each type of log, the alerts that are worth triggering. IOA-LA supports the following alert conditions:
      • Keyword match: look for one or more keywords.
      • Keyword match based on a threshold: look for one or more keywords that occur more than a specific number of times in a specific time window.
      • Keyword match with de-duplication: look for one or more keywords, with de-duplication.
      • Co-occurrence match: look for two or more keywords that occur during the same time period.
      • Anomaly match, and anomaly match with threshold.
    • Define how alerts are to be sent. IOA-LA supports the following alert actions: indexing alerts in the _alerts data source, sendmail, writing to a log, running a script, sending an SNMP trap, and EIF/Netcool OMNIbus.

Overall organization of log files

Once each individual file type has been documented, an overall review of all log ingestion has to be done.

Overall log ingestion throughput characterization for each server.

On each server:

What is the maximum load anticipated overall for a particular file and for a particular server?

How long do you want the LFA to keep data locally if the IOA-LA/Logstash/intermediate layer server is stopped or there is a network failure? How should the cache files for ALL logs on that server be sized to meet this requirement?

IBM Log File Agent naming convention: the LFA uses the first 15 characters of a log file name as a way to uniquely identify a log. It is important to ensure that the LFA configuration name is unique on a given server to avoid naming collisions.

As you can see in the following figure, taken from an online application, the load can differ greatly from one server to another.

Figure: Load breakdown by hostname (only showing the first 37 hosts)

Global log ingestion throughput characterization

What is the maximum load anticipated for the solution?

When you expect a big application problem, what is the maximum load the logs will generate?

When designing the ingestion pipeline, do not assume that the log load is evenly distributed.

It is quite common that only a small number of logs generates most of the load.

For example, the entry point of a web application is in charge of all security aspects (authorization, authentication, accounting). This is generally handled by a few servers/logs generating a high volume of data, as you can see in the following graph. In this case, fewer than 10 logs were generating around 76 % of the total log load. This also means that for those high-throughput logs, it is important to size the IBM Log File Agent disk cache (see the disk cache sizing section for more information) and to fine-tune the IBM Log File Agent parameters to sustain the load. One IBM Log File Agent parameter to tune in particular is MaxEventQueueDepth.

As you can see in the following graph, out of around 10 000 logs, only a few were generating most of the load.

Figure: Log instance breakdown by volume percentage

Four logs were actually the entry points (authentication, authorization, accounting) of the application. Other important logs were business logic entry points further down the application architecture. For this specific set of logs, the agent collecting the log should be carefully tuned in order to sustain such a high load on a single file (a tuning sketch follows).
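A sketch of the IBM Log File Agent parameters typically reviewed for such high-throughput files (values are examples only):

    # Deeper in-memory event queue to absorb bursts (example value)
    MaxEventQueueDepth=10000
    # Disk cache sized from the peak volume of this particular log (example value, in KB)
    BufferEvents=YES
    BufEvtMaxSize=204800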

Log consolidation:

As there might be a high number of logs across everything monitored for an application, log consolidation has to be put in place. Log consolidation adds one or more abstraction or grouping levels so that the log hierarchy can be navigated more easily. In a recent project, we ended up with 500 logs on some servers, and the whole application was composed of around 10 000 logs. It would be hard to navigate a standard hierarchy listing thousands of logs, so the decision was made to complement the log content with some metadata that makes it easier to navigate the data.

To consolidate multiple logs into a single data source, you enrich the individual logs with metadata about their source and content. The IBM Log File Agent can add that metadata. You can then combine multiple logs into one instance of Logstash. After Logstash annotates the log messages and the metadata, it sends the consolidated stream of log data to the Log Analysis server for indexing. For searches, users can search through multiple log files by querying only one consolidated data source, then drilling down from the facet results to the individual logs using the additional metadata.
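A minimal Logstash sketch of the consolidation idea: metadata added for each log (here in a filter, with hypothetical field names and values) is indexed together with the record, so that users can query one consolidated data source and then drill down by application, component, or original log path:

    filter {
      mutate {
        add_field => {
          "application" => "online-banking"      # hypothetical grouping level 1
          "component"   => "authentication"      # hypothetical grouping level 2
          "origin_path" => "%{path}"             # original file path kept for drill-down
        }
      }
    }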

Refactoring

Can some logs be refactored (that is, given a common format) to simplify the environment? Is there some log content that can be refactored in order to reduce the number of log types ingested into the product?

End-to-end tracking: refactoring fields after collecting all log formats. When monitoring a large application, the log formats come from various departments that do not always use the same naming convention to represent the same information (for example hostname/IP address, customer number, Log4j Mapped Diagnostic Context (MDC), Log4j Nested Diagnostic Contexts (NDC), correlation ID, ...). In order to be able to search across logs (for example, to track an end user), it is important to define common field names across the different log file formats (data source types).

Access control and security

Security in IOA-LA is controlled at the data source level: care should be taken when refactoring logs or defining data sources, as IOA-LA defines role-based access by file type (data source level).

Size of the environment

How many servers are in the scope of log consolidation?

How many logs are to be monitored? This number might be larger than anticipated, especially with micro-service architectures that tend to break applications down into very small sub-components, each using its own log.

  • For example, in an online application, the number of logs could be around 500 on some kinds of servers, even though there may be fewer than 20 different types of them (different file formats). With around 56 servers, this leads to ingesting around 10 000 logs for one application (around 90 GB/day).

Disk cache sizing for all logs going through the ingestion pipeline

If, for any reason, a failure occurs in the ingestion pipeline, disk caches have to be configured so that data can be saved.

Disk cache at the collecting log agent (source server):

Define for each server the size of the disk filesystems that will hold cached data in case of a communication problem between the agent and the ingestion pipeline. That size is a tradeoff between the space you want to allocate to the disk cache and the maximum duration you estimate a communication failure between the log agent and the ingestion pipeline can last. Basically, this is the sum of the information provided previously for each individual file. Special care should be taken with single files carrying a very high load and with servers handling a lot of files at the same time.
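As an illustrative calculation with hypothetical numbers: if the logs on a server peak at 300 MB per hour and you want to survive a 4-hour outage of the ingestion pipeline, the disk cache on that server must hold at least 300 MB × 4 = 1.2 GB, plus a safety margin; that total then has to be distributed across the per-log cache files (for example via BufEvtMaxSize) according to each log's share of the load.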

Disk cache on the Logstash server:

What is the maximum throughput anticipated for the whole ingestion pipeline?

If you are using Logstash, you can size its disk cache depending on how long you can tolerate a failure before the disk cache fills up.

Other application behaviors to evaluate when designing the log ingestion solution

An application can be split across two datacenters for high availability, load balancing, and agile methods. With agile development methods, customers usually ensure that the full load can run on only one datacenter, allowing the other set of servers to be updated with new application code. This results in large changes in the way the load is distributed. For example, you might have all the load in datacenter 1, then the load is transparently distributed evenly across the two datacenters, and after a while all the load is moved to datacenter 2. When designing the solution, you should ensure that all tuning parameters and disk caches are able to handle this scenario.

Data retention

IOA-LA has the concept of HOT, COLD, and FROZEN tiers. The HOT and COLD tiers are handled by the Solr backend. The FROZEN tier is handled by Hadoop (more information here: https://developer.ibm.com/itoa/2015/06/22/log-analysis-hadoop/).

Hot tier – This tier holds the most recently indexed data. A larger fraction of the data is stored in memory. Interactive search is supported, which results in faster searches with more memory and processor allocation. The indexed data is stored on Apache Solr.

Cold tier – This tier holds a few weeks or a couple of months of indexed data. It supports disk-based access with lower memory utilization than the Hot tier. Incremental searches are fast with moderate memory and processor allocation. The indexed data is stored on Apache Solr.

Frozen tier – This tier enables long-term storage of highly compressed data on the Hadoop File System (HDFS). This tier has low storage and memory requirements. You can search, report, model, and mine over historical data. Searches are scan-based and slower than in the Cold tier. Data on HDFS is partitioned by time and data source.

When designing the solution, determine, based on your needs, how much data you want to keep in the Hot tier (usually a few days) to improve response time when running a query. This is specified in the unitysetup.properties file with the HOT_TIER_PERIOD parameter, for example HOT_TIER_PERIOD=2 for two days.
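For example, to keep two days of data in the Hot tier, as mentioned above:

    # unitysetup.properties (excerpt)
    HOT_TIER_PERIOD=2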

