By Frank Quéau (fqueau@fr.ibm.com), Client Technical Professional

The purpose of this document is to present the information you need to gather in order to ensure a smooth deployment of a large log analysis project using IOA-LA. That information is based on a large consolidation project (thousands of logs) run recently. We will cover the information needed for each log type and log format, the decisions that have to be taken regarding the logs (load, indexing, tuning, data retention, data caches, etc.) and how to organize log ingestion overall.

Documenting log information

When deciding which logs are going to be integrated, you must provide
information for the following items.

For each log file type:

· File path and naming convention for the file path (useful for logs that have the same name
but are located in different directories). For example, if you have several
WebSphere instances on the same server, the profile name might allow you to
differentiate the SystemOut.log of each profile. You then need to ensure that this
information is passed to Log Analysis to uniquely identify the log across all
logs ingested (from both the same server and all servers). In our case, we might
have to complement the log fields with, for example, the hostname, the WebSphere profile name
(or part of the subdirectory path) and then the log name to uniquely identify it.
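As an illustration only (a hypothetical naming convention, not a product default), a data source could be named something like appsrv01_AppSrv01Profile_SystemOut.log, combining hostname, WebSphere profile name and file name so that the SystemOut.log files of two profiles on the same host remain distinguishable.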

 

· File format:

o   How many different log file formats are going to be ingested?

o   Give each log file format with an explanation and samples.

o   Description of the file headers.

o   One line or multi-line records?

o   Is there a field delimiter?

o   New line separator (LF, CRLF, none, etc.)

o   Date format and timestamp format.

o   Any special characters?

o   Type of each data field (TEXT, INTEGER, DATE, LONG, etc.)

 

· Does the log use log rotation? How often? On a very busy log, the customer might put in
place a log rotation mechanism that compresses the log after a while. This
information is useful, for example, to know the maximum time a collecting
agent can be stopped before the log is compressed. Fortunately, most IBM Log
File Agent parameters are updated dynamically without having to restart the agent,
with the exception of only a few (such as the buffer disk cache parameter
BufEvtMaxSize, which requires restarting the agent and cleaning the disk cache before
changing it).

 

 

· Load volumes and characteristics

o   Provide the anticipated daily volume on average, the min/max, and the characteristics of the load over the day for the
log.

o   Provide the maximum load volume (for example at peak hour, or during error handling such as stack traces). As shown in
the graph below, the load usually follows a "Gauss" (bell) curve during the day. The
maximum load is useful for tuning the parameters of the agent collecting data and
of the whole ingestion pipeline.

 

o   How long do you expect to store log information locally in the collecting agent cache if a failure
occurs between the log agent and the ingestion pipeline (for example a network
failure)? Most log monitoring agents allow you to specify a local disk cache so
as not to lose data in case of network failure. For example, the IBM Log File
Agent has parameters to specify how much data can be cached on disk in case of
network failure (BufferEvents=YES and
BufEvtMaxSize=xxxx).
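As a minimal sketch of the corresponding entries in an LFA .conf file (the size value is purely illustrative and must be derived from your own peak throughput and maximum outage assumptions):

    BufferEvents=YES
    # Illustrative value only: large enough to hold the data produced
    # during the longest outage you want to survive
    BufEvtMaxSize=102400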

 

· Indexing data

o   Choosing what data to index is an important task, as indexing has a performance impact both during ingestion
and when rendering results through facet counting. Best practice is to
index around 25% of the fields if possible.

o   For each field, determine if it has to be indexed. IOA-LA supports several indexing options:

§  Retrievable: determines whether the contents of this field are stored in the index for retrieval.

§  Sortable: enables or disables the field for sorting and range queries.

§  Filterable: enables or disables facet counting and filtering on this field.

§  Searchable: controls whether the field is enabled for searching/matching against it.

 

 

 

· Log shipping approach:

o   Determine which agent will be used to ship logs from the source host to IOA-LA: IBM Log File Agent (LFA),
rsyslog, Logstash, script, FTP, HTTP REST API, OPSEC (Check Point), scp,
Beats, Windows Event Log centralization mechanism, etc.

o   Recent versions of IOA-LA leverage a Kafka-based, highly reliable ingestion architecture as shown in the following
diagram.

 

· Record size and documents (PDF, Word) embedded in logs

o   What is the maximum record size anticipated? Determine the largest record in the log file: with Java stack
traces or customer traces, a log record can be large. Both for the IBM Log File Agent
and the IOA-LA Generic Receiver, it is important to set the correct value to ensure the
largest record can be ingested. For the IBM Log File Agent, this is specified by
the EventMaxSize parameter in the configuration file.
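A minimal sketch of the corresponding .conf entry (the value is illustrative only; it must exceed the largest single record, such as the longest stack trace, you expect to ingest):

    # Illustrative value only; check the unit and default in your LFA version
    EventMaxSize=65536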

o   Do you plan to have embedded documents, such as PDF or Word files, in some records? If yes, what are the tags
that allow such content to be removed from the log record (for example XML tags in a log
record)? In a recent project, in the development environment or when the
application was running in debug mode, developers occasionally generated
PDF documents to understand what the software application was producing. In
such cases, it is a good idea to remove those documents from the log record, by having
appropriate tags that delimit the document so that that part can be stripped
from the log record.

 

· Monitoring the log ingestion rate

o   With thousands of logs ingested, it becomes important to be able to monitor log ingestion and to have the
capability to know when a log has stopped ingesting, for whatever reason. For each
kind of log, it is important to specify how often log entries should be
expected. IOA-LA has a mechanism to alert when log ingestion is missing.

 

· Log and record filtering / retention

o   How long do log records have to
be kept? (IOA-LA usually keeps 30 days of data in the Solr backend and older data in
Hadoop.)

o   A lot of web servers are monitored through heartbeat URLs/health checks. That potentially generates a high
number of log entries that are not really useful. It might be interesting to filter out
those entries before they are ingested.

 

· Dashboard requirements

o   For each log type, specify what kind of information is to be presented in a dashboard. A typical dashboard
usually includes error trend volumes (errors, warnings, etc.), load volume over time,
and performance information such as response time or usage.

· Aliasing

o   Raw log field information is sometimes not meaningful for all teams that are going to use IOA-LA. For
example, in a recent project, some teams were using the hostname while other teams
were rather using a "business hostname". IOA-LA allows alias mappings to be
defined in a JSON file (aliasSource.json).

 

· Alerting

o   Define, for each type of log, the alerts that are worth triggering. IOA-LA supports the following
alert conditions:

§  Keyword match: look for one or more keywords.

§  Keyword match based on threshold: look for one or more keywords that occur more than a specific number of times in a specific time window.

§  Keyword match with de-duplication: look for one or more keywords, with de-duplication.

§  Co-occurrence match: look for two or more keywords that occur during the same time period.

§  Anomaly match, and anomaly match with threshold.

o   Define how alerts are to be sent. IOA-LA supports the following alert actions: index alerts in the _alerts
datasource, sendmail, write to log, script, SNMP trap, and EIF/Netcool OMNIbus.

 

Overall organization of log files

Once each individual file type has been documented, an overall review of all log
ingestion has to be done.

Overall log ingestion throughput characterization for each
server

On each server:

What is the maximum load anticipated overall, for a particular file and for a particular
server?

How long do you want the LFA to keep data locally if the IOA-LA/Logstash/intermediate
layer server is stopped or there is a network failure? How should the cache
files for ALL logs on that server be sized to meet the previous requirement?

IBM Log File Agent naming convention: the LFA uses the first 15 characters of a log
file name as a way to uniquely identify a log. It is important to ensure that the LFA
configuration names are unique on a given server to avoid naming collisions.

 

As you can see in the following figure, taken from an online application, the load can
differ widely depending on the server.

Figure 1: load breakdown by hostname (only showing first 37 hosts)

Global log ingestion throughput characterization

 

What is the maximum load anticipated for the solution?

When a major application problem occurs, what is the maximum load the logs
will generate?

When designing the ingestion pipeline, do not assume that the log load is evenly
distributed.

It is quite common that only a small number of logs generate most of the load.

For example, the entry point of a Web application is in charge of all
security aspects (Authentication, Authorization, Accounting). This is generally
handled by a few servers/logs generating a high volume of data, as you can see in the
following graph. In this case, fewer than 10 logs were generating around 76% of the
total log load. This also means that for those high-throughput logs, it is
important to size the IBM Log File Agent disk cache (see the disk cache sizing section for more
information) and fine-tune the IBM Log File Agent parameters to sustain the load. One
particular IBM Log File Agent parameter to tune is MaxEventQueueDepth, as sketched below.
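As a hypothetical sketch (the value is illustrative and should be derived from the peak event rate measured for these busy logs):

    # Illustrative value only: deeper event queue for very high-throughput logs
    MaxEventQueueDepth=10000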

 

As you can see in the following graph, out of around 10,000 logs, only a few logs were
generating most of the load.

Figure 2: log instance breakdown volume percentage

 

Four logs were actually the entry point (authentication, authorization, accounting)
of the application. Other important logs were business logic entry points further
down the application architecture. For this specific set of logs, the agent
collecting the log should be carefully tuned in order to sustain such a high load
for a single file.

 

Log consolidation:

As there might be a high number of logs across all monitored logs for an
application, log consolidation has to be put in place. Log consolidation allows
one or more abstraction or grouping levels to be added so as to navigate more easily
through the log hierarchy. In a recent project, we ended up having 500 logs on some
servers. The whole application was composed of around 10,000 logs. It would be hard
to navigate a standard hierarchy with a list of thousands of logs.
The decision was taken to complement the log content with some metadata that makes it
easier to navigate through the data. To consolidate multiple logs into a single
data source, you enrich the individual logs with metadata about their source
and content. The IBM Log File Agent can add that metadata. You can then combine
multiple logs into one instance of Logstash. After Logstash annotates the log
messages and the metadata, it sends the consolidated stream of log data to the
Log Analysis server for indexing. For searches, users can search through
multiple log files by querying only one consolidated data source, then drill
down from the facet results to the individual logs using the additional metadata.

 

Refactoring

Can some logs be refactored (i.e. given a common format) to simplify the environment?
Is there some log content that can be refactored in order to reduce the number of
log types ingested in the product?

End-to-end tracking: refactoring fields after collecting all log formats. When
monitoring a large application, log formats come from various departments that
do not always use the same naming convention to represent the same information
(e.g. hostname/IP address, customer number, Log4j Mapped Diagnostic Context
(MDC), Log4j Nested Diagnostic Context (NDC), correlation ID, etc.). In order to
be able to search across logs (for example to track an end user), it is important to
define common field names across the different log file formats (data source types).

Access control and security

Security in IOA-LA is controlled at the data source level: care should be taken when
refactoring logs or defining data sources, as IOA-LA defines role-based access by
file type (data source level).

Size of the environment

How many servers are in the scope of log consolidation?

How many logs are to be monitored? This number might be larger than anticipated, especially
with a microservice architecture, which tends to break an application down into very
small sub-components, each using its own log.

· For example, in an online application, the number of logs could be around 500 on some kinds
of servers, even though there may be fewer than 20 different types of them (different
file formats). With around 56 servers, this leads to ingesting around 10,000 logs
for one application (around 90 GB/day).

Disk cache sizing for all logs going through the ingestion
pipeline

If, for any reason, a failure occurs in the ingestion pipeline, disk caches have to
be configured so that data can be saved.

Disk cache at the collecting log agent (source server):

Define, for each server, the size of the disk filesystems that will hold the data cache in
case of a communication problem between the agent and the ingestion pipeline. That size
is a tradeoff between the space you want to allocate to the disk cache and the
maximum duration you estimate a communication failure between the log agent and the
ingestion pipeline can last. Basically, this is the sum of the information
provided for each individual file previously. Special care should be taken
with single files with a very high load and with servers handling a lot of files at the
same time.
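As a purely hypothetical worked example: if the busiest log on a server peaks at about 2 GB per day and you want to survive a 6-hour outage of the ingestion pipeline, that single log needs roughly 2 GB x 6/24 = 0.5 GB of cache; repeating this calculation for every log on the server and summing the results (plus a safety margin) gives the filesystem size to provision for the agent cache.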

Disk cache on the Logstash server:

What is the maximum throughput anticipated for the whole ingestion pipeline?

If you are using Logstash, you can size its disk cache depending on how long you
can wait before the disk cache fills up.

Other application behaviors to evaluate when designing the log ingestion solution

An application can be split across two datacenters for high availability, load balancing, and agile
delivery. With an agile development method, customers usually ensure the full load
can be run on only one datacenter, allowing the other set of servers to be
updated with new application code. This results in great changes in the way load
is distributed. For example, you might have all the load in Datacenter 1, then the
load is transparently distributed evenly across the two datacenters, and then after a
while all the load is moved to Datacenter 2. When designing the solution, you should
ensure that all tuning parameters and disk caches are able to handle this scenario.

Data retention

IOA-LA has the concept of HOT, COLD, and FROZEN tiers. The HOT and COLD tiers are handled by
the Solr backend; the FROZEN tier is handled by Hadoop (more information here:
https://developer.ibm.com/itoa/2015/06/22/log-analysis-hadoop/).

Hot tier: this tier holds the most recently indexed data. A larger fraction of the data
is stored in memory. Interactive search is supported, resulting in faster
searches with more memory and processor allocation. The indexed data is stored
on Apache Solr.
Cold tier: this tier holds a few weeks or a couple of months of indexed
data. It supports disk-based access with lower memory utilization than the
Hot tier. Incremental searches are fast with moderate memory and processor
allocation. The indexed data is stored on Apache Solr.
Frozen tier: this tier enables long-term storage of highly compressed
data on the Hadoop File System (HDFS). It has low storage and memory
requirements. You can search, report, model, and mine over historical data. The
searches are scan based and slower than the Cold tier. Data on HDFS is
partitioned by time and data source.

When designing the solution, determine, based on your needs, how much data you want to
keep in the Hot tier (usually a few days) to improve response time when running a
query. This is specified in the unitysetup.properties file, for example with the parameter
HOT_TIER_PERIOD=2
for two days.
