Hi, currently on QRadar v7.2.8.
I have a question regarding best practice on EPS throttling. Say I have a deployment EPS license limit of 150000. Say I have 400 Log Sources. Logic would tell me to simply divide 150000/400 and configure Admin>Log Source>EPS throttle to no more than 375 per LS. But what about Log Sources that are not configurable via the UI? How are these handled? Are the other EPS throttle limits reduced to accommodate for these Log Sources? What is best practice around this?
Thanks in advance!
Answer by JonathanPechtaIBM (8183) | Jan 23, 2017 at 08:21 AM
Before you throttle to disk, there is a 50,000-event memory buffer for event data. Once the memory queue is full, the system starts writing to the on-disk event buffer. Looking through the dev notes on our event buffer, this 5GB buffer will hold a maximum of 10 million events assuming 1k data sizes (250 bytes of QRadar data and 750 bytes of payload).
You can use the following calculation to determine the number of seconds that a burst can be sustained before storage is at capacity. If you are doubling or tripling your EPS, it may well take only a matter of minutes for a sustained burst to fill 5GB worth of data. It depends on the burst: does it drop back under license at all, or is it sustained over a long period of time?
Size of spillover buffer / (EPS above license * event size) = # of seconds of storage
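To make the formula above concrete, here is a minimal Python sketch. The function name and the sample overage rates are made up for illustration, and the 5GB buffer and 1k event size are simply the figures quoted in this answer (treated here as decimal units), so substitute your own numbers.

```python
def burst_seconds(buffer_bytes, eps_over_license, event_bytes):
    """Seconds a sustained burst can last before the spillover buffer fills.

    Implements: buffer size / (EPS above license * event size).
    """
    if eps_over_license <= 0:
        # At or under license, nothing spills over, so the buffer never fills.
        return float("inf")
    return buffer_bytes / (eps_over_license * event_bytes)


# Assumed figures from the thread: 5GB on-disk buffer, 1k average event size.
BUFFER_BYTES = 5 * 1000**3
EVENT_BYTES = 1000

# Hypothetical overage rates (events per second above the license limit).
for over in (1_000, 10_000, 50_000):
    secs = burst_seconds(BUFFER_BYTES, over, EVENT_BYTES)
    print(f"{over} EPS over license: {secs:.0f} s (~{secs / 60:.1f} min)")
```

Larger average event sizes shorten these times proportionally, so it is worth checking your real payload sizes before relying on the estimate.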
In QRadar 7.2.6, we changed the disk queues to use more, smaller files to improve performance (2,500 files x 2MB/file). These files are on disk in /store/transient/spillover/queue/ecs-ec.
The qradar.log file records the number of files in the queue, which administrators can check at any time:
Feb 27 01:19:10 ::ffff:172.16.16.12 [ecs-ec] [[type=com.q1labs.semsources.filters.QueuedEventThrottleFilter][parent=example1.ibm.lab:ecs-ec/EC/Processor1]] com.q1labs.semsources.filters.QueuedEventThrottleFilter: [INFO] [NOT:0000006000][172.16.16.12/- -] [-/- -] (Current events spillover: 1; Events added last 60 seconds: 109190; Events removed last60 seconds: 109189; Files in use/max: 1/50; Remaining capacity: 10240000)
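If you want to watch these counters without reading the log by eye, a rough Python sketch like the one below can pull them out of a QueuedEventThrottleFilter line. The regex is based only on the sample line above, so the exact field layout on other versions is an assumption, and the function and variable names are my own. Treating "files in use" equal to "max" as a full disk queue is also an assumption, not documented behavior.

```python
import re

# Pattern derived from the sample log line in this thread; \s* tolerates
# both "last 60 seconds" and "last60 seconds" as they appear there.
SPILLOVER_RE = re.compile(
    r"Current events spillover: (?P<spillover>\d+); "
    r"Events added last\s*60 seconds: (?P<added>\d+); "
    r"Events removed last\s*60 seconds: (?P<removed>\d+); "
    r"Files in use/max: (?P<in_use>\d+)/(?P<max>\d+); "
    r"Remaining capacity: (?P<remaining>\d+)"
)


def parse_spillover(line):
    """Return the spillover counters from one log line, or None if absent."""
    match = SPILLOVER_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# Tail of the sample line quoted above.
sample = (
    "(Current events spillover: 1; Events added last 60 seconds: 109190; "
    "Events removed last60 seconds: 109189; Files in use/max: 1/50; "
    "Remaining capacity: 10240000)"
)

stats = parse_spillover(sample)
# Assumption: the disk queue is full once every file is in use.
queue_full = stats["in_use"] >= stats["max"]
print(stats, "queue full:", queue_full)
```

Run against each matching line in qradar.log, this gives a simple way to see whether events are being added faster than they are removed over time.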
Hope this helps...let us know if you have follow-up questions.
So if I am seeing Files in use/max: 2500/2500 that means that the memory buffer has been exhausted and the file buffer is currently full. Correct?
I am of the understanding that this is the point where event data is dropped and not processed.
Answer by JonathanPechtaIBM (8183) | Dec 08, 2016 at 11:45 AM
@SCCDAdmin
Data that comes in is absorbed by the pipeline, and if you go over license, the system buffers events to memory, then to disk, as part of the burst handling feature in our event/flow pipeline. When your EPS rate drops back below your license limit, we use the spare EPS in the license to empty the buffer and clear out events from the burst (spillover) queues.
Depending on the protocol, you can throttle some connections to limit the speed at which QRadar takes in data. For example, the Log File protocol for retrieving ASCII/flat files of events has a throttle, as do JDBC, WinCollect and several others. These throttles are typically used when customers need to limit incoming data so that large imports do not spike the EPS rate far above their license. Syslog events are not throttled; we take in that data as fast as it is received. You can either let burst handling take care of any overages from Syslog sources or direct some of that EPS to another QRadar Event Processor. More info about burst handling can be found here: Event & Flow Burst Handling
It is really going to come down to your network and how you split that 150k EPS license across your EPs. Some EPs might be more active than others and need more EPS, so you would give more license capacity to appliances that sit at the network boundary and take in firewall events, or that otherwise receive more data than others. Your sales rep can help you balance your deployment if you are curious about where to place license capacity. They have some spreadsheets for estimating EPS rates that you might find helpful.
In QRadar, there is really no need to throttle individual log sources; just stay below your overall EPS rate. We buffer up to 5GB of events and 5GB of flow data in separate burst handling queues. So, there is not much to worry about when going over EPS, as long as you don't spend an extended amount of time over your license and fill the burst handling buffers.
For protocols like JDBC or the Log File protocol, which can import large amounts of data, we typically suggest that you either avoid running the imports at the same time or enable throttles. Some DBs that are only collected nightly can generate large spikes in EPS as they import data. The same goes for log files from systems whose logs are only collected daily. Staggering the start times prevents two large data sources from being imported simultaneously and causing unnecessary buffering.
As of QRadar 7.2.6, you can also drop events from QRadar that are unwanted. These events are added back in to your license at the next interval up to 2,000 EPS per appliance. So, if you have unwanted data, you can reclaim some of that EPS by dropping unwanted events. If you had 4 QRadar Event Processors in your deployment, you could be reclaiming up to 8,000 EPS across those appliances. We have an Open Mic topic on how this license change works.
tl;dr: Not sure if I answered your question or not, but in most cases QRadar has features to deal with systems that go over license at the appliance level via burst handling. There isn't much need to micro-manage license per log source; just watch for spikes and instances where you might go way over license, and try to balance things out across your existing 150k EPS for the deployment.
If you have follow-up questions, let us know.
Answer by I_Heart_IBM (33) | Dec 27, 2016 at 11:03 AM
Thank you so much for the detailed response, Jonathan. That provides some clarity and is helpful. Thanks again.
Answer by Antanas_V (1) | Jan 23, 2017 at 07:04 AM
This looks nice, but only in theory. 5 GB seems like it could hold a lot of data, but in reality it doesn't feel that way. In practice, events seem to be dropped within a very short period of time (minutes) after exceeding the license, while QR support keeps giving the same links and tries to justify the drops rather than looking for the problem.
So, for 5 GB: how many events, on average, should it be able to store before it starts dropping?