Request timing is a new Liberty feature which automatically identifies slow and hung requests and provides detailed information to help you find the root cause. This article explains how to use the request timing capability in WAS Liberty to detect and diagnose problems related to slowness of a request.
Until now, diagnosing the cause of slow web requests in application servers has often been a challenging endeavor. The request timing feature in WAS Liberty 220.127.116.11 makes it easier.
Traditional approaches for finding problems related to slowness of a request involve using a variety of tools in conjunction with trace and log information. Some examples are:
- Slow spots in the code are identified by profiling. Profiling is good at showing which methods are slow, but doesn’t generally show contextual details such as servlet parameters or SQL statements that are often needed. Not everyone is willing to run profilers on their production systems.
- Database query slowness requires identifying slow SQL queries. This can be done by enabling trace on the application server. Tracing causes overhead and reviewing trace output is not an easy task.
- Monitoring tools are used to detect bottlenecks in resource pools, such as the thread pool, JDBC connection pool etc. Monitoring can help you find certain problems, but won’t help you unless the root cause ties closely to a tracked metric.
The other key problem with the above approaches is that they generally require enabling some form of diagnostics and recreating problem scenarios – which is not always possible and which prolongs problem diagnosis.
Request timing is a new Liberty feature which automatically identifies slow and hung requests and provides detailed information to help you find the root cause. Request timing can run all the time – enabling slow and hung requests to be documented when they occur, without the need for problem recreation to collect more data.
Slow request detection
Request timing includes slow request detection. This capability is useful for identifying requests that have run for longer than a slow request threshold that you can configure. For each request that takes too long to complete, request timing prints a warning message along with the details captured for the request into the log file. A maximum of three messages are printed per slow request after which request timing stops monitoring the request.
The information captured for a request includes:
- The current stack for the thread that the request is using.
- A dump of the tree of events that happened during the request
Hung request detection
Request timing also includes hung request detection. Hung request detection works similarly to slow request detection. A request is considered to be hung when it has run for longer than a hung request threshold that you can configure. For a hung request, apart from printing a warning message along with the relevant details, request timing initiates Java cores. Three Java cores are generated for a hung request, each a minute apart.
How does request timing work?
Request timing is built on top of a new bytecode injected request probe framework. The request probe framework works by adding probes at predefined locations in Liberty components. Instrumented components generate events at these predefined probe points. The request probe framework tracks each request throughout its life in the Liberty server. The framework assigns each request a unique request ID and stores information about the events that happen throughout the request. Events are stored in a event tree, which is built as the request progresses.
Every event has the following attributes
- Event type, identifies the kind of an event and provides information about the event source.
- Context information, describes an instance of an event and provides information relevant to the event type. Depending on the event type this information will vary.
- Duration, captures the time taken by the event.
At the time of writing this article, there are event types for servlets, sessions and database connectivity. Please refer to the Liberty knowledge center for the latest and full list of supported event types.
Configuring slow and hung request detection
Enable request timing, including both slow and hung request detection, by adding the
requestTiming-1.0 feature to the feature manager in the Liberty
server.xml file as follows:
<featureManager</> <feature/>requestTiming-1.0</feature> </featureManager>
The default behavior for request timing is to consider any requests that take at least 10 seconds to be slow, and to consider any requests that take at least 10 minutes to be hanging. The request timing feature provides optional configuration options as follows:
<requestTiming includeContextInfo="true" slowRequestThreshold="10s" hungRequestThreshold="10m" sampleRate="1" />
slowRequestThresholdis the duration of time that a request can run before being considered slow. Default is 10 seconds. Set to
0to disable slow request checking.
hungRequestThresholdis the duration of time that a request can run before being considered hung. Default is 10 minutes. Set to
0to disable hung request checking.
sampleRateis the rate at which the sampling should happen for the slow request tracking. Default is every request (1).
includeContextInfoindicates if the context information details are included in the log output. Default is
How you configure the request timing feature depends on your use case.
Configuring request timing on production servers
For a production server, use request timing to watch for slow or hung requests that may warrant further investigation. A good time to tune request timing threshold values is during load testing. As with any pre-production performance testing, you should try to use a workload representative of what you are expecting to see in production, and should run on hardware that is as similar as possible to the target environment.
While the workload driving tool runs, try different values for the
slowRequestThreshold setting to find one that you like. Setting the value too low isn’t helpful — avoid picking a value so low that you wouldn’t care to investigate the resultant slow request warnings. For example, if you don’t think you’d care to investigate any requests that complete within 2 seconds you should probably pick a number bigger than 2 seconds for your setting.
For hung request detection, keep in mind that any requests that last longer than the
hungRequestThreshold will trigger 3 javacores to be created. Pick a number that isn’t too small or the system will create unnecessary javacores. Something in the range of 1m-10m is probably reasonable.
Configuring request timing in development and test environments
For development and test environments, use request timing to see all requests that are slow. Set the value of
slowRequestThreshold to a fairly low value. This will give you a feel for which of your servlets take the longest to run, and will give you a chance to investigate the slow request reports to see if there’s anything going on in your application that you weren’t expecting.
Note that setting the
0 disables slow request checking, and setting the
slowRequestThreshold to a small value increases the frequency with which the server will scan for slow requests. Consider setting the
slowRequestThreshold to something in the range of 300ms – 500ms.
You will normally leave the
sampleRate set to
1 and the
includeContextInfo set to
true unless you see a performance issue when running with slow request timing enabled. Increasing the
sampleRate makes the slow request detection skip requests; for example setting it to
100 would make the server only check 1 out of every 100 requests for slowness.
false causes the runtime to not get the context for events, which saves some CPU, but greatly reduces the value of the event trees. Note that hung request detection applies to all requests regardless of the
sampleRate setting (this is so that you’ll never miss a hung request, even one that happens due to a sporadic condition).
What output to expect
When a request in the server has been running for longer than the
slowRequestThreshold, you will get a slow request timing message written to your
[6/8/15 15:12:35:101 IST] 00000040 com.ibm.ws.request.timing.manager.SlowRequestManager W TRAS0112W: Request AABU1rUO7kq_AAAAAAAAAAA has been running on thread 0000003f for at least 2009.020ms. The following stack trace shows what this thread is currently running. at org.apache.derby.impl.sql.execute.ScalarAggregateResultSet.getNextRowCore(Unknown Source) at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(Unknown Source) at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(Unknown Source) at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(Unknown Source) at org.apache.derby.impl.jdbc.EmbedResultSet.next(Unknown Source) at com.ibm.ws.rsadapter.jdbc.WSJdbcResultSet.next(WSJdbcResultSet.java:2356) at com.ibm.websphere.samples.trade.direct.TradeDirect.resetTrade(TradeDirect.java:1884) at com.ibm.websphere.samples.trade.web.TradeConfigServlet.doResetTrade(TradeConfigServlet.java:65) at com.ibm.websphere.samples.trade.web.TradeConfigServlet.service(TradeConfigServlet.java:348) at javax.servlet.http.HttpServlet.service(HttpServlet.java:668) at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1285) .... at com.ibm.ws.http.dispatcher.internal.channel.HttpDispatcherLink$TaskWrapper.run(HttpDispatcherLink.java:955) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) at java.lang.Thread.run(Thread.java:857) The following table shows the events that have run during this request. Duration Operation 2011.614ms + websphere.servlet.service | TradeWeb | TradeConfigServlet?action=resetTrade 56.499ms websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | delete from holdingejb where holdingejb.account .. 0.411ms websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | delete from accountprofileejb where userid like .. 0.365ms websphere.datasource.psExecuteQuery | jdbc/TradeDataSource | select count(symbol) as "tradeStockCount" from .. 34.901ms websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | update accountejb set logoutCount=0 ..
When a request in the server has been running for longer than the
hungRequestThreshold, you will get hung request timing message written to your
[6/9/15 14:50:19:937 IST] 00000034 com.ibm.ws.request.timing.manager.HungRequestManager W TRAS0114W: Request AABoLUq4OHj_AAAAAAAAAAA has been running on thread 00000038 for at least 2001.732ms. The following table shows the events that have run during this request. Duration Operation 2002.355ms + websphere.servlet.service | TradeWeb | TradeConfigServlet?action=resetTrade 22.141ms websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | delete from holdingejb where holdingejb.account .. 0.433ms websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | delete from accountprofileejb where userid like .. 0.306ms websphere.datasource.psExecuteQuery | jdbc/TradeDataSource | select count(symbol) as "tradeStockCount" from .. 34.526ms websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | update accountejb set logoutCount=0 .. [6/9/15 14:50:19:943 IST] 00000034 com.ibm.ws.kernel.launch.internal.FrameworkManager A CWWKE0067I: Java dump request received. [6/9/15 14:50:20:684 IST] 00000034 com.ibm.ws.kernel.launch.internal.FrameworkManager A CWWKE0068I: Java dump created: C:WorkIMD2015build.imagewlpusrserverstestServerjavacore.20150609.145019.6116.0001.txt [6/9/15 14:51:19:943 IST] 00000046 com.ibm.ws.kernel.launch.internal.FrameworkManager A CWWKE0067I: Java dump request received. [6/9/15 14:51:21:083 IST] 00000046 com.ibm.ws.kernel.launch.internal.FrameworkManager A CWWKE0068I: Java dump created: C:WorkIMD2015build.imagewlpusrserverstestServerjavacore.20150609.145119.6116.0002.txt [6/9/15 14:52:19:945 IST] 0000004b com.ibm.ws.kernel.launch.internal.FrameworkManager A CWWKE0067I: Java dump request received. [6/9/15 14:52:20:601 IST] 0000004b com.ibm.ws.kernel.launch.internal.FrameworkManager A CWWKE0068I: Java dump created: C:WorkIMD2015build.imagewlpusrserverstestServerjavacore.20150609.145219.6116.0003.txt
When at least one request is determined to be hung, a set of 3 javacores will be initiated. No further javacores will be triggered by the request timing feature until the first set of javacores completes, and not until the list of hung requests changes from what triggered the previous javacores
Acting on slow request warnings
Garbage collection or momentary bursts of traffic to your site can be responsible for occasional slow requests, so you may not want to investigate a particular slow request warning unless the same events have previously caused other slow requests. When you see the same events repeatedly triggering slow request warnings you will likely want to find and address the cause.
The slow request detection warning message provides valuable info to help you diagnose why the request is still in progress:
- The start of the message tells you the request ID (for example, `AABU1rUO7kq_AAAAAAAAAAA`). This is helpful as any log or trace entries for the same request will be tagged with this same request ID. To see the request ID in the logs and trace you’ll need to be using binary logging, which is able to show log records with their extensions.
For example, `binaryLog view servername –format=advanced`
- The thread ID listed in the message tells you which thread the request is running on (for example, `0000003f`). This is another way to see log or trace entries for the same request as the thread ID is part of each log and trace entry.
- The duration of the request, up until the moment the slow request detection decided to report on it, is listed (for example, `2009.020ms`). This is useful to note as the slow request detection message prints up to 3 times for each slow request — at 1x, 2x, and 3x the `slowRequestThreshold`.
- The stack trace shows you the specific line of code being executed and the call stack of the thread the slow request is running on. For slower requests that may have up to 3 reports you can compare the stack traces from each report to see if the same method is still executing (indicative of a very slow or very frequently called method) or to check if there are any repeating sets of lines in the call stack (indicative of a recursive loop).
- The table of events shows how long each operation (event) within the request has taken to run. The duration of each event is shown in the duration column, and the details of the event are shown in the `Operation` column. A plus sign (+) is shown between the duration and operation columns for events that had not completed at the time the report was generated.
Take a look at the table of events to see where the time has been spent in the request. There are many different scenarios you may encounter:
- You may see a large number of events, such as JDBC events. If you developed the app, you can decide if all of the SQL statements are necessary. If you’re not the developer, you may want to discuss the report with the developer.
- You may see one or more very long JDBC events. In some cases the queries may be able to be optimized, or new indices can be created in the database to reduce the time it takes to perform those queries.
- You may see long events that are not related to JDBC — for example long servlet events. While an exhaustive coverage of how to diagnose performance slowdowns is beyond the scope of this article, if there are no child events that indicate where the time was spent you may want to:
- Use system monitoring tools, such as `vmstat` and `iostat`, to check CPU and I/O levels. If your whole system is out of resources everything will be slow and you will likely be seeing a lot of slow request warnings.
- Look at the stack trace in the slow request warning message to get clues about what is slow. Comparing the stack trace of the 1st, 2nd, and 3rd slow request messages (if there are multiple messages) will also help you see what the cause of the slowness is in cases where that event has not yet finished (as would be indicated by a plus sign).
- Monitor your system to check for connection wait time or a lack of threads in your thread pool
- Use a profiler, such as the *IBM Monitoring and Diagnostic Tools – Health Center* tool (available as a tool add-on to IBM Support Assistant), to see which specific methods the time is being spent in.
Acting on hung request warnings
There are a number of reasons a request may hang:
- Long wait – When one thread (T1) spends a long time executing code while holding a lock (L1), causing all other threads (T2…Tn) that want to access the same code to have to wait. Often T1 is stuck waiting for availability of some resource, such as a connection from a connection pool.
- Deadlock – When a thread (T1) holds a lock (L1), and another thread (T2) holds another lock (L2), and where T1 is waiting for ownership of L2 while T2 is waiting for ownership of L1. This is commonly referred to as a deadlock.
- Loop – When an improperly coded method executes code in a loop that never ends.
- Infinite recursion – When an improperly coded method calls back to itself or to another method that eventually results in reaching the same point in the code, causing infinite recursion.
If you have multiple hung request messages for the same request you can compare the event trees to see if new events are being executed or not. If new events are executing then you can rule out long waits and deadlocks as the problem. In these cases you may be able to tell from the growing list of events what’s causing the request to not reach completion.
If the event trees do not grow between warning messages, you may have a long wait or deadlock. The easiest way to diagnose those kinds of problems is to analyze the javacore files that the request timing feature collects when hangs are detected. The IBM Thread and Monitor Dump Analyzer for Java tool (also available as a tool add-on to IBM Support Assistant) can help with this analysis.
The new request timing feature in WAS Liberty can help you save time detecting and diagnosing the cause of slow and hanging requests. Give it a try and let us know what you think!