Going off the rails on a crazy train

In a dramatic scene from the 2004 movie Spider-Man 2, hero Peter Parker finds himself caught on a runaway train loaded with passengers and hurtling toward the end of its track. Mustering all his spidery super powers, he brings the train to a gut-wrenching halt just before it plunges to certain destruction.

You won’t often have tentacled super-villains trying to push your IBM i system past its limits. But it can sometimes seem as if you do: ad-hoc queries, bad application logic, or that one frustrated user pressing Submit over and over can exhaust your system’s temporary storage and leave it maxed out and unresponsive. As with runaway trains, it is best to stop runaway jobs before they gain a full head of steam. Improved support from IBM for temporary storage controls makes this easier than ever. So, put on your favorite superhero mask, and keep on reading to find out how to increase your own query-taming superpowers.

Enhanced control with new PTFs

For many years, IBM i has provided the ability to prevent individual jobs from allocating excessive temporary storage. Now, with the program temporary fixes (PTFs) for IBM i 7.2 and 7.3 listed below, this control has been extended to the temporary storage used by the SQL Query Engine (SQE).

SQE holds jobs that exceed MAXTMPSTG

IBM i 7.2: MF64627, MF64628, MF64629, MF64630, SI66859, SI66860

IBM i 7.3: MF64447, MF64448, MF64449, MF64450, SI66499, SI66500

If job-specific temporary storage limits are already a part of your system management strategy, pay attention: when you apply these PTFs, you might see the database begin to hold jobs that previously ran smoothly. There are good reasons for this, and you need to understand why and when this happens.

On the other hand, if you’ve never used these limits, take this opportunity to consider whether to establish temporary storage maximums to protect your IBM i system. Become your system’s superhero!

Know and set your limits

SQL temporary storage usage problems can be caused by poorly written ad-hoc queries or they can simply be the effect of big data getting bigger. Regardless of the cause, no one wants to see their system slow to a crawl because one (or several) queries went wild and consumed all available storage. The query engine has always contained some simple safety measures to reserve a small amount (approximately 1%) of system storage. However, these limits are last resort efforts to avoid causing the system to crash, and they are of little use to system administrators and database engineers who want to fine-tune their systems. Effectively avoiding the problem of runaway queries is the purpose of the recent PTFs.

Temporary storage limits can be set for an individual job with the maximum temporary storage (MAXTMPSTG) parameter on the Change Job (CHGJOB) command. To apply a limit to multiple jobs, you can use the Create Class (CRTCLS) or Change Class (CHGCLS) commands to set the MAXTMPSTG parameter for the appropriate class associated with a subsystem description’s routing entry class. When creating or modifying classes within IBM Navigator for i, this is the Maximum temporary storage field. The value (in megabytes) assigned to this parameter will limit the amount of temporary storage a job can use. If the limit is reached, the job is held and a CPI112E message, “Job … held by the system, MAXTMPSTG limit exceeded.” is sent to the QSYSOPR message queue and to the job log. The job remains held until someone intervenes by ending the job or by raising (or removing) the MAXTMPSTG limit for that job and then releasing the job. A value of *NOMAX means that the job’s temporary storage is not limited; these jobs will see no change in behavior.

By putting the SQL engine under the scope of the MAXTMPSTG parameter, you now have effective controls for reining in runaway SQL queries consuming excessive storage. This is great news for managing your system, but there are some differences in how and when SQE decides whether to hold a job.

For the remainder of this article, we’ll refer to the storage used by the query engine to run the query as SQE temporary storage. In terms of global storage buckets, SQE temporary storage may be allocated from *DATABASE Segment Cache, *DATABASE SQE Heap, *DATABASE DSI SQE MIT, and *DATABASE DS SQE LOB buckets. The other allocations, which have always been limited by MAXTMPSTG, will be referred to as non-SQE temporary storage. Now let’s look at how MAXTMPSTG applies uniquely to SQE temporary storage.

Difference 1: Conditional activation

The first difference is that the limit is only triggered for SQE temporary storage once the amount of free space in the system auxiliary storage pool (ASP), also known as SYSBAS, falls below the auxiliary storage lower limit (QSTGLOWLMT) system value. This approach is used by SQE to minimize disruption to existing workloads caused by the behavior change. By default, QSTGLOWLMT is 5%, which is the amount of unused (available) space in SYSBAS. This means that when the system ASP usage exceeds 95% (on a default configuration), the new behavior will engage. To find out how much system storage is in use on your system, run the Work with System Status (WRKSYSSTS) command and note the “% system ASP used” as shown in Figure 1. This value can then be compared to the output of the DSPSYSVAL SYSVAL(QSTGLOWLMT) command.

Figure 1. WRKSYSSTS showing 39% system ASP usage
alt

You can also obtain this information with an SQL query.

Listing 1. Query system storage usage and limits

WITH SYSVAL(LOW_LIMIT) AS (
SELECT CURRENT_NUMERIC_VALUE/10000.0 AS QSTGLOWLMT
  FROM QSYS2.SYSTEM_VALUE_INFO
  WHERE SYSTEM_VALUE_NAME = 'QSTGLOWLMT'
)
SELECT SYSTEM_ASP_USED, 
       DEC((100.00 ‑ LOW_LIMIT),4,2) AS SYSTEM_ASP_LIMIT 
FROM SYSVAL, QSYS2.SYSTEM_STATUS_INFO ;

          

With QSTGLOWLMT set to 5%, the system in Figure 1 is in no danger of encountering this limit. This means that the MAXTMPSTG limit will act as it did before application of the PTFs—only excessive non-SQE storage will trigger the hold. But should the storage used exceed 95%, the new behavior will be activated. At that point, the system will hold every job which has MAXTMPSTG defined and which has combined SQE and non-SQE temporary storage exceeding this limit.

Note that in releases after IBM i 7.3, this behavior will change: the database will not use QSTGLOWLMT as a condition of enforcement. In future releases, if a job has a MAXTMPSTG limit defined, it will be held any time the job’s SQE and non-SQE temporary storage exceeds its limit.

Difference 2: Only active queries are counted

The second difference is that the SQE temporary storage accounting only considers allocations made by the currently active cursor. This is important because it can introduce some surprising results.

Unlike other temporary storage allocated for a job, the storage allocated by the SQL query engine can be reused by other jobs. As the first article in this series explains, the query engine will frequently retain certain objects in memory even after a cursor is closed. These objects can include temporary results and maintained temporary indexes (MTIs). Other jobs are free to access and use these objects for their queries. This is good for performance, but it becomes difficult to accurately assign ownership of the underlying storage.

To accommodate this complexity, the SQL query engine MAXTMPSTG implementation only considers the storage that is newly allocated for the query currently running. If cached objects are reused, these are not included in the MAXTMPSTG accounting. Similarly, once the query ends and its cached objects become available to other jobs, the storage allocated for those objects is no longer counted against the job’s MAXTMPSTG. This is true even though the storage remains allocated.

Let’s look at an example. The MAXTMPSTG limit for a job is set at 1 gigabyte (GB). As the job runs, it consumes 100 megabytes (MB) of non-SQE temporary storage. The job also runs three queries in sequence, each of which requires 400 MB of newly allocated SQE temporary storage. The query engine caches these objects for reuse. As a result, the job has increased system temporary storage usage by 1300 MB, which is greater than the 1 GB MAXTMPSTG limit. We might expect that running the third query would cause the job to be held. However, because only the active cursor’s storage is counted, at any point in the job’s run no more than 100 MB plus 400 MB was counted toward the limit. The calculated temporary storage used was never more than 500 MB, and the MAXSTGLMT was never reached.

On the other hand, if the job runs a fourth query requiring over 900 MB of temporary storage, the 1 GB limit will be exceeded. The job will run the query until the storage limit is met, and then it will be held. The job will remain held until it is ended or until the limit is increased and the job is released. Note that canceling the query will not release the job. The cancel may be requested while the job is held, but the job must be released before the cancel request is processed.

At this point, it might seem that we are not much better off than before we applied the PTFs. After all, couldn’t the job just keep running 400 MB queries and run the system out of storage anyway? It is important to remember that there are other internal limits controlling the caching of temporary query objects. While the details are beyond the scope of this article, it is sufficient to note that the SQL plan cache will not grow unbounded. Eventually some of the allocated objects will be purged to make room for the newer queries. The situation with MTIs is a little different, but a previous article in this series gives guidance on how to monitor these.

In most cases, the queries that plague system administrators and database engineers are the ones that without warning consume multiple gigabytes of storage. It is these monster queries that the enhanced MAXTMPSTG support is intended to prevent.

Putting it to use

Having considered the way that the SQE temporary storage accounting works, we can develop a general strategy for using the new MAXTMPSTG support. If you’ve never used MAXTMPSTG before (and even if you have) you’ll want to think about the jobs you want to limit and how much storage would be considered excessive. Well-behaved applications that use predictable queries might not need any limiting. But applications and interfaces that allow ad-hoc queries are susceptible to runaway allocations and will merit closer evaluation. Consider how much main store memory is allocated to the pool that the jobs are running in. How much storage is available in ‘‘SYSBAS? How many of the susceptible jobs run simultaneously? Remember that the intended purpose of these PTFs is to catch runaway jobs. You don’t want a limit so high that the job (or several jobs combined) can fill up ‘‘SYSBAS without encountering the limit, but neither do you want to set the limit so low that you are constantly having to intervene and release jobs.

If you do stop a runaway query in its tracks, consider it an opportunity to delve deeper. Avoid the temptation to end the job and move on. Instead, find out why the query is consuming so much storage. Is the query trying to do too much? Did the query environment change? Would the right index enable the optimizer to choose a plan that uses fewer query runtime objects? If you need help, see the resources on the right side of this page for more information about improving your SQL performance.

Planning ahead

If you currently use the MAXTMPSTG parameter for any jobs, IBM recommends that you evaluate the state of your system storage before applying the PTFs in Table 1. Take some time to look at the “% system ASP used” and QSTGLOWLMT for each of your systems. (See Listing 1 above.) Gather a list of jobs and classes that have MAXTMPSTG set. If it looks like a system is approaching QSTGLOWLMT, you must be ready to quickly respond to any jobs that are held. Knowing what these jobs are and having a plan in place to address the possibility that they might be held is key to successfully applying these PTFs.

To assist you in your preparations, look at the following code samples. They combine several of the IBM i services from IBM Db2® for i and provide a good start for querying the state of jobs on your system.

With Listing 2, you can find jobs currently running on your system that have MAXTMPSTG set and that have used SQL. The output shows you how much non-SQE temporary storage is currently being used and what is the current MAXTMPSTG setting. It also shows how much “headroom” exists between the job’s currently used non-SQE storage and the storage limit. This is the amount of space that the query engine will be allowed to use. Note that even with the PTFs in Table 1, the TEMP_STORAGE_CONSUMED value only includes non-SQE storage. Even though it counts against MAXTMPSTG, SQE temporary storage is not visible on a per-job basis.

Listing 2. Find jobs running with MAXTMPSTG limits

WITH LIMITED_JOBS (JOB_NAME, TEMP_STORAGE_CONSUMED, MAXTMPSTG) AS (
SELECT AJ.JOB_NAME, TEMPORARY_STORAGE, MAXIMUM_TEMPORARY_STORAGE_ALLOWED 
    FROM TABLE (QSYS2.ACTIVE_JOB_INFO()) AJ 
    INNER JOIN TABLE(QSYS2.JOB_INFO(JOB_STATUS_FILTER=>'ACTIVE', 
                                    JOB_USER_FILTER=>'ALL')) JI
     ON AJ.JOB_NAME = JI.JOB_NAME
     WHERE AJ.JOB_TYPE <> 'SYS'
     AND  MAXIMUM_TEMPORARY_STORAGE_ALLOWED IS NOT NULL) 
SELECT JOB_NAME, 
       TEMP_STORAGE_CONSUMED, 
       MAXTMPSTG, 
       MAXTMPSTG ‑ TEMP_STORAGE_CONSUMED AS HEADROOM, 
       B.* 
FROM LIMITED_JOBS, TABLE(QSYS2.GET_JOB_INFO(JOB_NAME)) B
WHERE V_SQL_STATEMENT_TEXT IS NOT NULL
ORDER BY HEADROOM ASC;

After you’ve applied the PTFs, you might find it helpful to monitor your system to determine whether any jobs have been held due to MAXTMPSTG. That is the purpose of the procedure given in Listing 3. This procedure returns a list of all jobs that are currently held and have a CPI112E message at the end of their job logs. The CPI112E message indicates that the cause of the hold was MAXTMPSTG.

Listing 3. Find jobs that are currently held due to MAXTMPSTG

SELECT AJ.JOB_NAME,‑‑ THE NAME OF THE JOB THAT IS HELD
   AJ.TEMPORARY_STORAGE, ‑‑ THE NON‑SQE TEMPORARY STORAGE THAT IS ALLOCATED TO THE JOB
   JI.MAXIMUM_TEMPORARY_STORAGE_ALLOWED, ‑‑ THE CURRENT VALUE OF MAXTMPSTG FOR THE JOB
   (SELECT INFO.V_SQL_STATEMENT_TEXT ‑‑ THE CURRENLY RUNNING SQL STATEMENT. 
    FROM TABLE(QSYS2.GET_JOB_INFO(AJ.JOB_NAME)) AS INFO
 WHERE INFO.V_SQL_STMT_STATUS='ACTIVE') AS ACTIVE_SQL_STATEMENT
FROM 
   TABLE(QSYS2.ACTIVE_JOB_INFO()) AJ 
 INNER JOIN 
   TABLE(QSYS2.JOB_INFO(JOB_STATUS_FILTER=>'ACTIVE', JOB_USER_FILTER=>'ALL')) JI 
 ON AJ.JOB_NAME=JI.JOB_NAME 
WHERE AJ.JOB_STATUS = 'HLD' ‑‑ ONLY SELECT JOBS THAT ARE HELD
 AND AJ. JOB_TYPE <> 'SYS' ‑‑ AND ARE NOT SYSTEM JOBS
 AND EXISTS (‑‑ AND THAT HAVE A CPI112E (MAXTMPSTG EXCEEDED) WITHIN THE
 SELECT '*' ‑‑ LAST 5 MESSAGES IN THE JOB LOG.
 FROM TABLE(QSYS2.JOBLOG_INFO(AJ.JOB_NAME)) AS JLI1 
 WHERE MESSAGE_ID = 'CPI112E'
 AND ORDINAL_POSITION + 5 >  
 (SELECT MAX (ORDINAL_POSITION) 
 FROM TABLE(QSYS2.JOBLOG_INFO(AJ.JOB_NAME)) AS JLI2) 
);
 

Or you may want to get a historical perspective. Listing 4 shows how to query QSYSOPR to find out all the jobs that have been held today because they reached the MAXTMPSTG limit. Note that both Listing 3 and Listing 4 return all jobs held for excessive temporary storage; they do not distinguish those stopped due to the earlier MAXTMPSTG behavior and those stopped because of the new SQE temporary storage protection.

Listing 4. Find jobs held today due to MAXTMPSTG protection

SELECT SUBSTR(MESSAGE_TEXT, 5, 23) AS HELD_JOB,
        A.* FROM QSYS2.MESSAGE_QUEUE_INFO A
      WHERE MESSAGE_QUEUE_NAME = 'QSYSOPR' AND 
            MESSAGE_QUEUE_LIBRARY = 'QSYS' AND
            MESSAGE_ID = 'CPI112E' AND
            DATE(MESSAGE_TIMESTAMP) = CURRENT DATE
      ORDER BY MESSAGE_TIMESTAMP DESC;

As you evaluate your existing configuration, remember that in releases after IBM i 7.3, the database will not use QSTGLOWLMT as a condition for enforcing MAXTMPSTG. If you currently use temporary storage limits, now is the time to start evaluating whether those limits need to be raised to accommodate SQE temporary storage allocations.

Full steam ahead

Quick thinking, web-shooters, and super-human strength enabled Spider-Man to save a train full of terrified passengers. Your “super-powers” may be less glamorous but they are no less essential for keeping your systems running well. With the recent enhancements from the Db2 for i team and the information in this series of articles, you have more tools to diagnose, solve, and—most importantly—prevent serious systems emergencies.