Join us for Code @ Think 2019 | San Francisco | February 12 – 15 Register now Limited availability
By Bill Wilkins | Published March 11, 2004 - Updated March 10, 2004
The insertion of rows is one of the most common and important tasks you will perform when using a IBM® DB2® Universal Database (UDB). This article is a compilation of techniques for optimizing the performance of inserts, particularly high volume inserts. As in most any performance discussion, there are tradeoffs. We’ll discuss the tradeoffs that optimizing inserts can introduce. For example, a technique you use for inserts may require additional processing following the inserts, or may impact query performance. I will provide some performance measurement results to give you an idea of the significance of many of the optimization techniques. Appendix A contains a numbered summary of the results, and throughout the article I’ll refer to the results by their test numbers. In the Conclusion section below is a summary of the most beneficial techniques, and Appendix B lists all of the suggestions made in the paper. Although we won’t be examining complete details on how to implement the techniques, this information is available in the DB2 manuals unless otherwise indicated. Please go to the References for more details.
Let’s start by taking a simplified look at the processing steps for an insert of a single row. Each of these steps has potential for optimization, which we’ll discuss later in sequence.
There are also numerous types of additional processing that may take place, depending on the database configuration, for example, the existence of indexes or triggers. This extra processing can be highly significant to performance and will be discussed later.
Before we discuss optimizing inserts in detail, let’s consider the alternatives to inserts: load and import. The import utility is essentially a front end for SQL INSERTs, but some of its capabilities may be useful to you. Load also has useful extra features, but the main reason for using load instead of inserts would be improved performance. We won’t discuss import further here, but see the References section below for a pointer to the Data Movement Utilities Guide discussion of how import and load differ.
Load formats data pages directly, while avoiding most of the overhead of individual row processing that inserts incur (for example, logging is virtually eliminated). Also, load can better exploit parallelism on multiprocessor machines. Two new capabilities in V8 load are especially interesting in terms of being alternatives to insert: load from a cursor, and load from call level interface (CLI) applications.
This approach can be used in application programs (via the db2Load API), or DB2 scripts. As an example of the latter: declare staffcursor cursor for select from staff; load from staffcursor of cursor insert into myschema.new_staff;`declare staffcursor cursor for select from staff; load from staffcursor of cursor insert into myschema.new_staff;<br/> Those two lines could replace: <br/>insert into myschema.new_staff select * from staff` Tests 6 and 9 in Appendix A showed that loading from a cursor provided almost 20% improvement over the equivalent INSERT … SELECT statement.
<br/> Those two lines could replace: <br/>
This approach is obviously limited to call level interface (CLI) programs, but it’s very fast. The technique is very similar to array inserts ( discussed later), and a sample is provided with DB2; see sqllib/samples/cli/tbload.c . Looking at test 79 in Appendix A, we see that using load can be over twice as fast as fairly optimized array inserts (test 71), and almost ten times faster than less optimized array inserts (such as test 69).
Let’s look at the essential steps of insert processing and the techniques you can use to optimize them.
Being an SQL statement, an INSERT statement must be compiled by DB2 before it’s executed. This can take place automatically (e.g., in CLP, or a CLI SQLExecDirect call), or explicitly (e.g., through an SQL Prepare statement, CLI SQLPrepare, or JDBC prepareStatement). This compilation process involves authorization checking, optimization, and other activities necessary to convert the statement into an executable form. When the statement is compiled, its access plan is stored in the package cache.
If the same INSERT statement is executed repeatedly, its access plan will (usually) be found in the package cache and the compilation cost is avoided. However, if the insert contains literal values that differ for each row, each statement is treated as unique and must be separately compiled. It is highly desirable to replace a repeated statement such as: insert into mytable values (1, ‘abc’) insert into mytable values (2, ‘def’)insert into mytable values (1, 'abc') insert into mytable values (2, 'def') etc. with one that has parameter markers and is prepared once and repeatedly executed: insert into mytable values (?, ?)
insert into mytable values (1, 'abc') insert into mytable values (2, 'def')
insert into mytable values (?, ?)
Comparing test 1 with test 2, and 61-64 with 65-68, we see that the use of parameter markers can make a series of inserts run several times faster. (Using host variables in a static SQL program would have a similar benefit.)
There are several different optimization techniques that fall into this category. The most important of them is to include more than one row per insert, which avoids costly per-row client-server communication, as well as DB2 overhead. The techniques available for doing multi-row inserts are:
If it’s not possible for you to pass multiple rows in one insert, the next best thing is to combine multiple inserts into a group that all get passed together from the client to the server. (However, this implies that each insert will contain literal values and will thus be prepared, so performance will be substantially worse than if parameter markers were used, as discussed above in “Statement preparation”.) Combining statements into one is done through Compound SQL:
Here are some other suggestions on this topic:
DB2 uses one of three algorithms for deciding where to insert a row. (If Multi-dimensional Clustering (MDC) is in use, it’s a different story, but we won’t cover that here.) For details on the insert algorithms, see the referenced location in the DB2 V8 Administration Guide: Performance.
The default mode is for DB2 to search the Free Space Control Records (FSCR’s) interspersed in a table’s pages to find a page that has enough free space to hold the new row. Obviously, if there is relatively little free space on each page, much of the search time will be wasted. To counter this, DB2 provides the DB2MAXFSCRSEARCH registry variable to allow you to limit the search to fewer than the default of 5 pages.
The second algorithm is used when the table is placed in APPEND mode through ALTER TABLE. This avoids the FSCR search entirely, as rows are simply placed immediately at the end of the table.
The final algorithm is used when the table has a clustering index; in which case DB2 tries to insert each row on a page with similar key values (according to the clustering index). If there’s no room there, it tries nearby pages, and if there’s no room there either, it does an FSCR search.
If optimizing insert time is the only consideration, using APPEND mode is the fastest approach for bulk inserts, but its effect is not nearly as profound as that of many of the other optimizations discussed here; see tests 22 and 23. The next best approach should be the default algorithm, but changing the value of DB2MAXFSCRSEARCH had negligible impact in the test environment; perhaps in a less I/O constrained environment that change would have been significant.
Having a clustered index has a significantly negative impact on insert performance (tests 32 and 38 showed almost a 20% overhead), which is not surprising given that the purpose of such an index is to improve query (i.e., Select) performance by doing extra work at insert time. If having a clustered index is needed, its effect on inserts can be minimized by ensuring that sufficient free space exists: use ALTER TABLE to increase PCTFREE, then REORG to reserve the free space. Allowing too much free space, however, could cause extra pages to be read for queries and largely defeat the purpose of using a clustering index. Alternatively, dropping a clustered index before bulk inserts, and recreating afterwards, may be the optimal approach (there is no more overhead for creating a clustered index than a regular index, just extra overhead at insert time).
Every insert is performed by storing the new row on a page and eventually writing that page to disk. Once the page is identified, as discussed above, it must be available in the buffer pool before rows can be added to it. For bulk inserts, most pages will be newly assigned to the table, so let’s focus on the handling of new pages.
When a new page is required and the table is in a System Managed Storage (SMS) tablespace, by default each page is allocated individually from the filesystem. However, if the db2empfa command has been run against the database, each SMS tablespace has new pages allocated for it an extent at a time. Tests 11 and 82 show that running db2empfa can make inserts to SMS about twice as fast when the extent size is the default of 32 pages, because in test 82 there are 32 allocations of one page instead of the one allocation of 32 pages in test 11. Tests 11 and 83-85 show that extent sizes smaller than 32 make performance progressively worse because of the extra allocations, but suggest that extent sizes larger than 32 are unlikely to help much. The suggestion is to run db2empfa and use an extent size of 32.
With a Database Managed Storage (DMS) tablespace, space is pre-allocated when the tablespace is created, but extents of pages are assigned to tables during insert processing. The pre-allocation of space through DMS provided about a 20% benefit vs. SMS — see tests 11 and 81. Test 81 used a DMS file tablespace, and a small additional benefit would probably have occurred if a DMS raw tablespace had been used. There was no apparent effect from varying extent size with DMS.
If there are indexes on the table, an entry is added to each index for each inserted row. This requires the appropriate index pages to be in the buffer pool. Later on we’ll discuss index maintenance, but for now just keep in mind that the buffer pool and I/O considerations for inserts apply in a similar manner to index pages as they do data pages.
As an insert progresses, more and more pages will be filled with inserted rows, but DB2 does not require any of the new or updated data or index pages to be written to disk in order for either the insert itself or the subsequent Commit to complete. (That’s because of DB2’s writeahead logging algorithm. There’s an exception that is covered below in the section on Logging.) However, the pages will need to be written to disk at some point, which could be as late as during database shutdown.
In general for bulk inserts, you will want asynchronous page cleaning to take place quite aggressively so that there are always free slots in the buffer pool for new pages. The rate of page cleaning, or the total absence of it, can cause large differences in timings which can make performance comparisons misleading. For example, with a buffer pool of 100,000 pages and no page cleaning, a bulk insert may complete without any of the new or changed (“dirty”) pages being written to disk, but subsequent operations (such as selects or even a database shutdown) may be greatly delayed while up to 100,000 dirty pages from the inserts are written to disk. On the other hand, if aggressive page cleaning is done in the same situation, the bulk insert may take longer but afterwards there will be fewer dirty pages in the buffer pool, resulting in better performance for subsequent tasks. It may not always be clear which of those results is truly better, but for very large bulk inserts it is usually impossible to store all of the dirty pages in the buffer pool, so effective page cleaning is a necessity for optimal performance.
For the best possible page cleaning:
As far as I/O itself is concerned, it can be minimized when there are indexes that need to be built, by using as large a buffer pool as possible (see the section ” Index maintenance” below.) If there are no indexes, a large buffer pool might not help much other than in delaying I/O; that is, it may allow all new pages to fit in the buffer pool, but eventually those pages will still need to be written to disk.
When I/O for writing pages does occur, it can be made faster by normal I/O tuning steps, such as:
By default, each inserted row has an X lock held on it from the time the row is created until the insert is committed. There are two performance issues relating to inserts and locking:
For a well optimized bulk insert, the CPU cost of obtaining and later freeing an X lock on every row can be significant. The only alternative to a lock on every new row is a table lock (since there are no page locks in DB2). Tests 11 and 101 showed a 3% reduction in elapsed time when table locking was used. There are three ways to cause a table lock, which we’ll now look at, deferring discussion of the disadavantages for a minute:
The disadvantage of all of these, of course, is the concurrency impact: if an X lock exists on the table, no other application can access it unless that application uses isolation level UR (uncommitted read). If you know that exclusive access will not cause problems, by all means use table locking. However, even if you stick with row locking, keep in mind that during a bulk insert you may have thousands of new rows with X locks in the table, so there may be conflicts with any other applications using the table. These conflicts can be minimized in several ways:
By default, each insert is logged for recovery purposes. Log records are written first to the log buffer in memory, and then to the log file, most commonly when the log buffer is full or a commit occurs. Optimizing the logging of bulk inserts is a matter of minimizing the number of log writes and making the writes as fast as possible.
The first thing to consider here is the log buffer size, controlled by the database configuration parameter LOGBUFSZ. It has a default value of 8 pages, or 32K, which is smaller than ideal for most bulk inserts. Let’s take an example. For a bulk insert with 200 bytes logged per row, the log buffer will fill after about 160 rows have been inserted. If there are 1000 rows being inserted, there will be about six log writes due to the log buffer being filled, plus one for the commit. By raising the value of LOGBUFSZ to 64 (256K) or more, the log buffer will not fill, and there will only be one log write (at commit) for the bulk insert. Tests 104 and 105 showed about a 13% improvement from using a larger LOGBUFSZ. The downside of a larger log buffer is that crash recovery can take slightly longer.
Another possibility for minimizing log writes is to use “ALTER TABLE ACTIVATE NOT LOGGED INITIALLY” (NLI) for the table being inserted into. When that’s done, none of the inserts will be logged for the duration of that same unit of work, but there are two important issues related to NLI:
As far as speeding up log writes is concerned, some possibilities are:
Commit forces log records to disk, thus ensuring that the inserts to that point are guaranteed to be in the database, and frees the locks on the new rows. Those are valuable activities, but since a Commit always involves a synchronous I/O (to the log) and insert doesn’t, the overhead of a Commit can easily be higher than that of an insert. Therefore, committing after every row in a bulk insert environment is very bad for performance, so make sure you don’t use autocommit (which is the default for CLI and the CLP). Committing after every 1000 rows or so is recommended: tests 61-78 show up to about a ten times performance improvement when committing every 1000 rows instead of every one or two. Committing more than 1000 rows at a time will provide only small savings and increase recovery time if a failure occurs.
A slight amendment to the above: If the MINCOMMIT database configuration parameter has a value higher than 1 (the default), DB2 does not necessarily do a synchronous I/O for each commit, but waits and tries to share log I/O among a group of transactions. This is beneficial for some environments, but in general will have a neutral or negative effect on bulk inserts, so MINCOMMIT should be kept at 1 if bulk inserts are the key task being performed.
There are several types of processing that will occur automatically for an insert, when they apply. If your main objective is simply to minimize insert time, the simplest approach is to avoid all of these overheads, but that may not be desirable in the overall context. Let’s discuss these in turn.
For each row inserted, an entry must be added to each index (including any primary key index) on the table. There are two main costs for this process:
The worst-case scenario is one where there is significant random I/O during index maintenance. Suppose there are 10,000 rows being inserted, 50,000 leaf pages in the index, and 5000 pages in the index’s buffer pool, and the key values for the rows being inserted are randomly spread over the entire key range. As many as 10,000 leaf pages (and possibly some non-leaf pages), will need to be brought into the buffer pool to be searched and/or updated, and there’s only a 10% chance a given leaf page will already be in the buffer pool. The high probability of a disk read being necessary for each insert makes this scenario prone to poor performance.
On a per-row basis, it is much more expensive to add rows to an existing index than it is to create a new index. If you are inserting to an empty table, you should always create indexes after doing volume inserts. (Note that when you use load, indexes should be created beforehand). If you are inserting to a populated table, dropping indexes before doing volume inserts, and recreating the indexes afterwards, can be the fastest approach, but only if a relatively large number of rows is being inserted — probably more than about 10-20% of the table size. You can help speed up index maintenance by using a large buffer pool for the index tablespace, and, if possible, ordering the inserts so that the key values are ordered, not random.
Tests 31-37 in Appendix A give some results for how insert time can vary with the number of indexes and when they’re created. In a nutshell, inserts can take multiple times longer with four indexes than no indexes, and deferring index creation until after inserts can cut the total time (for inserts plus index creation) by from one-quarter to almost one-half.
If maximizing insert performance is the key objective, increasing PCTFREE for the index may reduce the number of page splits that occur during the random insertion of index entries. This must be done carefully, however, because too much free space will mean an abundance of index pages, which could negatively impact query performance and even insert processsing itself.
This category of overhead includes CHECK constraint verification and foreign key constraint (referential integrity (RI)) verification. CHECK constraints have quite low overhead (see tests 11-13 in Appendix A), largely because no I/O is needed for each row (since the value(s) to be checked are within the row, although there may be some computation to be done).
Inserting a row when there are foreign keys is a different matter. For each foreign key, a lookup must be done in the parent table(s) to ensure that a parent row exists. Although this lookup is done via the primary key index, it does take CPU cycles to do the search, and possibly I/O to read index page(s) into the buffer pool. Tests 11, 14, and 15 show that having two foreign keys can more than double the insert time.
In the same way that using CREATE INDEX after a table is populated is cheaper than building an index via individual INSERT statements, using an ALTER TABLE to create a foreign key (i.e., do the verification) as a bulk operation is cheaper than the total incremental cost of the verification during each insert. Comparing tests 14 and 16, and 15 and 17, we see that creating the foreign key(s) after inserts can reduce the total elapsed time by roughly 40%.
If possible, before doing mass inserts into a table you should remove all constraints on it using ALTER TABLE, and recreate them afterwards (again using ALTER TABLE), unless you will be inserting fewer than about 10-20% of the rows in the table.
Some applications do their own checking to ensure that relationships between tables are valid; that is, they might read a parent row to make sure it exists, before inserting a child row. If such checking is done correctly, having foreign key constraints defined in the database adds extra overhead. However, there are at least three reasons why defining the constraints in the database is better:
If application checking can’t be removed, the best alternative is define the foreign keys in DB2, but with the NOT ENFORCED clause in CREATE TABLE or ALTER TABLE, so that the checking overhead is avoided but the optimizer can use the relationship knowledge.
If one or more insert triggers are defined on a table, each insert will result in the execution of the actions prescribed in the trigger definition. Since those triggered actions are typically one or more INSERT, UPDATE, or DELETE statements, the overhead of triggers can be substantial in a mass insert operation. As shown by the results of tests 18-21 in Appendix A, adding triggers can cause insert performance to be made several times slower. Unlike the temporary removal of indexes and constraints, however, application knowledge is required to know when it is acceptable to avoid trigger execution; that is, when it will not cause data integrity issues. Where it is acceptable, you may wish to temporarily disable triggers using one of the techniques described in the article listed in the References section.
These are two approaches to having DB2 automatically generate integer column values, typically during inserts. The main performance issue to be aware of for identity and sequence is that, for recoverability reasons, generated values must be logged. To reduce the log overhead, the values can be pre-reserved (cached), and one log record is written each time the cache is exhausted. The default is to cache 20 value.
As the results of tests 41-51 in Appendix A show, not having a cache is extremely costly (almost nine times slower than the default), while using a cache size larger than the default can cut the time by more than half and reduce the overhead of using Identity or Sequence to less than 20%. If you have a choice to make between using Identity or Sequence, Identity performs up to a few percent better.
When inserting rows into a table with one or more columns defined with a “generated as” clause, such as:CREATE TABLE t1 (c1 CHAR(5), c2 CHAR(5) generated always as (UPPER(c1))) additional overhead results from the function call(s) to establish the generated value(s). However, the overhead is very small and should not deter you from using this functionality.
CREATE TABLE t1 (c1 CHAR(5), c2 CHAR(5) generated always as (UPPER(c1)))
An MQT can be used to enhance query performance through the pre-computation of aggregate values. For example: create table staffsum as ( select count(salary) as salcount, sum(salary) as salsum, dept from staff group by dept ) data initially deferred refresh immediatecreate table staffsum as ( select count(salary) as salcount, sum(salary) as salsum, dept from staff group by dept ) data initially deferred refresh immediate If the MQT is defined as “refresh immediate”, the aggregations in the MQT will be re-computed at the time of each insert, so in general it is not a good idea to do volume inserts against a table with refresh immediate MQT’s. However, DB2 will optimize the re-computation as best it can, such as by scanning the summary table instead of the entire base table. We suggest running Explain against your inserts that involve MQT’s, to be aware of what’s happening behind the scenes.
create table staffsum as ( select count(salary) as salcount, sum(salary) as salsum, dept from staff group by dept ) data initially deferred refresh immediate
When inserting a row into a partitioned table (with the DB2 V8 Data Partitioning Feature (DPF)), the partitioning key value for the row is hashed to determine the partition that the row must be inserted into, and then DB2 sends the row to that partition. This one-row-at-a-time processing is slow, as we’ve seen before for CLI when compared with array inserts, and is made worse by the need to send each row from the coordinator partition to the target partition.
To minimize the overhead you can use buffered inserts. When this approach is in use, hashed rows destined for a given partition are placed in a buffer and sent to that partition as a group instead of one at a time. You can cause buffered inserts to be done by using the “INSERT BUF” option of the Prep or Bind command. For details on buffered inserts, see the reference given for the Application Development Guide: Programming Client Applications. There is also a Developer Domain article on buffered inserts with Java (SQLJ only); see the References below.
In cases where the absolute maximum performance is required for repeated bulk inserts in a DPF environment, you may want to explore two related DB2 API’s. The first, sqlugtpi, lets your application get the partitioning information for a table. That information can then be used with the sqlugrpn API to find which partition number a row belongs to. You would use these API’s to group all data for a given partition together, then connect to that partition so that no transfer of data between partitions needs to be done, and repeat this for each partition. This approach can achieve very fast performance but does require some effort in ensuring that it works perfectly despite multiple data types, code pages, and so on.
Another problem that can arise with DPF inserts is if there are foreign key constraints on the table being inserted into (the child table). Suppose the parent and child tables have different partitioning keys. Then each child row’s parent will in general be on a different partition, so for most inserted rows, the parent verification check will need to cross from the child partition to the parent partition. Solutions for this include giving the parent and child tables the same partitioning key (which may not be the best choice for query performance), or, if multiple logical partitions are in use, settting the DB2 registry variable DB2_FORCE_FCM_BP to YES.
The use of a staging table can provide improved performance for some insert scenarios. The typical usage is that instead of doing bulk inserts into a table, the LOAD command is used to load rows into a staging table; from there, INSERT … SELECT can be used to insert rows into the main table. Both LOAD and INSERT … SELECT are much faster than regular inserts, and even the two steps together are often faster than regular inserts. However, for performance alone, in V8 it’s usually better to use load directly into the main table, because load doesn’t have the concurrency restrictions it had in V7.
Aside from the massaging of data, the main reason you might still want to use a staging table in V8 is to break up the bulk insert into smaller pieces that can be run in parallel. On a system with multiple processors, each insert will run within a single DB2 agent and will not exploit more than one processor, even if the DBM INTRA_PARALLEL parameter is set to ON. For example, if you are inserting 1M rows on an 8-way machine, normal insert processing will typically not utilize much more than about 12% (100 / 8) of the CPU. (Load, on the other hand, will automatically use a much higher percentage of the CPU, which is another reason it is preferred vs. inserts.) Instead, you could load the 1M rows into a staging table, then run eight concurrent insert … Select statements, with predicates in the eight Selects chosen to each retrieve a unique subset of about 1/8 of the rows in the staging table.
Finally, you could run concurrent inserts through a multithreaded application, with each thread doing its own inserts.
For an INSERT … SELECT, anything that makes the select faster may reduce the elapsed time of the whole statement, but it’s beyond the scope of this document to go into much detail. Some possibilities are as follows. Note that these apply to the select portion only; the same factors would in general not help the insert portion.
These types of columns are unique in that they are not cached in the buffer pool. Consequently, any insert that includes one or more of these columns results in them being written out to disk immediately, which in DB2 terminology is a “direct write”. As you can imagine, this makes LOB/LONG inserts much slower than “normal” inserts: test 91, using a CLOB column, was over 9 times slower than the baseline test (11, with a CHAR column). Some optimization possibilities are:
For a simple insert with no constraints, a test of changing the optimization class from the default (5) to 1 did not result in a significant change in performance, despite the fact that this changed the optimizer’s algorithm to a cheaper one. There may be a benefit in using a lower optimization class when inserts are prepared frequently, involve constraints, or have selects. Conversely, a small percentage of inserts that have a complex select in them will benefit from raising the optimization class from 5 to a higher value.
A fairly common database task is updating a target table from a source table. A specific case is to take each row in the source table and if it does not already exist in the target table, insert it into the target; otherwise update the target row. In V8 you can use the MERGE statement to do this in one statement rather than many executions of separate statements, and performance should be better.
As you attempt to monitor and tune inserts, the basic task is no different from that of most other performance analysis: identify what the bottleneck is and deal directly with it. Determining the bottleneck is first a matter of using operating system tools to look at CPU, I/O, memory, and network consumption. This should let you rule out certain areas and focus on one or two. It’s beyond the scope of this paper to discuss operating system facilities further.
Try not to get sidetracked on extraneous issues. For example, if CPU utilization is at 100%, making a change to reduce I/O is probably not going to improve performance, and it may discourage you from making the same change later when it could be very useful, after the CPU bottleneck has been removed.
You should try to have the application report the rate of inserts at various points during the operation. For example, it is useful to know if the number of inserts per second is the same at ten minutes into the run as it was at five minutes in. Often there is a brief initial slow period as the insert begins, then a fast period while the buffer pool fills up and no data page I/O is done. Then, when data pages start being written out, the rate can slow down somewhat, especially if page cleaning or the I/O subsystem is not optimal.
For very large bulk inserts there is typically a point at which a steady rate of inserts is reached. If this is not the case, it is typically because indexes are being built during the inserts, causing more and more index pages to be created over time and possibly requiring random I/O to read existing index pages for them to be updated. If this is the case, a larger buffer pool is the best solution, but adding more free space to index pages may also help.
Now let’s look at the key DB2 facilities to help you monitor and tune inserts: Snapshots, Event Monitoring, and Explain. For more information on snapshots and event monitoring, see the System Monitor Guide and Reference; for Explain, see the Administration Guide: Performance.
Snapshot monitoring provides numerous pieces of information describing what’s happening during insert processing. You can use the following steps to obtain all of the available information (or you can choose to get subsets of the information):
Most of the information relevant to inserts can be found in the database snapshot, and most of that information is also available with more granularity in the appropriate bufferpool, application, and tablespace snapshots.
The following are the most pertinent lines in the database snapshot:
Buffer pool data writes = 500
Asynchronous pool data page writes = 500
Buffer pool index writes = 0
Asynchronous pool index page writes = 0
Total buffer pool write time (ms) = 25000
Total elapsed asynchronous write time = 25000
LSN Gap cleaner triggers = 21
Dirty page steal cleaner triggers = 0
Dirty page threshold cleaner triggers = 0
Update/Insert/Delete statements executed = 100000
Rows inserted = 100000
Take a look at “Rows inserted” in consecutive snapshots to see if the rate of inserts varies during the course of a bulk insert. Most of the other values reflect the amount of I/O and the effectiveness of page cleaning. Regarding the latter, ideally you’ll see that all of the data writes are asynchronous, and all of the buffer pool write time is asynchronous (as is the case in the above output); if not, try lowering CHNGPGS_THRESH and/or increasing NUM_IOCLEANERS.
Additional information on the insert can usually be found in the dynamic SQL snapshot. It can be very useful to look at the total elapsed time and compare that with the user and system CPU times. Most of the difference between the elapsed and CPU times should taken up by I/O, so it may be clear where the bulk of the time is being spent and where to put your tuning effort.
Here’s a subset of a dynamic SQL snapshot entry for a CLI array insert of 100,000 rows. Note that the “Number of executions” is one per row, even though the application sent only 1/10 of that number of arrays.
Number of executions = 100000
Number of compilations = 1
Rows written = 100000
Buffer pool data logical reads = 102120
Total execution time (sec.ms) = 13.830543
Total user cpu time (sec.ms) = 10.290000
Total system cpu time (sec.ms) = 0.130000
Statement text = INSERT into test1 values(?, ?, ?, ?, ?)
Some additional information to look at in snapshot output:
DB2 event monitor lets you obtain performance information on events as they occur on the server, such as statement or transaction completion. For the purpose of insert performance analysis, you may want to create an event monitor for statements and activate it during a period of insert executions. Although it may be overkill, the event monitor output will show the elapsed time of each insert statement. For OLTP-like applications, the overhead of running event monitoring for statements is quite high, and the output voluminous, so be careful not to have the monitor running for too long; even a few seconds can produce megabytes of output. You could write the monitor information to a table to facilitate analysis of the results, such as the performance trend over time.
The following is a condensed sample statement event from a series of 10,000 CLI array inserts of 10 rows each. There is one event per array, in this case per 10 rows.
17) Statement Event ...
Appl Handle: 9
Appl Id: *LOCAL.wilkins.0953D9033443
Type : Dynamic
Section : 4
Creator : NULLID
Package : SYSSH200
Text : INSERT into test1 values(?, ?)
Start Time: 01‑28‑2004 22:34:43.918444
Stop Time: 01‑28‑2004 22:34:43.919763
Exec Time: 0.001319 seconds
Number of Agents created: 1
User CPU: 0.000000 seconds
System CPU: 0.000000 seconds
Fetch Count: 0
Rows read: 0
Rows written: 10
Internal rows deleted: 0
Internal rows updated: 0
Internal rows inserted: 0
Bufferpool data logical reads: 10
If the performance of your inserts is not as good as expected, it may be because there is “hidden” processing taking place. As discussed earlier, this processing could take such forms as index maintenance, constraint verification, or trigger execution. Running a form of Explain (such as Visual Explain, or the Explain statement plus db2exfmt) against your insert will reveal most of the extra processing (but not index maintenance). You may be able to eliminate the extra processing if it seems to be the cause of the performance problem.
As a simple example, the following graph (produced by db2exfmt) shows the access plan for an “Insert into Test1 values (?, ?, …)”. Can you guess what the extra processing is here? The answer is below.
FILTER TABFNC: SYSIBM
( 3) GENROW
( 5) ( 7)
1 116 120
TBSCAN TABLE: WILKINS INDEX: SYSIBM
( 6) TEST1 SQL0401290925513
Before I give the answer: Explain is also very useful for INSERT … SELECT statements. The inserts themselves could be very fast, but there may be an access plan problem for the SELECT which slows down the whole statement. This would be revealed by Explain. Answer: The above access plan is for an insert into the child table of a foreign key relationship. There is a nested loop join (NLJOIN, step 4) of the insert (step 5) to an index scan (IXSCAN, step 7). The latter is actually a primary key lookup against the parent table and constitutes the foreign key constraint verification. In this case there’s relatively little overhead from the extra processing: the cost of the index scan was estimated at 2.60196 timerons (timerons are cost units combining (primarily) CPU and I/O costs), while the insert itself had a cost of 25.5248 timerons. If there were a trigger, for example, you would see that reflected in one or more additional insert, Update, or Delete entries in the access plan.
By the way, when you see “GENROW” in an access plan, it is a “generate row” step. It represents the creation of temporary row(s) for use in subsequent steps, and is not something you need to worry about.
In this paper we have presented numerous ways to improve insert performance. See Appendix B for a complete list of them. The following is the most important subset, and each of these can in some cases make performance over twice as fast:
We hope that this article has given you a good idea of what happens during DB2 insert processing, as well as how to monitor and improve its performance.
This appendix illustrates the benefit of some of the optimization techniques discussed in this report (and the impact of non-optimal approaches). Each test was run at least twice and the results were consistent, but the results are not necessarily closely representative of those you would see in your environment. In particular, the results were obtained on a system that was more heavily I/O constrained than ideal, due to the user tablespace and log being in the same filesystem and the same two disks. As a result, improvements that reduce CPU overhead would typically have more benefit than reported here.
The following apply to all of these tests, except if noted otherwise:
There was a Commit for every 1000 rows.
Tests 1-3 had autocommit on and thus had a Commit for each insert. That is, there was a Commit for each row in tests 1 and 2, and for every 10 rows in test 3. For the performance impact of Commit, see tests 61-78.
Tests 6-9 timed the population of one table from another, including the time to Commit. Test 7 show that using NOT LOGGED INITIALLY (NLI) actually caused a slowdown in performance, because of the new pages needing to be written to disk at Commit time: the Commit took over half of the time. However, with the more aggressive page cleaning in test 8, performance was improved substantially, mostly through the Commit time being reduced by over five seconds. Test 9 showed a nice 17% improvement from using Load instead of insert … Select, without the risk of NLI as in test 6.
Clearly the use of check constraints has a minor effect on performance, but foreign keys and triggers can be extremely significant. In tests 18 and 19, each trigger inserted one row into a separate table in the same tablespace as the base table, and this exposed an I/O bottleneck on the database and log disks. In tests 20 and 21, each trigger incremented an “insert count” column in a one-row table; although there were no additional rows written out as there were in tests 18 and 19, the update overhead and logging still caused the inserts to run significantly slower than without any triggers. Although standard database I/O tuning would have improved tests 18-21, the point is that having constraints in place during mass inserts is something to be avoided if possible.
In tests 32-35 the indexes were created before the inserts, while in tests 36-37 the indexes were created after the inserts (and for the latter tests the elapsed times cover the inserts plus the CREATE INDEX statements).
Tests 41-51 show that using Identity or Sequence has a profoundly negative effect on performance if a small cache size, or no cache, is used, but using a large cache can make the overhead almost negligible.
The above tests show the great benefit of using large arrays and committing a large number of rows at a time, with the latter being more important.
The above tests show that for SMS, running db2empfa is critical to good insert performance, and that using a smaller extentsize than 32 pages is detrimental. DMS is somewhat preferable to SMS.
The above comparison shows the extreme impact of having a CLOB column, even a very short one. Variations of test 91 were tried, but the results were very similar, regardless of the choice of logged or not logged, or compact or not compact, for the CLOB column.
The above comparison shows that causing table locks to be used instead of row locks saved about 3% in elapsed time.
Test 105 showed about a 13% improvement from raising LOGBUFSZ high enough such that there were no log writes forced by having the log buffer fill up during the Inserts.
Test 111 used all of the optimizations tried above as alternatives to the baseline: DMS instead of SMS, arrays of 1000 rows (vs. 10), committing every 10000 rows (vs. 1000), LOCKSIZE TABLE, APPEND ON, and LOGBUFSZ 256. Note that test 22 was already somewhat optimized as compared to the original baseline (test 1). The time for test 1 has been extrapolated here to 1,000,000 rows to show the vast difference in performance that is possible with only a few factors being changed.
Here are the suggestions discussed in more detail earlier in this paper. Keep in mind that these have varying degrees of impact and some will have a negative impact on tasks other than Inserts.
Back to top