Perhaps regarded as one of the most troublesome business issues, database corruption often takes place subtly and affects businesses adversely. In simple words, corruption can be defined as any unintentional entry in database. Corruption issues may result in the severe performance hit of the system. In some cases, it may result in frequent system crashes and may become a business-critical downtime situation. Database corruption can occur at any layer, from DB2 to the operating system and right down to the hardware layer. Hence, it is important to understand and troubleshoot it by involving all possible affected layers and collecting any diagnostic data that might be required as soon as possible, while the data is available.
In this article, you will understand why a database can go offline when it encounters corruption. You will also learn to analyze symptoms for corruption and differentiate between easy-to-fix and catastrophic failures. This article will shed light on corruption issues when using IBM DB2 and help DB2 users understand and choose the best approach in dealing with such critical and high-impact issues.
This article begins with discussion about possible sources of corruption. It then explains the following tasks:
- Identifying and troubleshooting corruption — Identify and categorize corruption in the database when using DB2, with the help of sample symptom messages that appear in db2diag.log. Corruption issues can be broadly classified in five categories: data page corruption (or table corruption), index corruption, CBIT corruption, log corruption, and packed descriptor corruption.
INSPECTto identify corruption — Get insight into useful DB2 commands,
INSPECT, to check database corruption.
- Approaches to recover from corruption — Once a corruption issue is identified, how to approach these cases, what data to collect, and how to recover from the situation is crucial. Learn possible recovery approaches and how to choose between available options.
- Preventive strategies to avoid possible corruption — Best practices are discussed.
Database corruption can occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Some of the common reasons of corruption:
- Corrupt file system is one the most common reasons for corruption in a database. Abrupt system shutdowns, power surges, file system double-mount, migrating disks, file system-level activities like checking and repairing file system (using utilities like fsck on Linux®) when the database is up and running, and using Ctrl+Alt+Delete while a file is open, viruses can all introduce unintentional changes in database.
- Hardware failure.
- Memory corruption.
- DB2 defect.
- I/O and network issues (issues in fiber adapter, switches, etc.).
- Incorrect application coding.
- Inconsistency in the value of the page in the buffer pool (sqldPage) and the one stored in the file system.
- Overwriting of disk data causes corruption.
- User interference with critical configuration files of database, log files, log control files, etc. can put database in inconsistent state
Having said that corruption can be due to various reasons, finding out exactly what caused data corruption is challenging. In most cases, it is caused by file system issues and hardware problems.
Identify and troubleshoot
For a DBMS, page is the smallest unit of data for memory allocation performed by the operating system for a program and transfer between main memory and any other auxiliary store, such as a hard drive. So when you say something in a database is corrupt, you actually mean some pages in a database are corrupt.
Panic is a method DB2 uses to cause itself to crash if there is an error condition it cannot handle gracefully. When a page corruption is detected by DB2, it halts all processing by means of a controlled crash (panic), since it can’t determine database integrity. This is also to prevent further damage or loss of data.
A number of error messages are dumped in db2diag.log when DB2 encounters corruption in database. When an outage occurs and automatic first occurrence data capture (FODC) is enabled, data is collected based on symptoms. FODC data will be collected automatically on DB2 9.5 when one of the following conditions is met:
- FOCD_Trap when an instance wide trap has occurred.
- FODC_Panic when a DB2 engine detects an incoherence and decides not to continue.
- FODC_BadPage when a bad page has been detected.
- FODC_DBMarkedBad when a database has been marked bad due to an error.
It is crucial to involve OS and hardware support in order to gather required information like OS diagnostics (for example,
fileplace outputs on AIX®), as well as any and all hardware diagnostics (state saves, error logs, etc). It is important to ensure that there is adequate disk space for critical file systems like dump space and log directories to ensure that critical events are captured in entirety.
Looking at the db2diag.log, you can confirm if panic is due to corruption or another reason. Below, you will see how to identify and categorize corruption in DB2. Following are some of the most common db2diag.log error messages that identify corruption.
Data page corruption
Data page corruption indicates corruption in actual data in a table. A number of messages are logged to the db2diag.log when data corruption is detected. These messages are necessary to identify the affected object. Listing 1 shows a sample error message in db2diag.log.
Listing 1. Sample error message for data page corruption
As the above error messages indicate, DB2 encountered a bad page for tablespace ID 9 and table ID 20. The object type filed is marked 0, which indicates data page corruption.
You can query the catalog table to determine which table has corrupt pages:
db2 "select char(tabname,20), char(tabschema,20) from syscat.tables where tableid=20 and tbspaceid=9"
Note that DB2 dumps bad page errors only for those pages it has tried to access. It does not necessarily mean that only those pages are corrupt. You need to explicitly check all pages in your database to find the extent of corruption, with the help of the
Similar error messages are dumped in db2diag.log for corruption in temporary tables. If the data type field in db2diag.log error message holds a value greater than 128, it indicates corruption in the temporary table. If the object type is 3, it indicates LOB data corruption in a table.
An index is a database object that contains an ordered set of pointers that refer to rows in a base table. There are many flavors of index-related corruption problems, including:
a. Unique index duplicates with different RID(s) or the same RID(s) b. Multiple index entries pointing to the same RID c. Out-of-place index key (wrong index key order) d. Row exists, but index keys do not exist in any or some of the indices e. Index entry pointing to an empty data slot or unused data slot or RID is invalid f. Incorrect previous or next index page pointers, incorrect high key or other corruptions on index pages
Following are sample error message for index corruption.
Listing 2. Sample error message for index corruption
As you can see, error messages have object type:1, which indicates index page corruption. The index resides in tablespace ID 9, and the index ID is 11076. This index is on a table with tablespace ID 8, table ID 11076. You can retrieve base table name and index name by querying catalog tables.
The above snippet from db2diag.log indicates a bad page for index. Other common errors in db2diag.log that point to index corruption are SQLI_NOKEY and row not found from index function.
CBITs are a method used by DB2 to verify that a page being read into the buffer pool from disk is not a partial page or has not been changed from some form of corruption.
The basic idea behind CBITs is that a bit from each sector (512 bytes) on a page is set to the same value before writing the page. Before DB2 flushes a page to disk, the checksum is calculated and recorded on the page. When a page is read back in the buffer pool, this checksum is recalculated and checked against the stored value. If some of the bits are different, it indicates a partial page write or disk corruption.
Listing 3. Sample error message for CBIT corruption
2012‑03‑12‑04.45.17.559235‑240 I1104A2616 LEVEL: Severe PID : 2551866 TID : 1 PROC : db2pfchr INSTANCE: inst1 NODE : 000 FUNCTION: DB2 UDB, buffer pool services, sqlbVerifyCBITS, probe:1110 MESSAGE : ZRC=0x86020019=‑2046689255=SQLB_CSUM "Bad Page, Checksum Error" DIA8426C A invalid page checksum was found for page "". DATA #1 : String, 64 bytes Error encountered trying to read a page ‑ information follows : DATA #2 : String, 95 bytes CBIT verification error bitExpected is 0, userByte is 33, sector 7 (from head of page, 0 based) DATA #3 : Page ID, PD_TYPE_SQLB_PAGE_ID, 4 bytes
Note that changes made to the page outside of DB2 (file system failures, disk failures, etc.) will not be noticed if the corruption does not touch the bytes holding the CBITs. CBIT errors ( checksum errors ) are most of the time due to hardware or OS error, either during the initial write or during the read.
The transaction log in DB2 is simply a record of all changes that have taken place in the database. To keep track of changes made by transactions, a method is needed to timestamp changes to data as well as to timestamp log records. In DB2, this timestamp mechanism is performed using a Log Sequence Number (LSN). In case you come across a corrupt log, you may have an error message in db2diag.log similar to.
Listing 4. Sample error message for log corruption
2010‑06‑07‑184.108.40.206998+120 I8673583A553 LEVEL: Severe PID : 2498668 TID : 27358 PROC : db2sysc 56 INSTANCE: inst1 NODE : 056 DB : SAMPLE APPHDL : 998‑22947 APPID: N998.inst1.100607192315 AUTHID : LOADDSF EDUID : 27358 EDUNAME: db2agntp (SAMPLE) 56 FUNCTION: DB2 UDB, data protection services, sqlpgrlg, probe:291 DATA #1 : < reformatted > Error ‑2028994519 when reading LSN 00000B1C261B4FD3 from log file S0119292.LOG tellMe 1 dpsAcbFlags 1000 setSkipOutputBuf 0 2010‑06‑07‑220.127.116.11202+120 I8674137A487 LEVEL: Severe PID : 2498668 TID : 27358 PROC : db2sysc 56 INSTANCE: inst1 NODE : 056 DB : SAMPLE APPHDL : 998‑22947 APPID: N998.inst1.100607192315 AUTHID : LOADDSF EDUID : 27358 EDUNAME: db2agntp (SAMPLE) 56 FUNCTION: DB2 UDB, data protection services, sqlpgrlg, probe:291 DATA #1 : < preformatted > HeadLsn 00000B1B8B996EB3, copyLookForLsn 00000B1C261B4FD3 2010‑06‑07‑18.104.22.168065+120 I8675153A549 LEVEL: Error PID : 2498668 TID : 27358 PROC : db2sysc 56 INSTANCE: inst1 NODE : 056 DB : SAMPLE APPHDL : 998‑22947 APPID: *N998.inst1.100607192315 AUTHID : LOADDSF EDUID : 27358 EDUNAME: db2agntp (SAMPLE) 56 FUNCTION: DB2 UDB, data protection services, sqlptudo, probe:1010 RETCODE : ZRC=0x87100029=‑2028994519=SQLP_BADLSN "Invalid LSN value." DIA8538C An invalid log sequence number (LSN), the value was "".
If logs are corrupt, it can result into a severe problem during roll-forward of the database, crash recovery, HADR logs reply, etc., where you actually reply to logs. Database roll-forward enables you to maintain consistency in the database. It recovers a database by applying transactions recorded in the database log files. Roll-forward is a process invoked after a database or a tablespace backup image has been restored.
Packed descriptor corruption
A packed descriptor is a column within the system catalog tables that DB2 uses to identify the details of a database object. If it is corrupt for some reason, you see errors in db2diag.log as shown below.
Listing 5. Sample error message for packed descriptor corruption
If you have corrupt PD, the table becomes inaccessible.
INSPECT to identify corruption
db2dart is a command that can be used to verify the architectural correctness of databases and the objects within them. It can also be used to extract data from tables that might otherwise be inaccessible due to corruption.
db2dart should never be run against a database that still has active connections.
db2dart accesses the data and metadata in a database by reading them directly from disk. So, if there are connections,
db2dart will be unaware about pages in the buffer pool, control structures in memory, etc. and may report false errors as a result. Similarly, if you run
db2dart against a database that requires crash recovery or that has not completed roll-forward recovery, you may get inconsistent results.
To display all possible options, execute the
db2dart utility without any parameters. Some options of
db2dart require user input, such as tablespace ID and table ID, which are prompted for if not explicitly specified in the command line.
Inspecting databases, table spaces, and tables using
The default behavior for
db2dart is to inspect the entire database. Only the database name must be provided in this case. By default,
db2dart will create a report file with the name databaseName.RPT. For single-partition database environments, the file is created in the current directory. For multiple-partition database environments, the file is created under a subdirectory in the diagnostic directory. The subdirectory is called DARTnnnn, where nnnn is the partition number.
If a database is large and you are only interested in one tablespace, you can use the
/TS option. When using this, you must provide the tablespace ID on the command line (by specifying the
/TSI parameter) or you can let
db2dart prompt you for it. If you do not know the tablespace ID, you can obtain it via
DB2 LIST TABLESPACES.
Similarly, a single table and its associated objects (LOBs, indices, etc.) can be inspected using the
/T option. When using this, you must provide the table name or object ID and the ID of the tablespace in which the table resides. To determine the object ID and tablespace ID for a table, you can query the SYSIBM.SYSTABLES catalog table.
Dumping formatted table data using
If a tablespace or table becomes corrupt for any reason (for example, due to a bad disk or disk controller), attempts to access the table through SQL may not work. The SQL statement may fail with an error or the database may be marked bad and all connections will be dropped.
If this happens, it may be necessary to extract all the data possible so the tablespace and table can be rebuilt. In such a situation, the
/DDEL option of
db2dart can be used to extract the table data and place it into a delimited ASCII. Following sessions discuss these options in details.
INSPECT command is similar to the
db2dart command. It allows you to check databases, tablespaces, and tables for architectural integrity by checking the pages of the database for page consistency. A significant difference between the two commands is that the database needs to be deactivated before you run
INSPECT requires a database connection and can be run while there are simultaneous active connections to the database.
INSPECT command cannot be used to format and dump data pages, format and dump index pages, format data rows to delimited ASCII, and mark an index invalid. So, it can only inspect database or its objects online for any corruption.
Approaches to recover from corruption
Database corruption issues are sometimes not very straightforward and need IBM’s support expertise to choose the best possible way out of the situation. Here, you will see how to tackle the most common types of corruption issues and look for the best possible recovery options. When corruption is detected, determining the problem source is usually of secondary importance. You would want to correct current situation first.
Recovering from data page corruption
As discussed, you need to identify corrupt pages using tools like
db2dart. Looking at the
db2dart report file generated, you can gauge the extent of corruption. Depending on the amount of data corrupt and complexity involved, you need to decide on the best possible recovery plan. Here are some options to recover:
- If there is a backup of the database, restore the database and roll forward to end of logs. This is the cleanest approach if feasible, preferably if database size is small.
- You could also restore the tablespace and roll forward to end of logs. This can be the best option if the corruption is localized.
- If you have other ways of recreating the data for the corrupt table or have a copy of the table data, drop and recreate the table. You would need the DDL of the table, and if you have the data for the table, you should be able to drop the table, recreate the table using the DDL, and recreate the data through whatever means you may have.
- If you do not have a valid backup image and any way to recreate the table, you may use
/ddelto salvage data from a corrupt table. Before that, you need to have the DDL of the table using
db2look. You can fetch the DDL of a corrupt table using:Here is an example of using
db2look ‑d <dbname> ‑e ‑z <schema_name> ‑t <table_name> ‑o <output_file_name>
/ddelto salvage data from a corrupt table:
db2dart <dbname> /ddel. This command requires four input values: a table object ID or table name, tablespace ID, page number to start with, and number of pages. The number of pages can be a specific value or a large enough value (for example, 999999999) to extract all pages in the table. Also, if a particular page in the table contains too much damage, the
db2dart /DDELmay need to be executed more than once for the range of pages and omit the damaged page. The dumped delimited ASCII file is encoded in the database code page. The
db2dartcommand does not perform code page conversions. Once you have salvages all the data from corrupt table, you can check the output file *.DEL to make sure all the data exists. Once that is done, you can drop the corrupt table and recreate it later with data extracted using
db2dart. Note that
db2dart /DDELdoes not work against lob data.
- If you have a way to recreate the corrupt tablespace, you can mark the corrupt tablespace in drop pending state using
restart database. You can recreate corrupt tablespace later.
- If none of the above options are feasible or give some error during extracting data, dropping the table while table recreate etc., approach IBM support for help. IBM support may help you to drop the corrupt table, initialize corrupt pages to NULL, etc., depending on the situation.
Recovering from index corruption
If there are indications of index corruption in db2diag.log and/or
db2dart reports, you can mark an index invalid using
db2dart and get rid of bad indices. You can recreate indices later.
db2dart has an option to mark an index invalid and make it drop pending. You can mark a corrupt index invalid with
db2dart /MI. For example: `db2dart
You can decide when to recreate an index by setting the
INDEXREC parameter to restart, access, etc. To recreate indices when an application tries to access the index, you can update
INDEXREC to access:
db2 update db cfg for <dbname> INDEXREC
Once you have invalidated the bad index and updated
INDEXREC, you can connect to the database.
If the problem is an index NOKEY error, then a number of
db2dart commands are printed in the db2diag.log file. You can run these
db2dart <dbname> /di commands to dump formatted index data for root-cause analysis of index corruption if required. You need to save these commands using
grep on UNIX® or Find on Windows® and save them in a file. Edit the file and replace DBNAME with the database name. If the problem has been hit multiple times, there could be duplicate entries. You only need to keep the latest set of
db2dart commands. To confirm the bad index using
db2dart, you can issue
db2dart <dbname> /t /tsi <tablespace_id> /oi <table_id> where
table_id are the tablespace ID and object ID of the base table on which the index is defined.
Recovering from CBIT errors
To fix CBIT corruption errors, examine the extent of the problem by running a
db2dart on at least the corrupt table (better to run it against the whole database). You can decide on approaches discussed above depending on whether CBIT errors are on the data page or index. The most feasible option to recover from CBIT errors is to restore the database or tablespace (if errors are localized).
Recovering from log corruption
Log corruption is a matter of concern during log reply. You need to replay logs during database or tablespace roll-forward, log replay on HADR standby, and crash recovery. In the event of log corruption, the database may be fine and only the log may be damaged.
If you have errors due to bad logs during roll-forward, the very first thing that can be done is to check which log file DB2 is reporting error for.
db2flsn can be used to return the name of the file that contains the log record identified by a specified log sequence number (LSN). So if you have ‘bad_lsn’ messages in db2diag.log, you can use
db2flsn to find the corresponding log file.
If it is for a missing log file or a log file from an incorrect log chain, you can look for the correct log file. If roll-forward fails due to a corrupt log, you can go for point-in-time roll-forward. The specified point in time for the roll-forward operation must be equal to or later than minimum recovery time. The minimum recovery time (MRT) is the earliest point in time during a roll-forward when a database is consistent. If you cannot roll forward logs to at least MRT, you need to contact IBM support for assistance. Another option would be to restore from an offline database backup and not roll forward the logs. Transactions in logs would not be applied to the database in this case.
If you have issues in crash recovery due to a bad log file, you need to restore recent backup or contact IBM support for assistance.
In HADR terminology, the server that processes transactions is known as primary and the partner database that receives logs and replays them is referred to as standby. On HADR standby, during log reply, standby may crash due to a bad log. You can check db2diag.log on standby to figure out the bad log file and try shipping a good copy of that log file from primary. Once a good log file is in place, you can start HADR:
a. Run start on HADR of standby node:
db2 start hadr on db <dbname> as standby.
b. Run start of HADRon primary node:
db2 start hadr on db <dbname> as primary.
If the above attempt fails, you may need to reconfigure HADR with fresh backup from primary and restore it on standby. If you pass a copy of the bad log file to IBM DB2 support, they could examine its contents to see what was in it that might give some indication of what went wrong.
You can use
db2cklog to check the validity of archive log files to determine whether the log files can be used during roll-forward recovery of a database or tablespace. Either a single archive log file or a range of archive log files can be checked. A log file that returns an error during validation by
db2cklog will cause the recovery operation to fail.
Recovering from packed descriptor corruption
You can use the db2cat tool to fix the corrupt the packed descriptor. You need to consult IBM support before modifying a packed descriptor. Perform the following steps to modify the packed descriptor:
export DB2SVCPW=<service_password from IBM support>
db2cat -d <dbname> -s <schema> -n <tablename> -f <raw pd output file> -o <message file>
db2cat -d <dbname> -s <schema> -n <tablename> -g <generated pd output file> -o <message file>
db2cat -d <dbname> -s <schema> -n <tablename> -r <generated pd output file> -o <message file>(use the file generated with -g as the input for the replace option “-r”).
- Export the data from the table (if required)
- Drop the table
- Recreate the table (if required)
- Import the user data (if required)
You can now run the verify option of
db2cat on the database again:
db2cat -d <dbname> -s % -n %
If you do not need the tables, you can drop them. Otherwise, you should replace the packed descriptor, extract the data, then drop and recreate the table.
Preventive strategies to avoid possible corruption
Database corruption can be subtle and difficult to detect. So there could never be a tool developed that would be able to detect every single corruption that could possibly happen. Corruptions and procedures that risk the possibility of corruption should always be avoided.
It is essential to understand various preventive measures to avoid possible downtime and database crash. Here are some of the best practices to help to identify corruption in a database well before a system crash:
- Keep track of all changes.
- Be on the latest fixpack and, if possible, the latest version of DB2 and operating system (if applicable).
- Regular check on file system health, network issues.
- As much as possible, shut down DB2 gracefully.
db2dartagainst your database when offline to check for corruption. If you don’t have the luxury of downtime required to run
db2darton the production database, restore recent production backups on to test machines and run
db2dart. Alternately, you can also use
INSPECTwhen database is online. It can work as early detection or proactive health-checking for corruption.
- Have a good backup policy. Backup does not detect corruption in a page, so it is recommended to have a strong backup policy and enough backup generations.
- Disk configurations like RAID help minimize data corruption by using redundant disks to back up data.
- Good power backup to combat corruption due to power surges.
- Track latest defects in IBM DB2 and operating system.
The article discussed the most common corruption issues when using IBM DB2. It has demonstrated various symptom messages of corruption issues in db2diag.log, how to identify type of corruption based on these messages, and how to troubleshoot these issues. It also described the
INSPECT commands, which are helpful in dealing with corruption issues.