Troubleshooting Guide

Troubleshooting Tips

This section describes how to deal with some frequently encountered problems faced by users working with DB2 Extended Enterprise Edition. It addresses the following topics:

"Extended Enterprise Edition Won't Install"
"Severe Error Occurs on System"
"Determining Failing Node"
"Database Manager Won't Start or Hangs"
"Cannot Create Database"
"Commands Cannot Be Issued"
"Problems with Splitting and Loading Data"
"Problems with Indexing Data"
"Problems During Recovery"
"Developing Applications in DB2 Extended Enterprise Edition"

Important: This section represents a small sampling of the information available from DB2 Customer Service. For a complete and up-to-date source of DB2 information, use the DB2 Product and Service Technical Library at http://www.software.ibm.com/data/db2/library/.

Extended Enterprise Edition Won't Install

Symptom

The DB2 Extended Enterprise Edition product won't install.

Action

Ensure that:

[ ]: The DB2 product directory on each node has enough free disk space. This directory is /usr/lpp/db2_05_00 on AIX.
[ ]: All prerequisite software products are installed on the system.
[ ]: The same version of the product is not already installed on the system.

Severe Error Occurs on System

There are two distinct types of severe error:

A DB2 process is killed because of a programming exception.
In this case, the database manager is immediately terminated at the node and active units of work are not rolled back. (That is, transactions stay indoubt.) When other nodes detect the failure, depending on their relationship with the failing node they will attempt to recover as follows:
- If the failing node was the catalog node of the database, the entire database is brought down at all other nodes because all agents are forced off the node.
  In this situation, you must restart the node by running db2 RESTART DATABASE for all databases on that node. (See the Command Reference.)
- If the failing node was the coordinator for the application, all subagents running at the node on behalf of the coordinator are forced off the node and the active unit of work is rolled back.
- If the failing node was a subordinate node, any coordinator agent which has an active unit of work that includes the failed node rolls back the unit of work if it was not already in the COMMIT phase. If it is in the COMMIT phase, the SQL code -279 is returned to indicate that the transaction is indoubt.
A problem caused the database to be marked as inconsistent.
Any attempt to access the database returns SQL code -1034 (SQL state 58031) or SQL code -1015 (SQL state 55032).
In this case, all coordinator agents and subagents are forced off the database. These agents then roll back the current unit of work and disconnect from the database.
Before any application on any other node can access the database at this node, you must put the database in a consistent state by running db2 RESTART DATABASE against it.

The severe error SQL code -1224 (SQL state 55032) can occur for a variety of reasons. If you receive this message, check the SQLCA to determine which node failed, and then the db2diag.log file for additional details. For information, see "Determining Failing Node" and "Interpreting the db2diag.log File".

Note: In the event of a severe error where multiple machines were involved, you may find diagnostic information in their syslog files rather than in the db2diag.log file (if they were not able to obtain an NFS lock on the db2diag.log file).

Determining Failing Node

To find out which node failed, check the db2diag.log file on the coordinating node for the application. If a node returns an error or warning, the node number is identified in the SQLERRD(6) field of the SQLCA. This number corresponds to the node number in the db2nodes.cfg file. (If an SQL statement or API call is successful, the node number in the SQLERRD(6) field is that of the coordinator agent.)

For more information on the db2diag.log file and SQLCA structures, see "Interpreting the db2diag.log File".

Database Manager Won't Start or Hangs

db2start fails

Symptom

The db2start command fails.

Action

Ensure that:

[ ]: The product was successfully installed.
[ ]: Your user ID has SYSADM, SYSCTRL, or SYSMAINT authority. See the Administration Guide for details.
[ ]: You have the same user ID, group identifier, and password set for the instance owners on all nodes.
[ ]: You can use the rsh command from the DB2 instance id on all nodes defined in the db2nodes.cfg file. (To do this, add an entry for the node in the $HOME/.rhosts file or the hosts.equiv file. Ensure that the file permissions on these files allow read access for all users.)
[ ]: If you are using the high performance switch (HPS) on RS/6000 SP systems, ensure that it is up on all nodes. See the description for spmon in "Commands for DB2 Extended Enterprise Edition".
[ ]: The db2nodes.cfg file (located in the $HOME/sqllib directory) contains the correct information for all nodes defined in the system (that is, the host names and net names are valid).
[ ]: FCM communication is properly enabled. You must create an entry for each DB2 logical port in the /etc/services file. Refer to the DB2 Extended Enterprise Edition Quick Beginnings for details.
[ ]: The value for the DB2INSTANCE environment variable is the same for every node and it matches the instance name that you are trying to start.
[ ]: You have the appropriate permissions.
Check the permissions of the sqllib/tmp directory and, if required, update them so that the instance ID has write permission for the directory. Note that if the permissions are not correct, a problem may have occurred during installation and setup of the system.

db2start hangs

Symptom: The db2start command hangs.
Action: Ensure that the $HOME directory of the instance owner is NFS-mounted across all nodes, and that NFS is running. If the lockd and statd daemons are not running, db2start may hang. To start the daemons, execute /etc/rc.nfs, and ensure that this command is in your /etc/inittab directory.

Nodes won't start or are very slow

Symptom

Some nodes cannot be started, or run extremely slowly.

Action

Ensure that:

[ ]

You installed the DB2 Extended Enterprise Edition product on all nodes.

Although you can install the product on an NFS-mounted file system in a RISC/6000 cluster, performance will be compromised. We recommend that you install the product on each node.

[ ]

All nodes are installed at the same level of DB2 Extended Enterprise Edition.

[ ]

NFS is running and there are enough NFS daemon (nfsd) processes assigned on the machine where the home directory of the instance owner resides.

Cannot Create Database

Symptom

A database cannot be created.

Action

Determine which node is experiencing the problem. The failing node number is stored in the sqlerrd(6) field of the SQLCA returned with the CREATE DATABASE command. Once you have determined the failing node, ensure that:

[ ]

Permission on the database directory path is correct.

The instance must have permission to write into the database directory path.

[ ]

You are creating the database on a path that exists on all nodes.

[ ]

The file systems are mounted.

The file systems for the database may have been unmounted. Mount all file systems again, and then try to create the database. You should check for unmounted file systems on all nodes.

[ ]

There is enough disk space.

You may have created a small test database and decided that you needed very little disk space. However, when a database is created, a certain amount of disk space is required for the default table spaces and default logs. See the Administration Guide for details.

Commands Cannot Be Issued

Symptom

DB2 or operating system commands are not recognized.

Action

Ensure that:

[ ]: Your DB2 Extended Enterprise Edition system is properly installed.
[ ]: You have sufficient authority to run the commands.
[ ]: If the home file system of the instance owner ID is mounted over Ethernet or a high performance switch (on a RS/6000 SP machine), Ethernet or the switch are running.
To determine if this is the problem, log onto the system as "root" and try to access a file in the instance owner's home file system, or use the netstat command to determine the status of the high performance switch.

Problems with Splitting and Loading Data

For more information on using the AutoLoader to split data and load it on a node, see the Administration Guide.

For more information on the db2autold command, type db2autold -h from the sqllib/misc directory.

Data is not split

Symptom: The db2autold program apparently completed successfully, but the data is not split.
Possible Cause: The db2autold program was set only to analyze the data.
Action: Check the AutoLoader configuration file and ensure that the Mode parameter is not set to analyze data. (This option does not split data. It only analyzes the data and suggests a new partitioning map.) See the Administration Guide for details.

Data is not split correctly

Symptom

The db2autold program apparently completed successfully, but the data is not split correctly.

Action

Ensure that:

[ ]

Binary data is not in a character column.

The db2autold program cannot detect binary values in a column of any data type, unless the BINARYNUMERICS or PACKEDDECIMAL options are specified on the load command.

[ ]

The SplitNodes and OutputNodes parameters are set correctly.

Otherwise, output data files may not be correct. See the Administration Guide for details.

Data not loaded

Symptom

LOAD does not load data

Action

Ensure that:

[ ]

LOAD was run by a user with the correct authority.

Users in the same group as the instance owner who have SYSADM or DBADM authority can execute LOAD.

[ ]

The table being loaded is not already being used by another application.

LOAD cannot share tables, so it will not execute until it can lock all required tables. Determine why the tables are already locked, and ensure that the locks are released.

[ ]

LOAD was started on all nodes.

You must execute LOAD on all nodes for data to be loaded in parallel. If you have a script that sends a shell to all nodes, ensure that the LOAD execution is not serialized by the remote shell.

To check if LOAD failed on a node, look at the message files:

If you issue the LOAD command directly, you can specify the name and location of a message file using its MESSAGES option.
If you use the db2autold command, message files are located in the current working directory. There is a file for each node, and each file is named load_log.nnn, where nnn is the node number as specified in the db2nodes.cfg file.

[ ]

The format of the data is correct.

See the Command Reference for more information on using correct data formats. Note in particular that:

Data cannot contain non-delimited blanks. LOAD processes non-delimited blanks in a delimited ASCII (DEL) file as null, and cannot load this data into a non-nullable column. You must use a string delimiter for blank columns.
For non-delimited data, use the null indicator to indicate columns with null values.
Data must contain decimal points in the correct position.
For data being loaded from MVS systems, you must explicitly put the decimal points in the data before loading it. Otherwise, they are assumed to be at the end of the column.

Data doesn't match the select count

Symptom

The data that is loaded does not match the select count.

Action

Ensure that:

[ ]

The data is split correctly by your conversion program.

The db2autold program creates a header file to prevent data from being loaded to the wrong node. The header information is checked by LOAD.

However, if you use a conversion program to convert binary columns to the character format, the conversion may not be the same as that done by the db2autold program. Data will be split for the correct node, but during the conversion data may be converted to a different value and may not be hashed the same way.

[ ]

The partitioning columns match the header.

If the data is split using one set of partitioning keys, and is loaded into a table that was partitioned on other columns, the LOAD operation will fail. Partition column information is in the header of the split data file. If, however, you manually alter the header, this compromises the validation measure built into the AutoLoader.

All rows rejected

Symptom: LOAD rejected all rows.
Possible Cause: Column definitions are not correct.
Action: When using METHOD L, ensure that the data column specification is correct. Shifting columns can cause truncation errors or data that does not match the table column definition.

Rows not loaded

Symptom: LOAD completed but no rows are loaded.
Possible Cause: LOAD rejected all rows.
Action: Check the db2load file in the temporary directory where you ran db2autold to see if all the rows were rejected. If some rows were loaded, see the Administration Guide.

Problems with Indexing Data

Symptom: The CREATE INDEX statement fails or hangs.
Possible Cause: The table space is full, or there is insufficient log space.
Action: Ensure that there is sufficient disk space for the index pages and for sorting; and that there is sufficient log space.
For information on calculating disk size and on determining log size, see the Administration Guide.

Problems During Recovery

Connection appears suspended

Symptom

A connection appears suspended.

Possible Cause

The database was restarted and is still recovering.

When crash recovery is in progress, it can take a long time to recover the database from a failure, particularly when operations that require a large volume of logging activities (such as massive inserts) were running before the database crashed. This is a normal situation.

Action

Check the db2diag.log file of the coordinator node to see if crash recovery is completed. Entries in the db2diag.log file will indicate when recovery is started, and when it is finished.

The connection will finish once recovery is complete on all nodes.

SQL1061W received

Symptom: The SQL1061W message is received during recovery.
Possible Cause: There are indoubt transactions that cannot complete, typically because one or more nodes cannot be started. The database is recovered and open for user connections, but the indoubt transactions are taking up memory and other resources.
Action: Check the db2diag.log file of the coordinator node to see if all nodes are started.
Resolve the indoubt transactions as soon as possible. For information, see the discussion on heuristics in the Administration Guide.

Developing Applications in DB2 Extended Enterprise Edition

In a multinode environment, DB2 breaks up SQL statements into subsections, each of which is processed on the node that contains the relevant data. As a result, an error may occur on a node that does not have access to the application.

When developing applications for multiple nodes, consider the following:

Errors are reported in the SQLCA. (See "Interpreting an SQLCA Structure" and Appendix B. "SQL Communication Area (SQLCA)".)
If a node returns an error or warning, a node number that maps to the db2nodes.cfg file is placed in the SQLERRD(6) field of the SQLCA. (If an SQL statement or API call is successful, the node number in the SQLERRD(6) field is that of the coordinator agent.)
Indoubt transactions may be left in the database if an application ends abnormally during 2-phase commit. (An indoubt transaction pertains to global transactions when some phases complete successfully, but the system fails before all phases can complete. The database is left in an inconsistent state.) For information on handling indoubt transactions, see the Administration Guide.
An application may receive a subsequent error or warning after the problem that caused the first error or warning is corrected. Only one severe error is reported to the SQLCA at a time and the error is not overwritten. Severe errors and deadlock errors are given higher priority because they require immediate action by the coordinator agent. See "DB2 Extended Enterprise Edition Usage of the SQLCA" for more special considerations when using the SQLCA in a multinode environment.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]