Administration Guide

Using the LOAD Utility

The LOAD utility is intended for the initial load or an append of a table where large amounts of data are moved. There are no restrictions on the data types used by the LOAD utility including large objects (LOBs) and user-defined types (UDTs). The LOAD utility speeds up the task of loading large amounts of data into a database. LOAD is faster than IMPORT because LOAD writes formatted pages directly into the database while IMPORT does SQL INSERTs. The data being loaded must be local to the server (unlike IMPORT and EXPORT where data can be passed from the client).

In addition to the overview to the LOAD utility above, here are some details about the LOAD utility that may be of interest to you. The LOAD utility also almost completely eliminates the logging associated with the loading of data. In place of logging, you have the option of making a copy of the loaded portion of the table. LOAD does not fire triggers; and does not perform referential, and table, constraint checking (other than validating the uniqueness of the unique indexes). Tables with such options defined may be populated faster, or more simply, using IMPORT. If you have a recoverable database, you can do one of the following:

Use the non-recoverable LOAD option (and not require a backup)
Explicitly request a copy of the loaded portion of the table be made
Take a backup of the table space(s) in which the table resides after the completion of the load.

The LOAD utility can take advantage of a hardware configuration where multiple processors and/or multiple storage devices are used such as in a symmetric multiprocessor (SMP) environment. There are several ways in which parallel processing of large amounts of data can take place using the LOAD utility. One way is through the use of multiple storage devices which allows for I/O parallelism during the LOAD process. Another way involves the use of the multiple processors in an SMP environment which allows for intra-partition parallelism. And both can be used together to provide even faster loading of the data.

The following topics provide more information:

"Overview of the LOAD Process"
"Details About LOAD"
"LOAD Performance Considerations"
"LOAD Temporary Space Limitations"
"Restarting LOAD and Database Recovery"
"LOAD Exception Table"
"Checking For Constraint Violations"

Overview of the LOAD Process

There are multiple phases to the LOAD process: Load, where the data is written into the table; Build, where the indexes are created; and Delete, where the rows that caused a unique key violation are removed from the table. You must run the SET CONSTRAINTS SQL statement after the load completes if there are tables left in the check pending state to validate the table for referential integrity and check constraints. The LOAD utility generates messages about the progress of each phase. If a failure occurs during the LOAD process, these messages will assist you in deciding how to recover.

During the Load phase, data is loaded into the table; index keys and table statistics are collected if necessary. Save points, or points of consistency, are established at intervals specified by you in the SAVECOUNT parameter on the LOAD command. These points of consistency are not established exactly on the number of rows specified with the SAVECOUNT parameter; rather the number of rows are converted to a page count, and rounded up to intervals of the extent size. Messages let you know how many input rows were successfully loaded at the time of the save point. If a failure occurs, you should use the number of input rows at the last successful consistency point with the RESTARTCOUNT parameter during a restart. If the failure occurs near the beginning of the LOAD process and you were doing a REPLACE, you might consider restarting the load using the REPLACE option.

During the Build phase, indexes are created based on the index keys collected in the Load phase. The index keys are sorted during the Load phase and index statistics are collected. The statistics collected are similar to those collected during RUNSTATS. If a failure occurs, the Build phase restarts from the beginning.

Unique key violations are placed into the exception table, if one was specified, and messages on rejected rows are put into the message file. Following the completion of the LOAD process: review these messages, correct any problems, and insert the corrected rows into the table.
Note: The recording of warnings has a detrimental effect on the performance of the LOAD. If performance is important, and you anticipate a large number of warnings, you should consider using the NOROWWARNINGS filetype modifier. If this filetype modifier was specified, these warnings are suppressed.

During the Delete phase, all rows causing a unique key violation are deleted. If a failure occurs, this phase should be restarted by you from the beginning. Information on the rows containing the invalid keys is stored in a temporary file. After you request a restart to begin at the Delete phase, the violating rows are deleted based on the information in a temporary file. You must not modify any data in any temporary files. Also, you must restart the LOAD command with the same parameters, otherwise the Delete phase will fail. If the temporary file has been modified, or does not exist, you should restart the LOAD command at the Build phase. Once the index is re-built, any invalid keys are placed in the exception table if it exists, and duplicate keys are deleted.

Since regular logging is not performed, LOAD uses pending states to preserve consistency of the database. The Load and Build phases of the LOAD process place any associated table spaces into a load pending state. The Delete phase of the LOAD process places any associated table spaces into a delete pending state. If you complete the LOAD process but you do not have either logretain or userexit "on"; and, you have not specified the COPY YES option nor the NONRECOVERABLE option, then any associated table spaces are placed in a backup pending state. These states can be checked by using the LIST TABLESPACES command. (For more information on LIST TABLESPACES, see the Command Reference manual.) One last possible state associated with the LOAD process is concerned with referential and check constraints. Dependent tables may be placed in a check pending state following the completion of the LOAD process.

LOAD can also be used with a non-recoverable option. This allows you to perform a non-recoverable LOAD without affecting the recoverability of all other tables in the database. When this type of LOAD is run, there is no requirement for either using the COPY YES option or having a backup taken.

If a LOAD fails, the table space(s) involved could be in an inconsistent state because there is no logging. For this reason, the table spaces are left in a load pending state. To remove the load pending state, you will have to restart the LOAD, perform a LOAD REPLACE on the same table on which the LOAD failed, or recover the table space(s) using a RESTORE with the most recent backup (either table space or database backup) and then carry out further recovery actions. (You could also drop the table space and then re-create it.)

For more information on how to recover, see Chapter 6. "Recovering a Database".

Details About LOAD

The LOAD utility inserts data into a table from an input file, from a device, or using a named pipe, any of which must reside on the node where the database resides. The table must exist. (Indexes on the table may or may not exist. LOAD only builds indexes that are already defined on the table.) If the table receiving the new data already contains data, you can replace or append to the existing data.
Note: If you are loading a table that already contains data and the database is non-recoverable, make sure that you have a backed-up copy of the database, or the table space for the table being loaded, before using LOAD so that you can recover from errors.
If the existing table is a parent table containing a primary or unique key referenced by a foreign key in a dependent table, replacing data in the parent table places the dependent table in a check pending state. The SET CONSTRAINTS statement must then be used to validate the referential and check constraints.

The table spaces in which the loaded table resides are quiesced in exclusive mode. For more information on QUIESCE, see the Command Reference manual.
Note: In the following command line processor example, and in the other in the other examples in this chapter, relative path names are used. Please be aware that relative path names are only allowed on calls from a client on the same node as the database. The use of fully-qualified path names is recommended.

The following is an example of the command line processor syntax for the LOAD command:

     db2 load from stafftab.ixf of ixf messages staff.msgs
        insert into userid.staff copy yes use adsm data buffer 20

This example assumes no indexes are involved, any warning or error messages are placed in staff.msgs, a copy of the changes made is stored in ADSTAR Distributed Storage Manager (ADSM), and 20 pages of buffer space are to be used during the load. See "ADSTAR Distributed Storage Manager" for more information on using ADSM.

If indexes are involved, the following is an example of the command line processor syntax for the LOAD command:

     db2 load from stafftab.ixf of ixf messages staff.msgs
        remote file stafftmp replace into staff using . sort buffer 200

This example is similar to the previous one except that: stafftmp is used as a base file name for temporary files such as "stafftmp.msg", "stafftmp.log", and "stafftmp.rid"; the current directory of the user is used as the working directory; and 200 pages of buffer space are used to sort index keys. For more information on the LOAD command, see the Command Reference manual.

Since you will typically be loading large amounts of data using the LOAD command, a LOAD QUERY command can be used to check the progress of the LOAD process if the REMOTE FILE option has been specified. You require a connection to the same database and a separate CLP session to enter this command. This command can be used by local and remote users. For more information on the LOAD QUERY command, see the Command Reference manual.

The LOAD utility can also be invoked by the application programming interfaces (APIs) sqluload and sqlgload. For more information on the requirements when loading data to a table using these APIs, see the API Reference manual.

You should also review the following points. Each represents a task that you may need to perform and each is carried out as part of the LOAD command. For more information on any of the following tasks, see the API Reference and/or the Command Reference manuals.

Definition of the path(s) and the input filename(s) in which the LOBs are stored. The use of LOBSINFILE in the MODIFIED BY parameter will tell the LOAD utility that all LOB data is being loaded from files.
Determine if column values being loaded have implied decimal points or not. The use of IMPLIEDDECIMAL in the MODIFIED BY parameter will tell the LOAD utility that there are decimal points to be applied to the data as it enters the table. For example, with the IMPLIEDDECIMAL, the value 12345 is loaded into a DECIMAL(8,2) column as 123.45, NOT 12345.00.
Determine whether statistics will be gathered during the load process. (This option is not supported if the load is in INSERT or RESTART mode.) Gathering statistics is only valid when indexes are already defined on the table prior to the load.
Define the method to use for loading the data: column location, column name, or relative column position.
Determine whether to keep a copy of the changes made. (This option is not supported if forward log recovery is disabled for the database; that is, LOGRETAIN=OFF and USEREXIT=OFF.) If no copy is made, and forward log recovery is not disabled, the table space is left in a backup pending state at the completion of the load.
For more information on how to recover, see Chapter 6. "Recovering a Database".
Declare the location of the directories holding temporary files during the creation of indexes. The directories are defined by the USING parameter. When defining the directory, if it is unqualified and a local connection, the directory will be qualified using the current directory of the user running the load. If the directory is unqualified and part of a remote connection, the load will fail. If the parameter is not defined by the USING parameter, the default is to create the files in the tmp subdirectory in the directory defined by the DB2INSTPROF environment variable. You should be aware that the amount of information stored in this directory will be at least equal to the size of all the indexes and possibly twice that size.
For all other temporary files either a base name defined by REMOTE FILE is used or, the default name db2utmp is used. There are many temporary files that the load process may create: The temporary files for sorting go in the path specified by the USING option; other temporary files go in the path specified by the REMOTE FILE option. Only the remote message temporary file is of any interest to you. You can use the LOAD QUERY command to save the messages generated by the load process to a file that can later be viewed as output. See the Command Reference for further information on these topics.
The remote file resides on the server machine and is accessed by the DB2 instance exclusively. Therefore, any filename qualification given to this parameter must reflect the directory structure of the server, not the client, and the DB2 instance owner must have read and write permission on the file. In addition, the user must ensure that two loads are not issued having the same fully-qualified remote filename.
There are several ways in which the remote filename may be selected and qualified when the user has just given a partially qualified name, or no name at all.
Notes:
1. No remote filename is given in a load operation where the user is on the same machine as the database instance. In this case, the load utility uses the name db2utmp and qualifies it with the current working directory of the user. Two loads from the same directory with this option will clash in the use of the remote filename. Therefore, this option is not recommended.
2. No remote filename is given in a load operation where the user is on a different machine than the database instance. In this case, the load utility will generate a name that will reside in the database directory. This effectively prevents the user from using the load query facility, since the facility requires the name of the remote file. In addition, the filename generated is not guaranteed to be unique, and therefore clashes may occur between different load operations. Therefore, this option is not recommended.
3. A non-fully-qualified filename is given in a load operation, where the user is on the same machine as the database instance. In this case, the name is qualified by using the current directory of the user. The user must ensure that the two loads are not issued from the same directory with the same remote filename.
4. A non-fully-qualified filename is given in a load operation, where the user is on a different machine than the database instance. In this case, the load utility will reject the filename. It must be fully-qualified from the client.
5. A fully-qualified filename is given in a load operation. This will be the filename used. The user must ensure that two loads are not issued with the same remote filename. This is the recommended usage.
You should use a remote filename if you wish to use the LOAD QUERY tool. It is difficult to determine the remote filename if you do not provide the one to be used. For more information on LOAD and the importance of the filename, see "Restarting the LOAD".

To load data into a table, you must have either SYSADM or DBADM authority.

There are restrictions and limitations with the LOAD utility:

Attempts to create or drop tables in a table space that is in load pending state will fail.
You cannot load data into a database accessed through DB2 Connect or to a downlevel server other than DB2 Version 2.
Note: Options only available with this release of DB2 cannot be used with a server from the previous release.
If a REPLACE is performed on a table and an error occurs during the LOAD, the original data in the table is lost. Keep the LOAD input to allow the LOAD to be restarted if an error occurs during the process.
If there is no value for a NOT NULL column, the row is rejected since the LOAD utility attempts to load a NULL value. This condition also applies when there is no value for a NOT NULL WITH DEFAULT column. If a NOT NULL WITH DEFAULT column is not one of the columns being loaded, it is filled with the DEFAULT values.
Triggers are not activated on newly-loaded rows. Any business rules forming these triggers are not enforced by the LOAD command.

The LOAD utility optionally updates table and index statistics as part of the load process if run in REPLACE mode. If data is appended into a table, statistics are not gathered for the table. Run RUNSTATS following the completion of the load process to collect up-to-date statistics for the table. If gathering statistics on a table with a unique index, and duplicate keys are deleted during the Delete phase, then the statistics are not updated to account for the deleted records. If you think you will have duplicate records, you should not collect statistics during the LOAD but run RUNSTATS after the LOAD process.

To ensure the loaded data doesn't cause referential integrity or check constraint violations, any loaded table that is a parent table will cause all dependent tables to be placed in a check pending state. To remove the table from the restricted state, you must run the SET CONSTRAINTS statement.

With LOAD, there is a possibility of unequal code page situations involving possible expansion or contraction of the character data. Such situations could occur with Japanese or Traditional-Chinese Extended UNIX Code (EUC) and double-byte character sets (DBCS) which may have different length encodings for the same character. An option, NOCHECKLENGTHS, is used to toggle between two situations:

Comparison of input data length to target column length is performed before reading in any data. If the input length is larger than the target, NULLs are inserted for that column if it is nullable. Otherwise, the request is rejected. This is the default.
No initial comparison is performed and, on a row-by-row basis, an attempt is made to load the data. If the data is too long after translation is complete, the row is rejected. Otherwise, the data is loaded. Specifying NOCHECKLENGTHS will enable this behavior.

LOAD Performance Considerations

The performance of LOAD depends on the nature and size of the data, the number of indexes, the options used, and whether the SET CONSTRAINTS statement is required. Use of SET CONSTRAINTS lengthens the total time needed to load the table and make it usable again. (For more information on the SET CONSTRAINTS statement, see the SQL Reference manual.)

The LOAD utility performs almost equally well in either INSERT mode or REPLACE mode.

Index creation will reduce the performance of the load process, especially when data is added to a table already containing data. If there are many indexes on a table which already has a large amount of data and only a small percentage of data to be loaded, you should consider using the IMPORT utility instead of the LOAD utility. Unique indexes also reduce the performance of the load process if duplicates are encountered. In most cases it is still more efficient to create the index during the LOAD than to complete the LOAD and then use the CREATE INDEX command for each of the indexes.

The LOAD utility automatically attempts to provide the best performance possible by determining the best values for DISK_PARALLELISM, CPU_PARALLELISM, DATA_BUFFER, and SORT_BUFFER if these parameters have not be specified by the user at the time the utility is run. Optimization of these values is done based on the size and the free space available in the utility heap. Consider allowing the LOAD utility to choose the values for these parameters and then attempt to tune the parameters for your particular needs.

Performance of the LOAD can be improved by:

Using the FASTPARSE option in the MODIFIED BY parameter reduces the data checking done on the user-supplied column values, and performance is enhanced. This option should only be used when the data being loaded is known to be valid. This may improve the performance of the LOAD process by about 10 or 20 percent.
Specifying as many devices for the temporary sort directories as you can so that there is opportunity for parallel I/O during sorting. When very large sort areas are required on systems with restricted file sizes then multiple directories are required.
Specifying as large a value for the sort buffer as possible. The sort buffer memory is allocated from the utility heap. Ensure that the utility heap is defined to be large enough.
Using CPU_PARALLELISM during the LOAD so that you can choose to preserve or not preserve the order of the input data being loaded. The default is to preserve order. Specify ANYORDER on the LOAD command to override this default and gain additional performance improvements.
Note: The additional performance improvements would be to the create index time, but may hurt in the query performance time if the original data was in clustered index sequence.
Using the SORT BUFFER and DATA BUFFER parameters. The SORT BUFFER will improve performance only if indexes are created when using LOAD. The DATA BUFFER parameter specifies the total amount of memory allocated to the LOAD utility as a buffer. The SORT BUFFER parameter specifies the amount of memory allocated for sorting index keys. (This assumes there is real storage to handle the increased buffer size and prevent increased paging.) Both buffer allocations come from the utility heap. You can modify the util_heap_sz database configuration parameter accordingly. See "Utility Heap Size (util_heap_sz)".
Specifying multiple directories in different file systems for the USING parameter. Specifying multiple directories improves performance only if indexes are created during the LOAD.
Controlling the parallelism used during the LOAD in a machine environment where symmetric multi-processor (SMP) exploitation is possible. You can control parallelism in two ways:
- By using the parameter CPU_PARALLELISM. This parameter controls the degree of paralellism used when parsing, converting, and formatting records.
  Note: When tables include either LOB or LONG VARCHAR data, CPU_PARALLELISM is set to one. Parallelism is not supported in this case.
- By using the parameter DISK_PARALLELISM. This parameter controls the degree of parallelism used when writing data to the table space containers.
algorithmic maximum. If the number for these parameters is zero (0), or is not specified, the default is determined at the time the utility is run based on database and system configurations and data characteristics. For more information on the LOAD command, see the Command Reference manual.
Using the BINARY NUMERICS and PACKED DECIMAL parameters when loading binary data in ASC data files, to improve the load time involving numeric data.
Installing high performance sorting libraries from third party vendors to create indexes during the load. Examples of third party sort products include: SMARTsort** and SyncSort**. This is done through the use of the DB2SORT environment variable. See "Establish Environment Variables and the Profile Registry" for more information on environment variables,

The COPY YES/NO option specifies whether to create a copy of the input data during LOAD or not. If "YES" is chosen, performance is reduced because all the data being loaded is copied at the same time. This choice is faster than accepting a backup pending state and having to do a backup later before accessing the table. If "NO" is chosen, and forward recovery in enabled, then the table is placed in a backup pending state.

LOAD Temporary Space Limitations

When creating indexes during the LOAD, you require at least as much disk space as the sum of the index sizes and possibly twice as much. The space used is temporary; that is, it is located outside the database in the directories specified for the USING parameter or in the tmp directory defined by the DB2INSTPROF environment variable.

Restarting LOAD and Database Recovery

A LOAD can be restarted following a failure. A copy image of the loaded data can be created for use when recovering a database.

The following discuss these considerations in more detail:

"Restarting the LOAD"
"Creating a Copy Image of Loaded Data"

Restarting the LOAD

If a failure occurs while loading data, you can restart the load from the last save point or point of consistency; or reload the entire table by using the REPLACE option.

The remote file specified in the LOAD restart operation should be the one that was specified in the LOAD command that failed.

There are a number of options available should you decide to restart the load.

If you decide to restart using the RESTARTCOUNT number option, then you must use the number of rows at the last successful consistency point. To determine that value, use the LOAD QUERY command with either the name specified with the REMOTE FILE option or the default name db2utmp. By choosing RESTARTCOUNT number, the LOAD restarts from the row following the row identified by the number and attempts to finish the load.
Note: The RESTARTCOUNT number can only be used with the last successfully completed consistency point. If the last consistency point started but did not complete (that is, SQL3519W is not followed by SQL3520W), then you must carry out the action as described in the help for message SQL3519W.

If you do not want to continue loading rows, or if the failure was during the Build phase, you can use the RESTARTCOUNT B option. The LOAD process brings the table to the state of the last save point or point of consistency and then restarts the Build phase. By choosing this option the LOAD restarts, does not attempt to load additional rows, and builds the indexes for the rows already loaded.

If the message file states that the Build phase completed and all temporary files are unmodified as left by the LOAD, you can use the RESTARTCOUNT D option. The information on the rows containing duplicate keys stored in the temporary files is used to delete those rows.

The restarted LOAD command should continue until the completion of the LOAD process.
Note: For minor errors such as nonexistent data files or an invalid dcoldata, the LOAD will clean up and take the table out of the load pending state. You must do the LOAD again in either REPLACE or INSERT mode with correct parameters.

Creating a Copy Image of Loaded Data

If the table being loaded is part of a recoverable database, then logging is in effect. Since LOAD does not log the changes made to the table, you have the option of specifying COPY YES to create a copy of the data being loaded. This copy is used during roll-forward recovery to re-create the changes to the database done by LOAD.

When using this option you should also consider using multiple devices or directories to allow for the best possible I/O exploitation.

For more information on the load copy location file, see "Using the Load Copy Location File".

If forward recovery is enabled (logretain or userexit is "on") and the COPY option was not used, all table spaces in which the loaded table resides are left in a backup pending state. A backup of the database or the table space(s) is required to remove this pending status. The backup is done before any other units of work against the database or table space can be started.

For more information on how to recover, see Chapter 6. "Recovering a Database".

During forward recovery, if the load copy is not available, then the table spaces (of the loaded table) which are being rolled forward will be set to the restore pending state. These table spaces must be restored from a backup image taken after the table load.

An error is returned if you specify NONRECOVERABLE and COPY YES. There is no need for a copy in such a case since it would not be needed during recovery.

LOAD Exception Table

The exception table is a user-created table which mimics the definition of the table being loaded. It is specified by the FOR EXCEPTION option on the LOAD command. The table is used to store copies of rows that violate unique index rules.
Note: Any rows rejected before the building of the index on the loaded table because of invalid data are not inserted into the exception table.

Rows are added to the existing information in the exception table. The existing information may include rows listing check constraint or foreign key violations; or invalid rows from a previous LOAD. If you want only the invalid rows from this LOAD, you will need to remove the existing rows before invoking LOAD.

The exception table used with the LOAD utility is identical to the exception table(s) used by the SET CONSTRAINTS statement. An exception table should be created to perform a load which has a unique index and may have duplicate records. If an exception table is not provided for the LOAD, and duplicate records are found, then the LOAD will continue. However, only a warning message is issued about the deleted duplicate records and the deleted duplicate records are not placed anywhere.

After completing the load, information in the exception table can be used any way you wish. You may want to use the information to correct any data that was in error and insert the rows into the original table.

The exception table message column has the following structure:

Table 22. Exception Table Message Column Structure for LOAD

Field number Contents Size Comments
1 Number of violations 5 characters Right justified padded with '0'
2 Type of first violation. Only "I" is used by LOAD 1 character 'I' - Unique Index violation
3 Length of constraint/index token 5 characters Right justified padded with '0'
4 Constraint/index token length from the previous field

Note: Only Unique Index violations will be reported by LOAD. The Check Constraint and Foreign Key violations will be reported by running the SET CONSTRAINTS statement with the IMMEDIATE CHECKED FOR EXCEPTION options. Only one unique index violation in a row is reported. LONG or Large Object (LOB) data is not inserted into the exception table. Index token is the "IID" value from SYSCAT.INDEXES that identifies the index.

Checking For Constraint Violations

The loaded table may be in the check pending state if it has table check constraints or referential integrity constraints defined on it. The STATUS flag of the SYSCAT.TABLES in the row corresponding to the loaded table indicates the check pending state of the table. For the loaded table to be usable, the STATUS must have a value of "N" indicating the normal state of the table.

To remove a table from the check pending state, use the SET CONSTRAINT statement. For more information on the SET CONSTRAINTS statement, see the SQL Reference manual. One or more tables may be submitted to be checked in a single invocation. For a dependent table to be checked, the parent table must not be in the check pending state. In the case of a referential integrity cycle, all the tables involved in the cycle must be included in a single invocation of the SET CONSTRAINTS statement.

To manage the loading of several tables, consider the position of each within referential relationships along with the table size and time windows available to carry out the load. It may be convenient, for example, to check the parent table for check constraint violations while the dependent table is loaded. This can only occur if the two tables are not in the same table space.

Exception tables are convenient for a consolidated report of all the rows that have constraints violated. If the exception table option is not used, only the first violation is reported. This may be a cause for frustration when dealing with large tables having more than one constraint violation. The same exception table used for the LOAD utility may be used for checking constraint violations. As with the LOAD utility, there is no checking done when running the SET CONSTRAINTS statement to ensure that the exception table is empty. The extra timestamp column in the exception table may be used to distinguish newly-inserted rows from the old ones, if necessary.

The SET CONSTRAINTS statement does not activate any DELETE triggers as a result of deleting rows that violate constraints. It must be noted, however, that once the table is removed from the check pending state, triggers are active. This implies that, if we correct data and INSERT rows from the exception table into the loaded table, any INSERT trigger defined on the table will be activated. The implications of this on the data should be considered and, if necessary, suitable action should be taken. One option is to drop the INSERT trigger, INSERT rows from the exception table, and then recreate the INSERT trigger.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]