This section describes National Language Support (NLS) support issues that you must consider for your applications. The major topics discussed are:
The database manager compares character data using a collating sequence. This is an ordering for a set of characters that determines whether a particular character sorts higher, lower, or the same as another.
Note: | Character string data defined with the FOR BIT DATA attribute, or BLOB data, is sorted using the binary sort sequence. |
For example, a collating sequence can be used to indicate that lowercase and uppercase versions of a particular character are to be sorted equally.
The database manager allows databases to be created with custom collating sequences. The following sections help you determine and implement a particular collating sequence for a database.
In a database, each single-byte character is represented internally as a unique number between 0 and 255, (in hexadecimal notation, between X'00' and X'FF'). This number is referred to as the code point of the character. A collating sequence is a mapping between the code point and the desired position of each character in a sorted sequence. The numeric value of the position is called the weight of the character in the collating sequence. The simplest collating sequence is one where the weights are identical to the code points. This is called the identity sequence.
For example, consider the characters B (X'42'), and b (X'62'). If, according to the collating sequence table, they both have a sort weight of X'42' (B), then they collate the same. If the sort weight for B is X'9E' and the sort weight for b is X'9D', then b will be sorted before B. Actual weights depend on the collating sequence table used which depends on the code set and locale. Note that a collating sequence table is not the same as a code page table which defines code points.
Consider the following example. In ASCII, the characters A through Z are represented by X'41' through X'5A'. To describe a collating sequence where these are sorted in order, and consecutively (no intervening characters), you can write X'41', X'42', ...X'59', X'5A'.
For multi-byte characters, the hexadecimal value of the multi-byte character is also used as the weight. For example, X'8260', X'8261' are the code points for double byte character A and B. In this case, you can write X'8260', X'8261' as the collating sequence for double byte characters A and B. These are also the code points for A and B.
The values of the weights in a collating sequence need not be unique. For example, you could give uppercase letters and their lowercase equivalents the same weight.
Specifying the collating sequence can be simplified if a collating sequence provides weights for all 256 code points. The weight of each character can be determined using the code point of the character. This is the method used to specify a collating sequence for the database manager: a string of 256 bytes, where the nth byte (starting with 0) contains the weight of code point n.
In the case of multi-byte character sets, DB2 simply uses the code point as the collating value. Multi-byte characters therefore sort the way they appear in their code point table.
Once a collating sequence is established, character comparison is performed by comparing the weights of two characters, instead of directly comparing their code point values.
If weights that are not unique are used, characters that are not identical may compare equally. Because of this, string comparison must be a two-phase process:
If the collating sequence contains 256 unique weights, only the first step is performed. If the collating sequence is the identity sequence only the second step is performed. In either case, there is a performance benefit.
For more information on character comparisons, see the SQL Reference.
To perform character comparisons that are independent of whether they are upper or lower case, you can use the TRANSLATE function to select and compare mixed case column data by translating it to upper case, but only for the purposes of comparison. Consider the following data:
Abel abels ABEL abel ab Ab
For the following select statement:
SELECT c1 FROM T1 WHERE TRANSLATE(c1) LIKE 'AB%'
you would receive the following results:
ab Ab abel Abel ABEL abels
Note: | You could also set the select as in the following view v1, and then make all your comparisons against the view (in upper case) and your inserts into the table in mixed case: |
CREATE VIEW v1 AS SELECT TRANSLATE(c1) FROM t1
At the database level, you can set the collating sequence as part of the CREATE DATABASE API. This allows you to decide if 'a' is processed before 'A', or if 'A' is processed after 'a', or if they are processed with equal weighting. This will make them equal when collating or sorting using the ORDER BY clause. If you have two values of 'a' and 'A', 'A' will always come before 'a', because in all senses they are equal, so the only difference upon which to sort is the hexadecimal value.
Thus if you issue SELECT c1 FROM t1 WHERE c1 LIKE 'ab%', you receive the following output:
ab abel abels
If you issue SELECT c1 FROM t1 WHERE c1 LIKE 'A%', you receive the following output:
Abel Ab ABEL
If you issue SELECT c1 FROM t1 ORDER BY c1, you receive the following:
ab Ab abel Abel ABEL abels
Thus, you may want to consider using the scalar function TRANSLATE(), as well as the CREATE DATABASE API. Note that you can only specify a collating sequence using the CREATE DATABASE API. You cannot specify a collating sequence from the Command Line Processor. For information on the TRANSLATE() function, see the SQL Reference. For information on the CREATE DATABASE API see the API Reference.
You can also use the UCASE function as follows, but note that DB2 performs a table scan instead of using an index for the select:
SELECT * FROM EMP WHERE UCASE(JOB) = 'NURSE'
The collating sequence for a database is specified at database creation time. Once the database has been created, the collating sequence cannot be changed.
The CREATE DATABASE API accepts a data structure called the Database Descriptor Block (SQLEDBDESC). You can define your own collating sequence within this structure.
To specify a collating sequence for a database:
The SQLEDBDESC structure contains:
Note: | These constants are defined in the SQLENV include file. |
Several sample collating sequences are provided (as include files) to facilitate database creation using the EBCDIC collating sequences instead of the default workstation collating sequence.
The collating sequences in these include files can be specified in the SQLDBUDC field of the SQLEDBDESC structure. They can also be used as models for the construction of other collating sequences.
For information on the include files that contain collating sequences, see the following sections:
Once a collating sequence is defined, all future character comparisons for that database will be performed with that collating sequence. Except for character data defined as FOR BIT DATA or BLOB data, the collating sequence will be used for all SQL comparisons and ORDER BY clauses, and also in setting up indexes and statistics. For more information on how the database collating sequence is used, see the section on String Comparisons in the SQL Reference, S10J-8165-00.
Potential problems may occur in the following cases:
A final point to remember is that the results of any sort based on a direct comparison of characters will only match the results of a query ordered using an identity collating sequence.
The application code page is derived from the active environment when the database connection is made. If the DB2CODEPAGE environment variable is set, its value is taken as the application code page. The database code page is derived from the value specified (explicitly or by default) at the time the database is created. The following defines how the active environment is determined in different operating environments, for example:
For a complete list of environment mappings for code page values, see the Table 30.
Locales are specific to UNIX-based operating systems. There are two locales:
When your program is started, it gets a default C locale. It does not get a copy of the environment locale. Your program has a few choices:
With UNIX, the active locale used by DB2 is determined from the LC_CTYPE portion of the locale. For details, see the NLS documentation for your operating system.
It is strongly recommended that applications be precompiled, bound, compiled, and executed using the same code page. This is because data conversions by the server can occur in both the bind and the execution phases. Users should ensure that the same conversion tables are used by binding and executing with the same active code page. For a discussion of how applications determine the active code page, see "Deriving Code Page Values".
Any external data obtained by the application will be assumed to be in the application code page. This includes data obtained from a file or from user input. Make sure that data from sources outside the application uses the same code page as the application.
If you use host variables that use graphic data in your C or C++ applications, there are special precompiler, application performance, and application design issues you need to consider. For a detailed discussion of these considerations, see "Handling Graphic Host Variables". If you deal with EUC code sets in your applications, refer to "Japanese and Traditional-Chinese EUC Code Set Considerations" for guidelines that you should consider.
The coding of SQL statements is not language dependent. The SQL keywords must be typed as shown in this book, although they may be typed in uppercase, lowercase, or mixed case. The names of database objects, host variables and program labels that occur in an SQL statement cannot contain characters outside the extended character set supported by your code page. See the SQL Reference for more information about extended character sets.
Constant character strings in static SQL statements are converted at bind time, from the application code page to the database code page, and will be used at execution time in this database code page representation. To avoid such conversions if they are not desired, you can use host variables in place of string constants.
The server does not convert file names. To code a file name, either use the ASCII invariant set, or provide the path in the hexadecimal values that are physically stored in the file system.
In a multi-byte environment, there are four characters which are considered special that do not belong to the invariant character set. These characters are:
The code points for each of these characters, by code page is as
follows:
Table 13. Code Points for Special Double-byte Characters
Code Page | Double-Byte Percentage | Double-Byte Underscore | Double-byte Space | Double-Byte Substitution Character |
---|---|---|---|---|
932 | X'8193' | X'8151' | X'8140' | X'FCFC' |
938 | X'8193' | X'8151' | X'8140' | X'FCFC' |
942 | X'8193' | X'8151' | X'8140' | X'FCFC' |
943 | X'8193' | X'8151' | X'8140' | X'FCFC' |
948 | X'8193' | X'8151' | X'8140' | X'FCFC' |
949 | X'A3A5' | X'A3DF' | X'A1A1' | X'AFFE' |
950 | X'A248' | X'A1C4' | X'A140' | X'C8FE' |
954 | X'A1F3' | X'A1B2' | X'A1A1' | X'F4FE' |
964 | X'A2E8' | X'A2A5' | X'A1A1' | X'FDFE' |
970 | X'A3A5' | X'A3DF' | X'A1A1' | X'AFFE' |
1381 | X'A3A5' | X'A3DF' | X'A1A1' | X'FEFE' |
1383 | X'A3A5' | X'A3DF' | X'A1A1' | X'A1A1' |
13488 | X'FF05' | X'FF3F' | X'3000' | X'FFFD' |
The DBCS substitution character is used to replace any EUC non-SBCS character as required. There is no concept of a three or four byte substitution character.
When coding stored procedures that will be running remotely, the following considerations apply:
By default, when you invoke DB2 DARI stored procedures and UDFs, they run under a default national language environment which may not match the database's national language environment. Consequently, using country or code page specific operations, such as the C wchar_t graphic host variables and functions, may not work as you expect. You need to ensure that, if applicable, the correct environment is initialized when you invoke the stored procedure or UDF.
Package names are determined when you invoke the PRECOMPILE PROGRAM command or API. By default, they are generated based on the first 8-bytes of the application program source file (without the file extension) and are folded to upper case. Optionally, a name can be explicitly defined. Regardless of the origin of a package name, if you are running in an unequal code page environment, the characters for your package names should be in the invariant character set. Otherwise you may experience problems related to the modification of your package name. The database manager will not be able to find the package for the application or a client-side tool will not display the right name for your package.
A package name modification due to character conversion will occur if any of the characters in the package name, or fragments of a character in a multi-byte code page, are not directly mapped to a valid character in the database code page. In such cases, a substitution character replaces the character that is not converted. After such a modification, the package name, when converted back to the application code page, may not match the original package name. An example of a case where this behavior is undesirable is when you use the DB2 Database Director to list and work with packages. Package names displayed may not match the expected names.
To avoid conversion problems with package names, ensure that only characters are used which are valid under both the application and database code pages.
At precompile/bind time, the precompiler is the executing application. The active code page when the database connection was made prior to the precompile request is used for precompiled statements, and any character data returned in the SQLCA.
At execution time, the active code page of the user application when a database connection is made is in effect for the duration of the connection. All data is interpreted based on this code page; this includes dynamic SQL statements, user input data, user output data, and character fields in the SQLCA.
Failure to follow these guidelines may produce unpredictable results. These conditions cannot be detected by the database manager, so no error or warning message will result. For example, a C application contains the following SQL statements operating against a table T1 with one column defined as C1 CHAR(20):
(0) EXEC SQL CONNECT TO GLOBALDB; (1) EXEC SQL INSERT INTO T1 VALUES ('a-constant'); strcpy(sqlstmt, "SELECT C1 FROM T1 WHERE C1='a-constant'); (2) EXEC SQL PREPARE S1 FROM :sqlstmt; Where: application code page at bind time = x application code page at execution time = y database code page = z
At bind time, 'a-constant' in statement (1) is converted from code page x to code page z. This conversion can be noted as (x>z).
At execution time, 'a-constant' (x>z) is inserted into the table when statement (1) is executed. However, the WHERE clause of statement (2) will be executed with 'a-constant' (y>z). If the code points in the constant are such that the two conversions (x>z and y>z) yield different results, the SELECT in statement (2) will fail to retrieve the data inserted by statement (1).
Ideally, for optimal performance, your applications should always use the same code page as your database. However, this is not always practical or possible. The DB2 products provide support for character conversion that allows your application and database to use different code pages. Characters from one code page must be mapped to the other code page in order to maintain meaning of the data.
Character conversion can occur in the following situations:
This database conversion will occur on the database server machine for both conversions from the application code page to the database code page and from the database code page to the application code page.
You can minimize or eliminate client/server character conversion in some situations. For example, you could:
If a Windows ODBC application is used with the IBM DB2 ODBC driver in Windows database client, this problem may be alleviated by the use of the TRANSLATEDLL and TRANSLATEOPTION keywords in the odbc.ini or db2cli.ini file.
Note: | The DB2 for OS/2 Version 1.0 or Version 1.2 database server does not support character conversion between different code pages. Ensure that the code pages on server and client are compatible. See Table 30 for the code page conversons that are supported. |
This data conversion will occur on the database client machine before the client accesses the database server. Additional data conversion may take place if the application is running in a code page that is different from the code page of the database (as stated in the previous point).
Data conversion, if any, also depends on how the import utility was called. See the Administration Guide for more information.
Character conversion will not occur for:
See the SQL Reference for unequal code page rules for assigning, comparing, and combining strings.
When your application converts from one code page to another, it is possible that one or more characters are not represented in the target code page. If this occurs, DB2 inserts a substitution character into the target string in place of the character that has no representation. The replacement character is then considered a valid part of the string. In situations where a substitution occurs, the SQLWARN10 indicator in the SQLCA is set to 'W'.
Note: | Any character conversions resulting from using the WCHARTYPE CONVERT precompiler option will not flag a warning if any substitutions take place. |
When data conversion occurs, conversion will take place from a source code page to a target code page.
The source code page is determined from the source of the data; data from the application has a source code page equal to the application code page, and data from the database has a source code page equal to the database code page.
The determination of target code page is more involved; where the data is to be placed, including rules for intermediate operations, is considered:
Exercise caution in situations where two conversion steps might occur. Make sure you follow the supported character conversions listed in Table 30 to avoid a possible loss of character data. Additionally, within each group, only characters which exist in both the source and target code page have meaningful conversions. Other characters are used as "substitutions" and are only useful for converting from the target code page back to the source code page (and may provide meaningless conversions in the second stage).
Table 30 shows the code page conversions that are supported. Any code page can be converted to any other code page that is listed in the same IBM-defined language group. For example, code page 437 can be converted to 37, 819, 850, 1051, 1252, or 1275.
Note: | Character string conversions between multi-byte code pages, for example DBCS and EUC, may result in either an increase or a decrease in the length of the string. |
When your application successfully completes an attempt to connect to a DB2 database server, you should consider the following fields in the returned SQLCA:
The considerations for graphic string data should not be a factor in unequal code page situations. Each string always has the same number of characters, regardless of whether the data is in the application or the database code page.
See "Unequal Code Page Situations" for information on dealing with unequal code page situations.
(6) However, a literal inserted into a column defined as FOR BIT DATA could be converted if that literal was part of an SQL statement which was converted.