Embedded SQL Programming Guide

National Language Support Considerations

This section describes National Language Support (NLS) support issues that you must consider for your applications. The major topics discussed are:

"Collating Sequences"
"Deriving Code Page Values"
"Deriving Locales in Application Programs"
"Programming Considerations"
"Conversion Between Different Code Pages"

Collating Sequences

The database manager compares character data using a collating sequence. This is an ordering for a set of characters that determines whether a particular character sorts higher, lower, or the same as another.
Note: Character string data defined with the FOR BIT DATA attribute, or BLOB data, is sorted using the binary sort sequence.

For example, a collating sequence can be used to indicate that lowercase and uppercase versions of a particular character are to be sorted equally.

The database manager allows databases to be created with custom collating sequences. The following sections help you determine and implement a particular collating sequence for a database.

Overview

In a database, each single-byte character is represented internally as a unique number between 0 and 255, (in hexadecimal notation, between X'00' and X'FF'). This number is referred to as the code point of the character. A collating sequence is a mapping between the code point and the desired position of each character in a sorted sequence. The numeric value of the position is called the weight of the character in the collating sequence. The simplest collating sequence is one where the weights are identical to the code points. This is called the identity sequence.

For example, consider the characters B (X'42'), and b (X'62'). If, according to the collating sequence table, they both have a sort weight of X'42' (B), then they collate the same. If the sort weight for B is X'9E' and the sort weight for b is X'9D', then b will be sorted before B. Actual weights depend on the collating sequence table used which depends on the code set and locale. Note that a collating sequence table is not the same as a code page table which defines code points.

Consider the following example. In ASCII, the characters A through Z are represented by X'41' through X'5A'. To describe a collating sequence where these are sorted in order, and consecutively (no intervening characters), you can write X'41', X'42', ...X'59', X'5A'.

For multi-byte characters, the hexadecimal value of the multi-byte character is also used as the weight. For example, X'8260', X'8261' are the code points for double byte character A and B. In this case, you can write X'8260', X'8261' as the collating sequence for double byte characters A and B. These are also the code points for A and B.

The values of the weights in a collating sequence need not be unique. For example, you could give uppercase letters and their lowercase equivalents the same weight.

Specifying the collating sequence can be simplified if a collating sequence provides weights for all 256 code points. The weight of each character can be determined using the code point of the character. This is the method used to specify a collating sequence for the database manager: a string of 256 bytes, where the nth byte (starting with 0) contains the weight of code point n.

In the case of multi-byte character sets, DB2 simply uses the code point as the collating value. Multi-byte characters therefore sort the way they appear in their code point table.

Character Comparisons

Once a collating sequence is established, character comparison is performed by comparing the weights of two characters, instead of directly comparing their code point values.

If weights that are not unique are used, characters that are not identical may compare equally. Because of this, string comparison must be a two-phase process:

Compare the characters of each string based on their weights.
If step 1 yielded equality, compare the characters of each string based on their code point values.

If the collating sequence contains 256 unique weights, only the first step is performed. If the collating sequence is the identity sequence only the second step is performed. In either case, there is a performance benefit.

For more information on character comparisons, see the SQL Reference.

Case Independent Comparisons

To perform character comparisons that are independent of whether they are upper or lower case, you can use the TRANSLATE function to select and compare mixed case column data by translating it to upper case, but only for the purposes of comparison. Consider the following data:

  Abel
  abels
  ABEL
  abel
  ab
  Ab

For the following select statement:

SELECT c1 FROM T1 WHERE TRANSLATE(c1) LIKE 'AB%'

you would receive the following results:

  ab
  Ab
  abel
  Abel
  ABEL
  abels

Note: You could also set the select as in the following view v1, and then make all your comparisons against the view (in upper case) and your inserts into the table in mixed case:

  CREATE VIEW v1 AS SELECT TRANSLATE(c1) FROM t1

At the database level, you can set the collating sequence as part of the CREATE DATABASE API. This allows you to decide if 'a' is processed before 'A', or if 'A' is processed after 'a', or if they are processed with equal weighting. This will make them equal when collating or sorting using the ORDER BY clause. If you have two values of 'a' and 'A', 'A' will always come before 'a', because in all senses they are equal, so the only difference upon which to sort is the hexadecimal value.

Thus if you issue SELECT c1 FROM t1 WHERE c1 LIKE 'ab%', you receive the following output:

  ab
  abel
  abels

If you issue SELECT c1 FROM t1 WHERE c1 LIKE 'A%', you receive the following output:

  Abel
  Ab
  ABEL

If you issue SELECT c1 FROM t1 ORDER BY c1, you receive the following:

  ab
  Ab
  abel
  Abel
  ABEL
  abels

Thus, you may want to consider using the scalar function TRANSLATE(), as well as the CREATE DATABASE API. Note that you can only specify a collating sequence using the CREATE DATABASE API. You cannot specify a collating sequence from the Command Line Processor. For information on the TRANSLATE() function, see the SQL Reference. For information on the CREATE DATABASE API see the API Reference.

You can also use the UCASE function as follows, but note that DB2 performs a table scan instead of using an index for the select:

    SELECT * FROM EMP WHERE UCASE(JOB) = 'NURSE'

Specifying a Collating Sequence

The collating sequence for a database is specified at database creation time. Once the database has been created, the collating sequence cannot be changed.

The CREATE DATABASE API accepts a data structure called the Database Descriptor Block (SQLEDBDESC). You can define your own collating sequence within this structure.

To specify a collating sequence for a database:

Pass the desired SQLEDBDESC structure, or
Pass a NULL pointer. The collating sequence of the operating system (based on current country code and code page) is used. This is the same as specifying SQLDBCSS equal to SQL_CS_SYSTEM (0).

The SQLEDBDESC structure contains:

SQLDBCSS

A 4-byte integer indicating the source of the database collating sequence. Valid values are:

SQL_CS_SYSTEM: The collating sequence of the operating system (based on current country code and code page) is used.
SQL_CS_USER: The collating sequence is specified by the value in the SQLDBUDC field.
SQL_CS_NONE: The collating sequence is the identity sequence. Strings are compared byte for byte, starting with the first byte, using a simple binary comparison.

Note:

These constants are defined in the SQLENV include file.

SQLDBUDC

A 256-byte field. The nth byte contains the sort weight of the nth character in the code page of the database. If SQLDBCSS is not equal to SQL_CS_USER, this field is ignored.

Sample Collating Sequences

Several sample collating sequences are provided (as include files) to facilitate database creation using the EBCDIC collating sequences instead of the default workstation collating sequence.

The collating sequences in these include files can be specified in the SQLDBUDC field of the SQLEDBDESC structure. They can also be used as models for the construction of other collating sequences.

For information on the include files that contain collating sequences, see the following sections:

For C/C++, "Include Files"
For COBOL, "Include Files"
For FORTRAN, "Include Files".

Other Concerns

Once a collating sequence is defined, all future character comparisons for that database will be performed with that collating sequence. Except for character data defined as FOR BIT DATA or BLOB data, the collating sequence will be used for all SQL comparisons and ORDER BY clauses, and also in setting up indexes and statistics. For more information on how the database collating sequence is used, see the section on String Comparisons in the SQL Reference, S10J-8165-00.

Potential problems may occur in the following cases:

An application merges sorted data from a database with application data that was sorted using a different collating sequence.
An application merges sorted data from one database with sorted data from another, but the databases have different collating sequences.
An application makes assumptions about sorted data that are not true for the relevant collating sequence. For example, numbers collating lower than alphabetics might or might not be true for a particular collating sequence.

A final point to remember is that the results of any sort based on a direct comparison of characters will only match the results of a query ordered using an identity collating sequence.

Deriving Code Page Values

The application code page is derived from the active environment when the database connection is made. If the DB2CODEPAGE environment variable is set, its value is taken as the application code page. The database code page is derived from the value specified (explicitly or by default) at the time the database is created. The following defines how the active environment is determined in different operating environments, for example:

UNIX: In UNIX-based environments, the active environment is determined from the locale environment variables, which include information about language, territory and code set.
OS/2: In OS/2, primary and secondary code pages are specified in the CONFIG.SYS file. You can use the chcp command to display and dynamically change code pages within a given session.
DOS: In DOS, the active code page is determined by the value specified in the COUNTRY command in the CONFIG.SYS file. You can use the chcp command to display and dynamically change code pages within a given session.
Macintosh: For the Macintosh operating system, if the DB2CODEPAGE environment variable is not set, the Macintosh code page is derived from the Regional version code from the installed script.
Windows: For Windows, if the DB2CODEPAGE environment variable is not set, the Windows code page is derived from the country ID, as specified in the iCountry value in the [intl] section of the Windows WIN.INI file.
Windows 95: For Windows 95, if the DB2CODEPAGE environment variable is not set, the Windows 95 code page is derived from the ANSI code page setting in the Registry.
Windows NT: For Windows NT, if the DB2CODEPAGE environment variable is not set, the Windows NT code page is derived from the ANSI code page setting in the Registry.

For a complete list of environment mappings for code page values, see the Table 30.

Deriving Locales in Application Programs

Locales are specific to UNIX-based operating systems. There are two locales:

The environment locale allows you to specify the language, currency symbol, and so on, that you want to use.
The program locale contains the current language, currency symbol, and so on, of a program that is executing.

When your program is started, it gets a default C locale. It does not get a copy of the environment locale. Your program has a few choices:

Ignore the environment locale. Your program could hard code some options. For example, your program could set the language to spanish with the setlocale() function.
Copy the environment locale to the program locale.
Ignore the environment locale. Use whatever defaults you get from the operating system.

How DB2 Derives Locales

With UNIX, the active locale used by DB2 is determined from the LC_CTYPE portion of the locale. For details, see the NLS documentation for your operating system.

If LC_CTYPE of the program locale has a value other than that of 'C', DB2 will use this value to determine the application code page by mapping it to its corresponding code page.
If LC_CTYPE has the value of 'C' (the 'C' locale), DB2 will set the program locale according to the environment locale using the setlocale() function.
If LC_CTYPE still has a value of 'C', DB2 will use its default locale for that platform. See either the Building Applications for Windows and OS/2 Environments or the Building Applications for UNIX Environments book for information on the default locale for that platform.
If LC_CTYPE's value is no longer 'C', its new value will be used to map to a corresponding code page.

Programming Considerations

It is strongly recommended that applications be precompiled, bound, compiled, and executed using the same code page. This is because data conversions by the server can occur in both the bind and the execution phases. Users should ensure that the same conversion tables are used by binding and executing with the same active code page. For a discussion of how applications determine the active code page, see "Deriving Code Page Values".

Any external data obtained by the application will be assumed to be in the application code page. This includes data obtained from a file or from user input. Make sure that data from sources outside the application uses the same code page as the application.

If you use host variables that use graphic data in your C or C++ applications, there are special precompiler, application performance, and application design issues you need to consider. For a detailed discussion of these considerations, see "Handling Graphic Host Variables". If you deal with EUC code sets in your applications, refer to "Japanese and Traditional-Chinese EUC Code Set Considerations" for guidelines that you should consider.

Coding SQL Statements

The coding of SQL statements is not language dependent. The SQL keywords must be typed as shown in this book, although they may be typed in uppercase, lowercase, or mixed case. The names of database objects, host variables and program labels that occur in an SQL statement cannot contain characters outside the extended character set supported by your code page. See the SQL Reference for more information about extended character sets.

Constant character strings in static SQL statements are converted at bind time, from the application code page to the database code page, and will be used at execution time in this database code page representation. To avoid such conversions if they are not desired, you can use host variables in place of string constants.

The server does not convert file names. To code a file name, either use the ASCII invariant set, or provide the path in the hexadecimal values that are physically stored in the file system.

In a multi-byte environment, there are four characters which are considered special that do not belong to the invariant character set. These characters are:

The double-byte percentage and double-byte underscore characters used in LIKE processing. For further details concerning LIKE, refer to the SQL Reference.
The double-byte space character, used for, among other things, blank padding in graphic strings.
The double-byte substitution character, used as a replacement during character conversion when no mapping exists between a source code page and a target code page.

The code points for each of these characters, by code page is as follows:

Table 13. Code Points for Special Double-byte Characters

Code Page Double-Byte Percentage Double-Byte Underscore Double-byte Space Double-Byte Substitution Character
932 X'8193' X'8151' X'8140' X'FCFC'
938 X'8193' X'8151' X'8140' X'FCFC'
942 X'8193' X'8151' X'8140' X'FCFC'
943 X'8193' X'8151' X'8140' X'FCFC'
948 X'8193' X'8151' X'8140' X'FCFC'
949 X'A3A5' X'A3DF' X'A1A1' X'AFFE'
950 X'A248' X'A1C4' X'A140' X'C8FE'
954 X'A1F3' X'A1B2' X'A1A1' X'F4FE'
964 X'A2E8' X'A2A5' X'A1A1' X'FDFE'
970 X'A3A5' X'A3DF' X'A1A1' X'AFFE'
1381 X'A3A5' X'A3DF' X'A1A1' X'FEFE'
1383 X'A3A5' X'A3DF' X'A1A1' X'A1A1'
13488 X'FF05' X'FF3F' X'3000' X'FFFD'

Code Page	Double-Byte Percentage	Double-Byte Underscore	Double-byte Space	Double-Byte Substitution Character
932	X'8193'	X'8151'	X'8140'	X'FCFC'
938	X'8193'	X'8151'	X'8140'	X'FCFC'
942	X'8193'	X'8151'	X'8140'	X'FCFC'
943	X'8193'	X'8151'	X'8140'	X'FCFC'
948	X'8193'	X'8151'	X'8140'	X'FCFC'
949	X'A3A5'	X'A3DF'	X'A1A1'	X'AFFE'
950	X'A248'	X'A1C4'	X'A140'	X'C8FE'
954	X'A1F3'	X'A1B2'	X'A1A1'	X'F4FE'
964	X'A2E8'	X'A2A5'	X'A1A1'	X'FDFE'
970	X'A3A5'	X'A3DF'	X'A1A1'	X'AFFE'
1381	X'A3A5'	X'A3DF'	X'A1A1'	X'FEFE'
1383	X'A3A5'	X'A3DF'	X'A1A1'	X'A1A1'
13488	X'FF05'	X'FF3F'	X'3000'	X'FFFD'

EUC Considerations

The DBCS substitution character is used to replace any EUC non-SBCS character as required. There is no concept of a three or four byte substitution character.

Coding Remote Stored Procedures and UDFs

When coding stored procedures that will be running remotely, the following considerations apply:

Data in a stored procedure must be in the database code page.
Data passed to or from a stored procedure using an SQLDA with a character data type must really contain character data. Numeric data and data structures must never be passed with a character type if the client application code page is different from the database code page. This is because the server will convert all character data in an SQLDA. To avoid character conversion, you can pass data by defining it in binary string format by using a data type of BLOB or by defining the character data as FOR BIT DATA.

By default, when you invoke DB2 DARI stored procedures and UDFs, they run under a default national language environment which may not match the database's national language environment. Consequently, using country or code page specific operations, such as the C wchar_t graphic host variables and functions, may not work as you expect. You need to ensure that, if applicable, the correct environment is initialized when you invoke the stored procedure or UDF.

Package Name Considerations in Mixed Code Page Environments

Package names are determined when you invoke the PRECOMPILE PROGRAM command or API. By default, they are generated based on the first 8-bytes of the application program source file (without the file extension) and are folded to upper case. Optionally, a name can be explicitly defined. Regardless of the origin of a package name, if you are running in an unequal code page environment, the characters for your package names should be in the invariant character set. Otherwise you may experience problems related to the modification of your package name. The database manager will not be able to find the package for the application or a client-side tool will not display the right name for your package.

A package name modification due to character conversion will occur if any of the characters in the package name, or fragments of a character in a multi-byte code page, are not directly mapped to a valid character in the database code page. In such cases, a substitution character replaces the character that is not converted. After such a modification, the package name, when converted back to the application code page, may not match the original package name. An example of a case where this behavior is undesirable is when you use the DB2 Database Director to list and work with packages. Package names displayed may not match the expected names.

To avoid conversion problems with package names, ensure that only characters are used which are valid under both the application and database code pages.

Precompiling and Binding

At precompile/bind time, the precompiler is the executing application. The active code page when the database connection was made prior to the precompile request is used for precompiled statements, and any character data returned in the SQLCA.

Executing an Application

At execution time, the active code page of the user application when a database connection is made is in effect for the duration of the connection. All data is interpreted based on this code page; this includes dynamic SQL statements, user input data, user output data, and character fields in the SQLCA.

A Note of Caution

Failure to follow these guidelines may produce unpredictable results. These conditions cannot be detected by the database manager, so no error or warning message will result. For example, a C application contains the following SQL statements operating against a table T1 with one column defined as C1 CHAR(20):

     (0)  EXEC SQL CONNECT TO GLOBALDB;
     (1)  EXEC SQL INSERT INTO T1 VALUES ('a-constant');
          strcpy(sqlstmt, "SELECT C1 FROM T1 WHERE C1='a-constant');
     (2)  EXEC SQL PREPARE S1 FROM :sqlstmt;
Where:
    application code page at bind time = x
    application code page at execution time = y
    database code page = z

At bind time, 'a-constant' in statement (1) is converted from code page x to code page z. This conversion can be noted as (x>z).

At execution time, 'a-constant' (x>z) is inserted into the table when statement (1) is executed. However, the WHERE clause of statement (2) will be executed with 'a-constant' (y>z). If the code points in the constant are such that the two conversions (x>z and y>z) yield different results, the SELECT in statement (2) will fail to retrieve the data inserted by statement (1).

Conversion Between Different Code Pages

Ideally, for optimal performance, your applications should always use the same code page as your database. However, this is not always practical or possible. The DB2 products provide support for character conversion that allows your application and database to use different code pages. Characters from one code page must be mapped to the other code page in order to maintain meaning of the data.

When Does Character Conversion Occur?

Character conversion can occur in the following situations:

When a client or application accessing a database is running in a code page that is different from the code page of the database.

This database conversion will occur on the database server machine for both conversions from the application code page to the database code page and from the database code page to the application code page.

You can minimize or eliminate client/server character conversion in some situations. For example, you could:

Create a Windows NT database using code page 850 to match an OS/2 and Windows client application environment that predominately uses code page 850,
If a Windows ODBC application is used with the IBM DB2 ODBC driver in Windows database client, this problem may be alleviated by the use of the TRANSLATEDLL and TRANSLATEOPTION keywords in the odbc.ini or db2cli.ini file.
Create a DB2 for AIX database using code page 850 to match an OS/2 and DOS client application environment that predominately uses code page 850.

Note:

The DB2 for OS/2 Version 1.0 or Version 1.2 database server does not support character conversion between different code pages. Ensure that the code pages on server and client are compatible. See Table 30 for the code page conversons that are supported.

When a client or application importing a PC/IXF file runs in a code page that is different from the file being imported.
This data conversion will occur on the database client machine before the client accesses the database server. Additional data conversion may take place if the application is running in a code page that is different from the code page of the database (as stated in the previous point).
Data conversion, if any, also depends on how the import utility was called. See the Administration Guide for more information.
When DB2 Connect is used to access data on a DRDA server. In this case character conversion occurs by the receiver of data, as defined by the DRDA rules. For example, data that is sent to DB2 for MVS/ESA is converted to the appropriate MVS coded character set identifier (CCSID) by DB2 for MVS/ESA. The data sent back to the DB2 Connect machine from DB2 for MVS/ESA is converted by DB2 Connect. For more information, see the DB2 Connect Enterprise Edition Quick Beginnings for your platform.

Character conversion will not occur for:

File names. You should either use the ASCII invariant set for file names or provide the file name in the hexadecimal values that are physically stored in the file system. Note that if you include a file name as part of a SQL statement, it gets converted as part of the statement conversion.
Data that is targeted for or comes from a column assigned the FOR BIT DATA attribute, or data used in an SQL operation whose result is FOR BIT or BLOB data. In these cases, the data is treated as a byte stream and no conversion occurs. (6)
See the SQL Reference for unequal code page rules for assigning, comparing, and combining strings.
Access to a DB2 for OS/2 Version 1.0 or Version 1.2 database server.
A DB2 product or platform that does not support, or that does not have support installed, for the desired combination of code pages. In this case, an SQLCODE -332 (SQLSTATE 57017) is returned when you try to run your application.

Character Substitutions During Conversions

When your application converts from one code page to another, it is possible that one or more characters are not represented in the target code page. If this occurs, DB2 inserts a substitution character into the target string in place of the character that has no representation. The replacement character is then considered a valid part of the string. In situations where a substitution occurs, the SQLWARN10 indicator in the SQLCA is set to 'W'.
Note: Any character conversions resulting from using the WCHARTYPE CONVERT precompiler option will not flag a warning if any substitutions take place.

Supported Character Conversions

When data conversion occurs, conversion will take place from a source code page to a target code page.

The source code page is determined from the source of the data; data from the application has a source code page equal to the application code page, and data from the database has a source code page equal to the database code page.

The determination of target code page is more involved; where the data is to be placed, including rules for intermediate operations, is considered:

If the data is moved directly from an application into a database, with no intervening operations, the target code page is the database code page.
If the data is being imported into a database from a PC/IXF file, there are two character conversion steps:
1. From the PC/IXF file code page (source code page) to the application code page (target code page)
2. From the application code page (source code page) to the database code page (target code page).
Exercise caution in situations where two conversion steps might occur. Make sure you follow the supported character conversions listed in Table 30 to avoid a possible loss of character data. Additionally, within each group, only characters which exist in both the source and target code page have meaningful conversions. Other characters are used as "substitutions" and are only useful for converting from the target code page back to the source code page (and may provide meaningless conversions in the second stage).
If the data is derived from operations performed on character data, where the source may be any of the application code page, the database code page, FOR BIT DATA, or for BLOB data, data conversion is based on a set of rules. Some or all of the data items may have to be converted to an intermediate result, before the final target code page can be determined. See the SQL Reference for a summary of these rules and for specific application with individual operators and predicates.

Table 30 shows the code page conversions that are supported. Any code page can be converted to any other code page that is listed in the same IBM-defined language group. For example, code page 437 can be converted to 37, 819, 850, 1051, 1252, or 1275.
Note: Character string conversions between multi-byte code pages, for example DBCS and EUC, may result in either an increase or a decrease in the length of the string.

Character Conversion Expansion Factor

When your application successfully completes an attempt to connect to a DB2 database server, you should consider the following fields in the returned SQLCA:

The second token in the SQLERRMC field (tokens are separated by X'FF') indicates the code page of the database. The ninth token in the SQLERRMC field indicates the code page of the application. Querying the application's code page and comparing it to the database's code page informs the application whether it has established a connection which will undergo character conversions.
The first and second entries in the SQLERRD array. SQLERRD(1) contains an integer value equal to the maximum expected expansion or contraction factor for the length of mixed character data (CHAR data types) when converted to the database code page from the application code page. SQLERRD(2) contains an integer value equal to the maximum expected expansion or contraction factor for the length of mixed character data (CHAR data types) when converted to the application code page from the database code page. A value of 0 or 1 indicates no expansion; a value greater than 1 indicates a possible expansion in length; a negative value indicates a possible contraction. Refer to the SQL Reference for details on using the CONNECT statement.

The considerations for graphic string data should not be a factor in unequal code page situations. Each string always has the same number of characters, regardless of whether the data is in the application or the database code page.

See "Unequal Code Page Situations" for information on dealing with unequal code page situations.

Footnotes:

(6) However, a literal inserted into a column defined as FOR BIT DATA could be converted if that literal was part of an SQL statement which was converted.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]