Embedded SQL Programming Guide

Japanese and Traditional-Chinese EUC Code Set Considerations

Extended UNIX Code (EUC) denotes a set of general encoding rules that can support from one to four character sets in UNIX-based operating environments. The encoding rules are based on the ISO 2022 definition for encoding 7-bit and 8-bit data in which control characters are used to separate some of the character sets. EUC is a means of specifying a collection of code sets rather than a code set encoding scheme. A code set based on EUC conforms to the EUC encoding rules but also identifies the specific character sets associated with the specific instances. For example, the IBM-eucJP code set for Japanese refers to the encoding of the Japanese Industrial Standard characters according to the EUC encoding rules. For a list of code pages which are supported, refer to your platform's Quick Beginnings book.

Database and client application support for graphic (pure double-byte character) data is limited when running under EUC code pages whose character encoding is greater than two bytes in length. The DB2 Universal Database products implement strict rules for graphic data that require all characters to be exactly two bytes wide. These rules exclude many characters from both the Japanese and Traditional-Chinese EUC code pages. To overcome this situation, support is provided at both the application level and the database level to represent Japanese and Traditional-Chinese EUC graphic data using another encoding scheme.

A database created under either Japanese or Traditional-Chinese EUC code pages will actually store and manipulate graphic data using the ISO 10646 UCS-2 code set, a double-byte encoding scheme which is a proper subset of the full ISO 10646 Unicode standard. Similarly, an application running under those code pages will send graphic data to the database server as UCS-2 encoded data. With this support, applications running under EUC code pages can access the same types of data as those running under DBCS code pages. For additional information regarding EUC environments, refer to the SQL Reference. The IBM-defined code page identifier associated with UCS-2 encoded data for the DB2 common server products is 13488. The support for UCS-2 encoded data is at Level 1 of the standard.

The ISO 10646 standard specifies the encoding of a number of combining characters that are necessary in several scripts, such as Indic, Thai, Arabic and Hebrew. These characters can also be used to generate characters productively in Latin, Cyrillic, and Greek scripts. However, their presence makes it possible to code the same text in more than one way. Although the coding is unambiguous and data integrity is preserved, processing text that contains combining characters is more complex. To provide for conformance of applications that choose not to deal with combining characters, ISO 10646 defines three implementation levels:

Level 1. Does not allow combining characters.
Level 2. Allows combining marks from Thai, Indic, Hebrew and Arabic scripts.
Level 3. Allows all combining marks, including the ones for Latin, Cyrillic, and Greek.
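The effect of combining characters can be illustrated with a short Python sketch. It is not part of DB2. It uses the standard unicodedata module, and it approximates "combining character" by a nonzero canonical combining class, which is an assumption and not the exact ISO 10646 Level 1 definition. The helper name has_combining_characters is hypothetical.

```python
import unicodedata


def has_combining_characters(text):
    """Return True if any character has a nonzero canonical combining class.

    Assumption: this approximates the ISO 10646 notion of a combining
    character; it is not the normative Level 1 test.
    """
    return any(unicodedata.combining(ch) != 0 for ch in text)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The two strings display the same but are coded differently.
print(precomposed == decomposed)                 # False
print(has_combining_characters(precomposed))     # False
print(has_combining_characters(decomposed))      # True

# Composing the sequence recovers the single-code-point form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

An application that stores data at Level 1 could apply a check of this kind before sending graphic data to the database.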

For more information on the Unicode standard, see Unicode 1.0 Volumes 1 and 2 from Addison-Wesley.

If you are working with applications or databases that use these character sets, you may need to deal with UCS-2 encoded data. Converting UCS-2 graphic data to the application's EUC code page can increase the length of the data. For details of data expansion, see "Character Conversion Expansion Factor". When large amounts of data are being displayed, it may be necessary to allocate buffers, convert, and display the data in a series of fragments.
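The fragment-by-fragment approach can be sketched in Python as follows. This is an illustration under stated assumptions, not DB2 code: Python's euc_jp codec stands in for the DB2 conversion tables, the UCS-2 data is assumed to be big-endian, and the function name euc_fragments is hypothetical. Because every UCS-2 character is exactly two bytes, the input can be split at any even offset without cutting a character in half.

```python
def euc_fragments(ucs2_data, chars_per_fragment=1024, euc_codec="euc_jp"):
    """Yield EUC-encoded fragments of big-endian UCS-2 data.

    Assumption: Python's codec tables stand in for the DB2 conversion tables.
    """
    if len(ucs2_data) % 2 != 0:
        raise ValueError("UCS-2 data must contain an even number of bytes")
    step = 2 * chars_per_fragment  # two bytes per UCS-2 character
    for start in range(0, len(ucs2_data), step):
        fragment = ucs2_data[start:start + step]
        # UCS-2 has no surrogate pairs, so utf-16-be decodes it unchanged.
        yield fragment.decode("utf-16-be").encode(euc_codec)


# U+4E02 is assumed to belong to JIS X 0212, which EUC-JP encodes in
# three bytes; U+4E01 is a JIS X 0208 character encoded in two bytes.
text = "\u4e02\u4e01"
ucs2 = text.encode("utf-16-be")
euc = b"".join(euc_fragments(ucs2, chars_per_fragment=1))

# The EUC form is longer than the UCS-2 form, so output buffers must be
# sized for the expanded length, not the source length.
print(len(ucs2), len(euc))
```

Choosing chars_per_fragment bounds the size of each output buffer: a fragment of n UCS-2 characters needs at most n times the largest EUC character width.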

The following sections discuss how to handle data in this environment. For these sections, the term EUC is used to refer only to Japanese and Traditional-Chinese EUC character sets. Note that the discussions do not apply to DB2 Korean or Simplified-Chinese EUC support since graphic data in these character sets is represented using the EUC encoding.

Mixed EUC and Double-Byte Client and Database Considerations

The administration of database objects in mixed EUC and double-byte code page environments is complicated by the possible expansion or contraction in the length of object names as a result of conversions between the client and database code pages. In particular, many administrative commands and utilities have documented limits on the lengths of the character strings that they accept as input parameters or return as output parameters. These limits are typically enforced at the client, unless documented otherwise. For example, the limit for a table name is 18 bytes. A character string that is 18 bytes under a double-byte code page may be larger, say 21 bytes, under an EUC code page. This hypothetical 21-byte table name would be considered invalid by such commands as REORGANIZE TABLE if used as an input parameter, despite being valid in the target double-byte database. Similarly, the maximum permitted length of output parameters may be exceeded after conversion from the database code page to the application code page. This may cause either a conversion error or truncation of the output data.
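The 18-byte versus 21-byte case can be reproduced with a small Python sketch. It is only an illustration: Python's shift_jis and euc_jp codecs are assumed stand-ins for a Japanese double-byte code page and a Japanese EUC code page, and the table name and helper names are hypothetical. Half-width katakana occupy one byte under Shift-JIS and two bytes under EUC-JP, which produces the expansion.

```python
NAME_LIMIT_BYTES = 18  # table-name limit quoted in the text


def encoded_length(name, codec):
    """Return the length in bytes of name under the given code page."""
    return len(name.encode(codec))


def within_limit(name, codec, limit=NAME_LIMIT_BYTES):
    """Return True if name fits the byte limit under the given code page."""
    return encoded_length(name, codec) <= limit


# Hypothetical table name: 15 ASCII characters plus 3 half-width katakana.
name = "TABLE_NAME_0001" + "\uff71\uff72\uff73"

print(encoded_length(name, "shift_jis"))  # 18
print(encoded_length(name, "euc_jp"))     # 21

print(within_limit(name, "shift_jis"))    # True: valid in the database
print(within_limit(name, "euc_jp"))       # False: rejected at an EUC client
```

A check like this, run against both code pages when objects are named, shows in advance which names an EUC client would refuse.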

If you expect to use administrative commands and utilities extensively in a mixed EUC and double-byte environment, you should define database objects and their associated data with the possibility of length expansion past the supported limits in mind. Administration of an EUC database from a double-byte client will face fewer restrictions than administration of a double-byte database from an EUC client. Double-byte character strings will always be equal in length to or shorter than the corresponding EUC character string. This will generally lead to fewer problems caused by enforcing the character string length limits.

Note: In the case of SQL statements, validation of input parameters is not conducted until the entire statement has been converted to the database code page. Thus you can use character strings that are technically longer than allowed when they are represented in the client code page, but that meet the length requirements when represented in the database code page.

Considerations for Traditional-Chinese Users

Due to the standards definition for Traditional-Chinese, there is a side effect that you may encounter when you convert some characters between double-byte or EUC code pages and UCS-2. There are 189 characters (consisting of 187 radicals and 2 numbers) that, when converted, share the same UCS-2 code point as another character in the code set. When these characters are converted back to double-byte or EUC, they are converted to the code point of the corresponding ideograph, with which they share the UCS-2 code point, rather than back to their original code point. When displayed, the character appears the same, but has a different code point. Depending on your application's design, you may have to take this behavior into account.

As an example, consider what happens to code point A7A1 in EUC code page 964 when it is converted to UCS-2 and then converted back to the original code page, EUC 964:

  EUC 964                UCS-2                EUC 964

   A7A1 ---------+
                 +------> 4E01 ---------------> C4A1
   C4A1 ---------+

Thus, the original code points A7A1 and C4A1 both end up as code point C4A1 after conversion.
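An application that needs to know which code points change during the round trip could test them against the conversion tables. The following Python sketch shows the idea with a two-entry stand-in table. The dictionaries contain only the code points from the example above and are not the real conversion tables, and the names TO_UCS2, FROM_UCS2, round_trip, and survives_round_trip are hypothetical.

```python
# Stand-in tables holding only the values from the example in the text.
# Assumption: real tables would be loaded from the published conversion data.
TO_UCS2 = {0xA7A1: 0x4E01, 0xC4A1: 0x4E01}   # EUC -> UCS-2 (many-to-one)
FROM_UCS2 = {0x4E01: 0xC4A1}                 # UCS-2 -> EUC (one choice only)


def round_trip(euc_code_point):
    """Convert an EUC code point to UCS-2 and back to EUC."""
    return FROM_UCS2[TO_UCS2[euc_code_point]]


def survives_round_trip(euc_code_point):
    """Return True if the code point is unchanged by the round trip."""
    return round_trip(euc_code_point) == euc_code_point


for cp in sorted(TO_UCS2):
    print(f"{cp:04X} -> {TO_UCS2[cp]:04X} -> {round_trip(cp):04X}")
# A7A1 -> 4E01 -> C4A1
# C4A1 -> 4E01 -> C4A1
```

Comparisons that rely on exact code points, such as key lookups on data that has passed through the database, are the cases most likely to be affected.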

If you require the conversion tables between EUC code page 964 (Traditional-Chinese EUC) or code page 950 (Traditional-Chinese Big-5) and UCS-2, see the Product and Service Technical Library Technical home page at the following URL: http://www.software.ibm.com/data/db2/support/servinfo/
