RFC1922 - Chinese Character Encoding for Internet Messages

Network Working Group                   HF. Zhu
Request for Comments: 1922              Tsinghua U
Category: Informational                 DY. Hu
                                        Tsinghua U
                                        ZG. Wang
                                        CITS
                                        TC. Kao
                                        III
                                        WCH. Chang
                                        III
                                        M. Crispin
                                        U Washington
                                        March 1996

Chinese Character Encoding for Internet Messages

Status of this Memo

This memo provides information for the Internet community. It does

not specify an Internet standard. Distribution of this memo is

unlimited.

Abstract

This memo describes methods of transporting Chinese characters in

Internet services which transport text, such as electronic mail

[RFC-822], network news [RFC-1036], telnet [RFC-854] and the World

Wide Web [RFC-1866].

Introduction

As Internet use spreads among more and more Chinese people in the

world, the need has increased for the ability to send documents

containing Chinese characters on the Internet. The methods described

in this document provide means of transporting existing Chinese

character sets as well as leaving space for future extension.

This document describes two encodings, ISO-2022-CN and

ISO-2022-CN-EXT. These are designed with interoperability in mind

and are encouraged in this document for current Chinese interchange;

they are 7-bit, support both simplified and traditional characters

using both GB and CNS/Big5, and do not impose any unusual quoting

requirements on ASCII characters.

As important related issues, this document gives detailed

descriptions of the two encodings CN-GB and CN-Big5, and a brief

description of ISO/IEC 10646 [ISO-10646]. CN-GB and CN-Big5 are

currently used as the internal codes for Chinese documents.

ISO-10646 is the universal multi-octet character set defined by ISO;

we feel that in the future it may become the preferred technology for

Chinese documents and electronic mail when it is widely available.

Specification

1. 7-bit Chinese encodings: ISO-2022-CN and ISO-2022-CN-EXT

1.1. Description

ISO-2022-CN is based on ISO 2022 [ISO-2022], similar to earlier work

on ISO-2022-JP [RFC-1468] and ISO-2022-KR [RFC-1557] for the Japanese

and Korean languages respectively. It is 7-bit, and supports both

simplified Chinese characters using GB 2312-80 [GB-2312] and

traditional Chinese characters using the first two planes of CNS

11643 [CNS-11643], as well as ASCII [ASCII] characters.

ISO-2022-CN-EXT is a superset of ISO-2022-CN that additionally

supports other GB character sets and planes of CNS 11643.

Since ISO-2022-CN and ISO-2022-CN-EXT are 7-bit encodings, they do

not require the 8-bit SMTP extensions. ISO-2022-CN supports all the

Chinese characters that appear in Big5 [BIG5].

1.2. ISO-2022-CN

The starting code of ISO-2022-CN is ASCII. ASCII and Chinese

characters are distinguished by designations (ESC sequences) and

shift functions.

Designations define the Chinese character sets used in the text.

There are three kinds of designations: SOdesignation, SS2designation

and SS3designation.

The SOdesignation is in the form ESC $ ) <F>, where <F> is the "final

character" assigned to the character set by ISO (refer to the ISO

registry [ISOREG] for more details). The SS2designation is in the

form ESC $ * <F>, and the SS3designation is in the form ESC $ + <F>.

A designation overrides any previous designation for subsequent bytes

in the text.
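
The designation forms above can be sketched in a few lines of Python.
This is only an illustration, not part of the specification: the names
designation() and current_designations() are hypothetical, and the
scan assumes well-formed input in which the ESC byte never occurs
inside a two-byte character.

```python
ESC = 0x1B

# Intermediate byte that ties a designated set to a shift function:
# ')' for SOdesignation, '*' for SS2designation, '+' for SS3designation.
INTERMEDIATE = {"SO": b")", "SS2": b"*", "SS3": b"+"}


def designation(kind: str, final: str) -> bytes:
    """Build ESC $ <intermediate> <F> for the given designation kind."""
    return bytes([ESC]) + b"$" + INTERMEDIATE[kind] + final.encode("ascii")


def current_designations(data: bytes) -> dict:
    """Return the last final character seen for each designation kind.

    A later designation of the same kind overrides the earlier one.
    """
    kind_by_byte = {v[0]: k for k, v in INTERMEDIATE.items()}
    active = {}
    i = 0
    while i + 3 < len(data):
        if data[i] == ESC and data[i + 1] == 0x24 and data[i + 2] in kind_by_byte:
            active[kind_by_byte[data[i + 2]]] = chr(data[i + 3])
            i += 4
        else:
            i += 1
    return active


sample = bytes.fromhex("1b242941 0e 3d3b 1b242947 4728 0f")
print(designation("SO", "A").hex(" "))
print(current_designations(sample))
```

In the sample, the second SOdesignation (final character G) replaces
the first (final character A), so only G is reported for SO.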

There are four kinds of shifts: SI, SO, SS2 and SS3. Shift functions

specify how to interpret the subsequent bytes.

The shift SI (one byte with hexadecimal value 0F) declares that

subsequent bytes are interpreted in ASCII.

The shift SO (one byte with hexadecimal value 0E) declares that

subsequent bytes are interpreted in the character set defined by

SOdesignation.

The shift SS2 (two bytes with hexadecimal values 1B 4E) declares that

the subsequent TWO bytes are interpreted in the character set defined

by SS2designation, after which the previous interpretation (from SI

or SO) is restored.

The shift SS3 (two bytes with hexadecimal values 1B 4F) declares that

the subsequent TWO bytes are interpreted in the character set defined

by SS3designation, after which the previous interpretation (from SI

or SO) is restored.
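
The difference between the locking shifts (SI, SO) and the single
shifts (SS2, SS3) can be sketched as a small state machine. The
function name split_units() is hypothetical, and the sketch assumes
that designations have already been removed from the input and that
every set shifted in with SO uses two bytes per character.

```python
SI, SO, ESC = 0x0F, 0x0E, 0x1B

# Second byte of the two-byte single-shift functions (1B 4E and 1B 4F).
SINGLE_SHIFTS = {0x4E: "SS2", 0x4F: "SS3"}


def split_units(data: bytes) -> list:
    """Split data into (shift, bytes) units.

    SI and SO change the interpretation of all following bytes, while
    SS2 and SS3 affect only the next two bytes, after which the SI or
    SO state in force before them applies again.
    """
    locked = "ASCII"          # state set by the most recent SI or SO
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SI:
            locked = "ASCII"
            i += 1
        elif b == SO:
            locked = "SO"
            i += 1
        elif b == ESC and i + 1 < len(data) and data[i + 1] in SINGLE_SHIFTS:
            units.append((SINGLE_SHIFTS[data[i + 1]], data[i + 2:i + 4]))
            i += 4            # the locked state is left untouched
        elif locked == "SO":
            units.append(("SO", data[i:i + 2]))
            i += 2
        else:
            units.append(("ASCII", data[i:i + 1]))
            i += 1
    return units


# 'a', SO, one SO character, SS2 + one character, one SO character, SI, 'b'
stream = bytes.fromhex("61 0e 3d3b 1b4e 2121 3b3b 0f 62")
for unit in split_units(stream):
    print(unit)
```

Note that the character after the SS2 unit is again read through SO
without a new shift, which is the "previous interpretation is
restored" behaviour described above.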

The escape sequences, shift functions and character sets used in an

ISO-2022-CN text are as follows:

       Character sets                  Shift in with
--------------------------------------------------------------------
       ASCII                           SI
       GB 2312, CNS 11643-plane-1      SO
       CNS 11643-plane-2               SS2

ESC $ ) A   Indicates the bytes following SO are Chinese
            characters as defined in GB 2312-80, until
            another SOdesignation appears

ESC $ ) G   Indicates the bytes following SO are as defined
            in CNS 11643-plane-1, until another
            SOdesignation appears

ESC $ * H   Indicates the two bytes immediately following
            SS2 form a Chinese character as defined in CNS
            11643-plane-2, until another SS2designation
            appears

If there are any GB or CNS characters on a line, a designation for

the corresponding character set must be used so that each line has

its own character set information and the text can be displayed

correctly when scrolled back in a window. Also, there must be a shift

to ASCII (SI) before the end of the line (i.e., before the CRLF). In

other words, each line starts in ASCII, and ends in ASCII.
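
The per-line rules above lend themselves to a simple check. The
following Python sketch is an illustration under stated assumptions:
line_problems() is a hypothetical helper, it receives one line without
its CRLF, and it treats every character shifted in with SO as two
bytes. It does not validate final characters or byte ranges.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F
KIND_BY_INTERMEDIATE = {0x29: "SO", 0x2A: "SS2", 0x2B: "SS3"}  # ')' '*' '+'
KIND_BY_SINGLE_SHIFT = {0x4E: "SS2", 0x4F: "SS3"}


def line_problems(line: bytes) -> list:
    """Return a list of rule violations found in a single line."""
    problems = []
    designated = set()        # designation kinds seen so far on this line
    in_chinese = False        # True between SO and SI
    i = 0
    while i < len(line):
        b = line[i]
        nxt = line[i + 1] if i + 1 < len(line) else None
        if (b == ESC and nxt == 0x24 and i + 3 < len(line)
                and line[i + 2] in KIND_BY_INTERMEDIATE):
            designated.add(KIND_BY_INTERMEDIATE[line[i + 2]])
            i += 4
        elif b == ESC and nxt in KIND_BY_SINGLE_SHIFT:
            kind = KIND_BY_SINGLE_SHIFT[nxt]
            if kind not in designated:
                problems.append(f"{kind} used without a designation on this line")
            i += 4            # shift function plus one two-byte character
        elif b == SO:
            if "SO" not in designated:
                problems.append("SO used without a designation on this line")
            in_chinese = True
            i += 1
        elif b == SI:
            in_chinese = False
            i += 1
        else:
            i += 2 if in_chinese else 1
    if in_chinese:
        problems.append("line does not shift back to ASCII (SI) before its end")
    return problems


good = bytes.fromhex("1b242941 0e 3d3b3b3b 0f")
bad = bytes.fromhex("0e 3d3b")
print(line_problems(good))
print(line_problems(bad))
```

The second line is reported twice: it uses SO without an
SOdesignation on the same line, and it never returns to ASCII.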

Example: the hex sequence

1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f

represents the Chinese word for "Interchange" (jiao huan) twice;

the first time in simplified form using GB-2312 (the 3d 3b 3b 3b

sequence above), and the second time in traditional form using

CNS-11643 (the 47 28 5f 50 sequence above). The sequence 1b 24 29

41 is the SOdesignation for GB-2312, the 0e is SO to switch to

Chinese from ASCII, the 1b 24 29 47 is the SOdesignation for

CNS-11643 plane 1, and finally the 0f is the SI to return to ASCII

at the end of the line.
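
The walk-through above can be reproduced with a short Python sketch.
The helper so_characters() and the table SO_SETS are hypothetical
names. Converting the GB 2312 codes to Unicode relies on Python's
"gb2312" codec, which expects the EUC-CN form in which each byte has
its high bit set; the Python standard library has no CNS 11643 codec,
so those codes are left as raw bytes.

```python
# Final characters of the SOdesignations listed for ISO-2022-CN.
SO_SETS = {b"A": "GB 2312", b"G": "CNS 11643-plane-1"}


def so_characters(data: bytes) -> list:
    """Return (character set, two-byte code) pairs for SO-shifted text."""
    so_set = None             # set named by the latest SOdesignation
    shifted = False           # True between SO (0e) and SI (0f)
    chars = []
    i = 0
    while i < len(data):
        if data[i:i + 3] == b"\x1b$)":
            so_set = SO_SETS[data[i + 3:i + 4]]
            i += 4
        elif data[i] == 0x0E:
            shifted = True
            i += 1
        elif data[i] == 0x0F:
            shifted = False
            i += 1
        elif shifted:
            chars.append((so_set, data[i:i + 2]))
            i += 2
        else:
            i += 1            # ASCII byte, not of interest here
    return chars


example = bytes.fromhex(
    "1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f"
)
chars = so_characters(example)
for name, code in chars:
    print(name, code.hex(" "))

# Set the high bit of each GB 2312 byte to obtain EUC-CN, then decode.
gb_codes = b"".join(code for name, code in chars if name == "GB 2312")
gb_text = bytes(b | 0x80 for b in gb_codes).decode("gb2312")
print(gb_text)
```

The second SOdesignation takes effect without a new SO, because the
text is still shifted out when it appears.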

The name given to this character encoding is "ISO-2022-CN". This name

is intended to be used as the "charset" parameter in MIME [MIME-1,

MIME-2] messages.

Content-Type: text/plain; charset=iso-2022-cn

The ISO-2022-CN encoding is already in 7-bit form, so it is not

necessary to use a Content-Transfer-Encoding header.
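
As an illustration of the labelling described above, the following
Python sketch wraps already-encoded ISO-2022-CN text in a message
using the standard email package. The function build_message() is a
hypothetical helper; it only checks that the body is 7-bit and does
not validate the ISO-2022-CN syntax itself.

```python
from email.message import Message


def build_message(body: bytes) -> Message:
    """Label 7-bit ISO-2022-CN text with the charset parameter."""
    if any(b > 0x7F for b in body):
        raise ValueError("ISO-2022-CN text must be 7-bit")
    msg = Message()
    msg["MIME-Version"] = "1.0"
    msg["Content-Type"] = "text/plain; charset=iso-2022-cn"
    # Every byte is below 0x80, so the body can be carried unchanged
    # and no Content-Transfer-Encoding header is added.
    msg.set_payload(body.decode("ascii"))
    return msg


body = bytes.fromhex("1b242941 0e 3d3b3b3b 0f") + b"\r\n"
msg = build_message(body)
print(msg["Content-Type"])
print(msg.get_content_charset())
```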

Other restrictions are given in the "Formal Syntax of ISO-2022-CN"

(Section 7.1 of this document).

1.3. ISO-2022-CN-EXT

ISO-2022-CN-EXT supports all characters in existing GB, Big5 and CNS

11643 character sets.

The escape sequences, shift functions and character sets used in an

ISO-2022-CN-EXT text are as follows:

       Character sets                                  Shift in with
--------------------------------------------------------------------
       ASCII                                           SI
       GB 2312, GB 12345, CNS 11643-plane-1,
       ISO-IR-165                                      SO
       GB 7589, GB 13131, CNS 11643-plane-2            SS2
       GB 7590, GB 13132 or other new GBs,
       CNS 11643-plane-3 or higher planes of
       CNS 11643                                       SS3

Note: Currently, there are some GB sets that have not been

registered in ISO. Here <X7589>, <X7590>, <X12345>, <X13131> and

<X13132> represent the final character that will be assigned by

ISO for those sets. These GB sets shall only be used once these

final characters are assigned.

ESC $ ) A          Indicates the bytes following SO are Chinese
                   characters as defined in GB 2312-80, until
                   another SOdesignation appears

ESC $ * <X7589>    Indicates the two bytes immediately following
                   SS2 form a Chinese character as defined in GB
                   7589-87 [GB-7589], until another SS2designation
                   appears

ESC $ + <X7590>    Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in GB
                   7590-87 [GB-7590], until another SS3designation
                   appears

ESC $ ) <X12345>   Indicates the bytes following SO are as defined
                   in GB 12345-90 [GB-12345], until another
                   SOdesignation appears

ESC $ * <X13131>   Indicates the two bytes immediately following
                   SS2 form a Chinese character as defined in GB
                   13131-91 [GB-13131], until another
                   SS2designation appears

ESC $ + <X13132>   Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in GB
                   13132-91 [GB-13132], until another
                   SS3designation appears

ESC $ ) E          Indicates the bytes following SO are as defined
                   in ISO-IR-165 (for details, see section 2.1),
                   until another SOdesignation appears

ESC $ ) G          Indicates the bytes following SO are as defined
                   in CNS 11643-plane-1, until another
                   SOdesignation appears

ESC $ * H          Indicates the two bytes immediately following
                   SS2 form a Chinese character as defined in CNS
                   11643-plane-2, until another SS2designation
                   appears

ESC $ + I Indicates the immediate two bytes following SS3