
Explanation of UTF-8 and Unicode

Wangchao java/jsp · Author unknown  2006-01-09

What is Unicode?

A mapping between characters and index numbers (code points). A code point is written as U+xxxx, where xxxx is a hexadecimal number.
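To make the mapping concrete, here is a minimal sketch that prints the code point of each character in a short string in U+xxxx notation. The class name CodePointDemo and the helper toUPlus are made up for this illustration; the characters are written as escapes so the source file stays pure ASCII.

```java
class CodePointDemo {

    /** Formats a code point in the conventional U+XXXX notation. */
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Letter A, the copyright sign and the not-equal sign.
        String text = "A\u00A9\u2260";
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Character.getName returns the official Unicode character name.
            System.out.println(toUPlus(cp) + "  " + Character.getName(cp));
            i += Character.charCount(cp);
        }
    }
}
```

The number printed is only the index in the character set. It says nothing yet about how the character is stored as bytes.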

Confused about Unicode and UTF-8?

Unicode is a standard character set. UTF-8 is one way of encoding it as bytes, alongside others such as UCS-2 and UCS-4, and it has become the standard way of encoding. Note one thing, though: for plain English characters the code point and the encoded byte have the same value, because

U-00000000 - U-0000007F: 0xxxxxxx

For many people, programmers especially, the range U-00000000 - U-0000007F is enough for daily use (the 26 English letters, digits and some symbols), so for them there is no visible difference between the character set standard (Unicode) and the encoding standard (UTF-8). When they talk to you about either one, it can get confusing.

Why UTF-8?

You may ask: why not use UCS-4 or UCS-2? Do people just like 8 more (in Cantonese, "eight" sounds like "getting rich")?

The answer is no. Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings in these encodings can contain, as parts of many wide characters, bytes such as '\0' or '/', which have a special meaning in filenames and other C library function parameters.

(An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.)
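The problem can be observed directly. The sketch below uses Java's UTF-16BE as a stand-in for UCS-2 (the two agree for BMP characters); the class name NulByteDemo and the helper containsByte are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class NulByteDemo {

    /** Returns true if any byte of the array has the given unsigned value. */
    static boolean containsByte(byte[] bytes, int value) {
        for (byte b : bytes) {
            if ((b & 0xFF) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // ASCII text in a 16-bit encoding: every second byte is 0x00.
        byte[] wide = "ab".getBytes(StandardCharsets.UTF_16BE);
        System.out.println("16-bit 'ab' has a NUL byte: " + containsByte(wide, 0x00));

        // U+012F is a non-ASCII letter, yet its 16-bit form is 01 2F,
        // and 0x2F is the byte value of '/'.
        String ogonek = "\u012F";
        System.out.println("16-bit U+012F has a '/' byte: "
                + containsByte(ogonek.getBytes(StandardCharsets.UTF_16BE), '/'));
        System.out.println("UTF-8  U+012F has a '/' byte: "
                + containsByte(ogonek.getBytes(StandardCharsets.UTF_8), '/'));
    }
}
```

A C function that stops at the first '\0', or a filesystem call that splits on '/', would misread the 16-bit forms, while the UTF-8 form of U+012F contains neither byte.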

In addition, the majority of UNIX tools expect ASCII files and can't read 16-bit words as characters without major modifications. In UTF-8, by contrast, U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

------------Prove that ASCII and UTF-8 are the same---------

package unicode;

public class CharTest {

    public static void main(String[] args) throws Exception {
        // U+007F is the last character in the ASCII range.
        char[] chars = new char[]{'\u007F'};
        String str = new String(chars);
        System.out.println("within 0000 - 007F : " + str);

        // For characters below U+0080, encoding with ISO-8859-1 or UTF-8
        // gives the same bytes, so the two are compatible.
        System.out.println(" UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println(" ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));

        chars = new char[]{'\u00F2'};
        str = new String(chars);

        // The principle above does not apply to characters above U+007F:
        // both round trips below print something other than the original.
        System.out.println("out of 0000 - 007F : " + str);
        System.out.println(" UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println(" ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
    }
}

---------------------------------------------------------------------------------

How long is a UTF-8 encoded character?

Theoretically it can be up to 6 bytes, but in practice 3 bytes are enough for us, since no BMP character needs more than 3. (The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFF), which is called the Basic Multilingual Plane (BMP).) Note that RFC 3629 later restricted UTF-8 to code points up to U+10FFFF, so a conforming encoder never produces more than 4 bytes.
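The byte counts are easy to check with Java's built-in encoder. This is a small sketch; the class name Utf8Length and the sample code points are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {

    /** Number of bytes Java's UTF-8 encoder produces for one code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII, Latin-1 range, two BMP characters, and one character beyond the BMP.
        int[] samples = {0x41, 0xA9, 0x2260, 0xFFFD, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

The BMP samples need at most 3 bytes. Only U+1F600, which lies outside the BMP, needs 4.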

Important UTF-8 features:

1. UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

2. All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. (Author's note: I can't fully explain this yet; further investigation is needed.)

4. All possible 2^31 UCS codes can be encoded.

5. UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are only up to three bytes long.

6. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
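One way to read the resynchronization claim in feature 3: because continuation bytes (10xxxxxx) can never be mistaken for a first byte, a reader that lands in the middle of a character only has to skip continuation bytes to find the next character boundary. The sketch below shows this idea under that interpretation; the class name Utf8Resync and its methods are made up for the example.

```java
class Utf8Resync {

    /** True for continuation bytes, which have the bit pattern 10xxxxxx. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Index of the first byte at or after 'from' that starts a character. */
    static int nextCharStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && isContinuation(data[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // "A", U+2260 (E2 89 A0) and "B" encoded in UTF-8.
        byte[] data = {0x41, (byte) 0xE2, (byte) 0x89, (byte) 0xA0, 0x42};
        // Pretend the first two bytes were lost, so reading starts inside U+2260.
        int start = nextCharStart(data, 2);
        System.out.println("resynchronized at index " + start); // index 4, the 'B'
    }
}
```

No state from earlier bytes is needed to do this, which is what makes the encoding stateless: at most one damaged character is lost and decoding continues correctly after it.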

------------Prove features 1, 2 and 3-----------------

package unicode;

public class UTF8Features {

    public static void main(String[] args) throws Exception {
        // Why not write non-ASCII characters directly in the source?
        // Because the source file is read using your system's encoding,
        // which is not necessarily UTF-8 as you might imagine.
        char[] chars = new char[]{'\u007F'};
        String str = new String(chars);
        System.out.println("Point 1 : " + str);
        System.out.println(" UTF-8 -> ISO-8859-1 "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println(" ISO-8859-1 -> UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
        System.out.println();

        chars = new char[]{'\uE840'};
        str = new String(chars);
        System.out.println("Point 2 : " + str);
        // Just a sample; use this method to verify more characters.
        // Every byte printed should be 80 or greater.
        System.out.println(" No byte below 80 " + getHexString(str));

        chars = new char[]{'\u2260'};
        str = new String(chars);
        // Just a sample; use this method to verify more characters.
        System.out.println("Point 3 : " + str);
        System.out.println(" Range of 1st Byte " + getHexString(str));
    }

    public static String getHexString(String num) throws Exception {
        StringBuilder sb = new StringBuilder();
        // You must specify UTF-8 here; otherwise the default encoding,
        // which depends on your environment, is used.
        byte[] bytes = num.getBytes("UTF-8");
        for (int i = 0; i < bytes.length; i++) {
            // Mask with 0xFF to read the signed byte as a value from 0 to 255.
            sb.append(Integer.toHexString(bytes[i] & 0xFF).toUpperCase() + " ");
        }
        return sb.toString();
    }
}

---------------------------------------------------------------------------------

Principle of representing a Unicode code point in UTF-8:

U-00000000 - U-0000007F: 0xxxxxxx

U-00000080 - U-000007FF: 110xxxxx 10xxxxxx

U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

How to use the principle above?

Sample:

The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

Explanation:

A: 1010

9: 1001

Row 2 of the table applies, because 00000080 <= 00A9 <= 000007FF. Fill in the x bits from low to high:

1. The low byte has 6 x bits, so take the last 6 bits of 10101001 (A9), which are 101001.

2. The high byte has 5 x bits. Take the remaining 2 bits of A9, which are 10, and pad them on the left with three 0s to get 5 bits: 00010.

Complete the low byte with the prefix 10: (10) + (101001) -> 10101001

Complete the high byte with the prefix 110: (110) + (00010) -> 11000010

The result is

11000010 10101001 = 0xC2 0xA9
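The manual steps above can be expressed as shifts and masks. The following is a minimal sketch of an encoder for the first four rows of the table (up to U+10FFFF); the class name Utf8Encoder and its methods are made up for this example, and surrogate code points are not rejected, so it is not a full validating encoder.

```java
class Utf8Encoder {

    /** Encodes one code point using rows 1 to 4 of the table. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range");
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[]{(byte) cp};
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xC0 | (cp >> 6)),      // prefix 110 + high 5 bits
                (byte) (0x80 | (cp & 0x3F))};   // prefix 10 + low 6 bits
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[]{                      // 11110xxx and three 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    /** Renders bytes as space-separated two-digit hex values. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0xA9))); // C2 A9
    }
}
```

The same method can be called with any other code point to compare the result against a hand calculation.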

You can verify the following character against row 3 of the table in the same way:

U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
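Another way to verify an encoding is to run the table backwards: strip the prefixes and glue the x bits together again. The sketch below does this for sequences of 1 to 3 bytes. The class name Utf8Decoder is made up for this example, and the code assumes well-formed input (it does not check that the following bytes really are 10xxxxxx).

```java
class Utf8Decoder {

    /** Decodes the single UTF-8 sequence starting at bytes[0] (1 to 3 bytes). */
    static int decode(byte[] bytes) {
        int b0 = bytes[0] & 0xFF;
        if (b0 < 0x80) {                        // 0xxxxxxx
            return b0;
        }
        if ((b0 & 0xE0) == 0xC0) {              // 110xxxxx 10xxxxxx
            return ((b0 & 0x1F) << 6)
                    | (bytes[1] & 0x3F);
        }
        if ((b0 & 0xF0) == 0xE0) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return ((b0 & 0x0F) << 12)
                    | ((bytes[1] & 0x3F) << 6)
                    | (bytes[2] & 0x3F);
        }
        throw new IllegalArgumentException("unsupported first byte");
    }

    public static void main(String[] args) {
        byte[] notEqual = {(byte) 0xE2, (byte) 0x89, (byte) 0xA0};
        System.out.printf("U+%04X%n", decode(notEqual)); // U+2260
    }
}
```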

Reference:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#unicode

 
 
 