
Java Chinese Text Processing Study Notes: Hello Unicode

Wangchao Kitchen (王朝厨房) · Author unknown · 2007-01-02

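Copyright notice: this article may be freely reproduced, provided that the reproduction includes a hyperlink to the original source, the author information, and this notice.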

http://www.chedong.com/tech/hello_unicode.html

Keywords: linux java multibyte encoding locale i18n l10n chinese ISO-8859-1 GB2312 BIG5 GBK UNICODE

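Abstract:

Have you ever wondered why PHP applications rarely suffer from garbled text, while building web applications in Java is so much more troublesome? Why can a Simplified Chinese query on Google return Traditional Chinese, and even Japanese, results? And how does Google automatically switch to a Chinese interface based on the language setting of my browser?

Studying a number of internationalized applications taught me the following: Unicode was designed to make internationalization easier, and Java's core character type is based on Unicode. This gives applications control over Chinese text at the level of characters rather than bytes. But if the rules are not understood thoroughly, this freedom becomes a burden and leads to even more garbled text: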

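Test 1: How the operating system's locale settings affect the default encoding of a Java application

To understand how a Java application handles encodings, we first need to understand how the operating system influences the JVM's default encoding. I therefore wrote Env.java, which prints the JVM properties and the locales the system supports on different systems. The program is simple: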

/*
 * Copyright (c) 2002 Email: chedongATbigfoot.com/chedongATchedong.com
 * $Id: hello_unicode.html,v 1.6 2003/11/09 07:57:11 chedong Exp $
 */

import java.util.*;
import java.text.*;

/**
 * Purpose:
 *   Display the environment variables and the JVM default properties
 * Input: none
 * Output:
 *   1. the supported locales
 *   2. the JVM default properties
 */
public class Env {

    /**
     * main entrance
     */
    public static void main(String[] args) {
        System.out.println("Hello, it's: " + new Date());

        // print available locales
        Locale[] list = DateFormat.getAvailableLocales();
        System.out.println("======System available locales:======== ");
        for (int i = 0; i < list.length; i++) {
            System.out.println(list[i].toString() + "\t" + list[i].getDisplayName());
        }

        // print JVM default properties
        System.out.println("======System property======== ");
        System.getProperties().list(System.out);
    }
}

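The property that deserves the most attention is the JVM's file.encoding. It determines the JVM's default encoding and decoding: whenever the application does not name an encoding explicitly, it governs how byte streams are decoded into character streams and how character streams are encoded into byte streams.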
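As a minimal sketch of what this default means in practice, the following compares encoding a string with the JVM default against naming the encoding explicitly. The class name DefaultEncodingDemo and its helper methods are hypothetical, and the example relies on java.nio.charset, which was added to Java after the J2SE 1.3 release used in this article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultEncodingDemo {
    // The four Chinese characters from the article, written as Unicode escapes
    // so that the source file encoding does not matter
    static final String HELLO = "\u4e16\u754c\u4f60\u597d";

    // Encodes with the JVM default charset; the result depends on the environment
    static byte[] encodeWithDefault(String s) {
        return s.getBytes(Charset.defaultCharset());
    }

    // Encodes with an explicitly named charset; the result is the same everywhere
    static byte[] encodeWithUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("default bytes   = " + encodeWithDefault(HELLO).length);
        System.out.println("UTF-8 bytes     = " + encodeWithUtf8(HELLO).length);
    }
}
```

Running it under different locale settings should change the first three lines but not the last one. Note that on recent JDKs (18 and later) the default charset is UTF-8 regardless of the locale unless configured otherwise, so the locale dependence described here applies mainly to older runtimes.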

On Linux, the locale can be set with LANG=zh_CN; LC_ALL=zh_CN.GBK; export LANG LC_ALL. The locale command displays the current settings.

On Windows, the locale is set through Control Panel ==> Regional Settings.

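Test environments (the four outputs below appear in this order):

1. GNU/Linux 2.4.x (J2SE 1.3.1), LANG=en_US LC_ALL=en_US

2. GNU/Linux 2.4.x (J2SE 1.3.1), LANG=zh_CN LC_ALL=zh_CN.GBK

3. Windows 2000 (J2SE 1.3.0), regional setting: China, Chinese

4. Windows 2000 (J2SE 1.3.0), regional setting: United Kingdom, English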

Hello, it's: Tue Jul 30 11:05:44 CST 2002

======System available locales:========

en English

en_US English (United States)

ar Arabic

ar_AE Arabic (United Arab Emirates)

ar_BH Arabic (Bahrain)

ar_DZ Arabic (Algeria)

ar_EG Arabic (Egypt)

ar_IQ Arabic (Iraq)

ar_JO Arabic (Jordan)

ar_KW Arabic (Kuwait)

ar_LB Arabic (Lebanon)

ar_LY Arabic (Libya)

ar_MA Arabic (Morocco)

ar_OM Arabic (Oman)

ar_QA Arabic (Qatar)

ar_SA Arabic (Saudi Arabia)

ar_SD Arabic (Sudan)

ar_SY Arabic (Syria)

ar_TN Arabic (Tunisia)

ar_YE Arabic (Yemen)

be Byelorussian

be_BY Byelorussian (Belarus)

bg Bulgarian

bg_BG Bulgarian (Bulgaria)

ca Catalan

ca_ES Catalan (Spain)

ca_ES_EURO Catalan (Spain,Euro)

cs Czech

cs_CZ Czech (Czech Republic)

da Danish

da_DK Danish (Denmark)

de German

de_AT German (Austria)

de_AT_EURO German (Austria,Euro)

de_CH German (Switzerland)

de_DE German (Germany)

de_DE_EURO German (Germany,Euro)

de_LU German (Luxembourg)

de_LU_EURO German (Luxembourg,Euro)

el Greek

el_GR Greek (Greece)

en_AU English (Australia)

en_CA English (Canada)

en_GB English (United Kingdom)

en_IE English (Ireland)

en_IE_EURO English (Ireland,Euro)

en_NZ English (New Zealand)

en_ZA English (South Africa)

es Spanish

es_BO Spanish (Bolivia)

es_AR Spanish (Argentina)

es_CL Spanish (Chile)

es_CO Spanish (Colombia)

es_CR Spanish (Costa Rica)

es_DO Spanish (Dominican Republic)

es_EC Spanish (Ecuador)

es_ES Spanish (Spain)

es_ES_EURO Spanish (Spain,Euro)

es_GT Spanish (Guatemala)

es_HN Spanish (Honduras)

es_MX Spanish (Mexico)

es_NI Spanish (Nicaragua)

et Estonian

es_PA Spanish (Panama)

es_PE Spanish (Peru)

es_PR Spanish (Puerto Rico)

es_PY Spanish (Paraguay)

es_SV Spanish (El Salvador)

es_UY Spanish (Uruguay)

es_VE Spanish (Venezuela)

et_EE Estonian (Estonia)

fi Finnish

fi_FI Finnish (Finland)

fi_FI_EURO Finnish (Finland,Euro)

fr French

fr_BE French (Belgium)

fr_BE_EURO French (Belgium,Euro)

fr_CA French (Canada)

fr_CH French (Switzerland)

fr_FR French (France)

fr_FR_EURO French (France,Euro)

fr_LU French (Luxembourg)

fr_LU_EURO French (Luxembourg,Euro)

hr Croatian

hr_HR Croatian (Croatia)

hu Hungarian

hu_HU Hungarian (Hungary)

is Icelandic

is_IS Icelandic (Iceland)

it Italian

it_CH Italian (Switzerland)

it_IT Italian (Italy)

it_IT_EURO Italian (Italy,Euro)

iw Hebrew

iw_IL Hebrew (Israel)

ja Japanese

ja_JP Japanese (Japan)

ko Korean

ko_KR Korean (South Korea)

lt Lithuanian

lt_LT Lithuanian (Lithuania)

lv Latvian (Lettish)

lv_LV Latvian (Lettish) (Latvia)

mk Macedonian

mk_MK Macedonian (Macedonia)

nl Dutch

nl_BE Dutch (Belgium)

nl_BE_EURO Dutch (Belgium,Euro)

nl_NL Dutch (Netherlands)

nl_NL_EURO Dutch (Netherlands,Euro)

no Norwegian

no_NO Norwegian (Norway)

no_NO_NY Norwegian (Norway,Nynorsk)

pl Polish

pl_PL Polish (Poland)

pt Portuguese

pt_BR Portuguese (Brazil)

pt_PT Portuguese (Portugal)

pt_PT_EURO Portuguese (Portugal,Euro)

ro Romanian

ro_RO Romanian (Romania)

ru Russian

ru_RU Russian (Russia)

sh Serbo-Croatian

sh_YU Serbo-Croatian (Yugoslavia)

sk Slovak

sk_SK Slovak (Slovakia)

sl Slovenian

sl_SI Slovenian (Slovenia)

sq Albanian

sq_AL Albanian (Albania)

sr Serbian

sr_YU Serbian (Yugoslavia)

sv Swedish

sv_SE Swedish (Sweden)

th Thai

th_TH Thai (Thailand)

tr Turkish

tr_TR Turkish (Turkey)

uk Ukrainian

uk_UA Ukrainian (Ukraine)

zh Chinese

zh_CN Chinese (China)

zh_HK Chinese (Hong Kong)

zh_TW Chinese (Taiwan)

======System property========

-- listing properties --

java.runtime.name=Java(TM) 2 Runtime Environment, Stand...

sun.boot.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386

java.vm.version=1.3.1_04-b02

java.vm.vendor=Sun Microsystems Inc.

java.vendor.url=http://java.sun.com/

path.separator=:

java.vm.name=Java HotSpot(TM) Client VM

file.encoding.pkg=sun.io

java.vm.specification.name=Java Virtual Machine Specification

user.dir=/home/chedong/src/char_test

java.runtime.version=1.3.1_04-b02

java.awt.graphicsenv=sun.awt.X11GraphicsEnvironment

os.arch=i386

java.io.tmpdir=/tmp

line.separator=

java.vm.specification.vendor=Sun Microsystems Inc.

java.awt.fonts=

os.name=Linux

java.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386:/u...

java.specification.name=Java Platform API Specification

java.class.version=47.0

os.version=2.4.7-10

user.home=/home/chedong

user.timezone=Asia/Shanghai

java.awt.printerjob=sun.awt.motif.PSPrinterJob

file.encoding=ISO-8859-1

java.specification.version=1.3

user.name=chedong

java.class.path=/home/chedong/classes

java.vm.specification.version=1.0

java.home=/usr/java/jdk1.3.1_04/jre

user.language=en

java.specification.vendor=Sun Microsystems Inc.

java.vm.info=mixed mode

java.version=1.3.1_04

java.ext.dirs=/usr/java/jdk1.3.1_04/jre/lib/ext

sun.boot.class.path=/usr/java/jdk1.3.1_04/jre/lib/rt.jar:...

java.vendor=Sun Microsystems Inc.

file.separator=/

java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...

sun.cpu.endian=little

sun.io.unicode.encoding=UnicodeLittle

user.region=US

sun.cpu.isalist=

Hello, it's: Tue Jul 30 11:07:34 CST 2002

======System available locales:========

en 英文

en_US 英文 (美国)

ar 阿拉伯文

ar_AE 阿拉伯文 (阿拉伯联合酋长国)

ar_BH 阿拉伯文 (巴林)

ar_DZ 阿拉伯文 (阿尔及利亚)

ar_EG 阿拉伯文 (埃及)

ar_IQ 阿拉伯文 (伊拉克)

ar_JO 阿拉伯文 (约旦)

ar_KW 阿拉伯文 (科威特)

ar_LB 阿拉伯文 (黎巴嫩)

ar_LY 阿拉伯文 (利比亚)

ar_MA 阿拉伯文 (摩洛哥)

ar_OM 阿拉伯文 (阿曼)

ar_QA 阿拉伯文 (卡塔尔)

ar_SA 阿拉伯文 (沙特阿拉伯)

ar_SD 阿拉伯文 (苏丹)

ar_SY 阿拉伯文 (叙利亚)

ar_TN 阿拉伯文 (突尼斯)

ar_YE 阿拉伯文 (也门)

be 白俄罗斯文

be_BY 白俄罗斯文 (白俄罗斯)

bg 保加利亚文

bg_BG 保加利亚文 (保加利亚)

ca 加泰罗尼亚文

ca_ES 加泰罗尼亚文 (西班牙)

ca_ES_EURO 加泰罗尼亚文 (西班牙,Euro)

cs 捷克文

cs_CZ 捷克文 (捷克共和国)

da 丹麦文

da_DK 丹麦文 (丹麦)

de 德文

de_AT 德文 (奥地利)

de_AT_EURO 德文 (奥地利,Euro)

de_CH 德文 (瑞士)

de_DE 德文 (德国)

de_DE_EURO 德文 (德国,Euro)

de_LU 德文 (卢森堡)

de_LU_EURO 德文 (卢森堡,Euro)

el 希腊文

el_GR 希腊文 (希腊)

en_AU 英文 (澳大利亚)

en_CA 英文 (加拿大)

en_GB 英文 (英国)

en_IE 英文 (爱尔兰)

en_IE_EURO 英文 (爱尔兰,Euro)

en_NZ 英文 (新西兰)

en_ZA 英文 (南非)

es 西班牙文

es_BO 西班牙文 (玻利维亚)

es_AR 西班牙文 (阿根廷)

es_CL 西班牙文 (智利)

es_CO 西班牙文 (哥伦比亚)

es_CR 西班牙文 (哥斯达黎加)

es_DO 西班牙文 (多米尼加共和国)

es_EC 西班牙文 (厄瓜多尔)

es_ES 西班牙文 (西班牙)

es_ES_EURO 西班牙文 (西班牙,Euro)

es_GT 西班牙文 (危地马拉)

es_HN 西班牙文 (洪都拉斯)

es_MX 西班牙文 (墨西哥)

es_NI 西班牙文 (尼加拉瓜)

et 爱沙尼亚文

es_PA 西班牙文 (巴拿马)

es_PE 西班牙文 (秘鲁)

es_PR 西班牙文 (波多黎哥)

es_PY 西班牙文 (巴拉圭)

es_SV 西班牙文 (萨尔瓦多)

es_UY 西班牙文 (乌拉圭)

es_VE 西班牙文 (委内瑞拉)

et_EE 爱沙尼亚文 (爱沙尼亚)

fi 芬兰文

fi_FI 芬兰文 (芬兰)

fi_FI_EURO 芬兰文 (芬兰,Euro)

fr 法文

fr_BE 法文 (比利时)

fr_BE_EURO 法文 (比利时,Euro)

fr_CA 法文 (加拿大)

fr_CH 法文 (瑞士)

fr_FR 法文 (法国)

fr_FR_EURO 法文 (法国,Euro)

fr_LU 法文 (卢森堡)

fr_LU_EURO 法文 (卢森堡,Euro)

hr 克罗地亚文

hr_HR 克罗地亚文 (克罗地亚)

hu 匈牙利文

hu_HU 匈牙利文 (匈牙利)

is 冰岛文

is_IS 冰岛文 (冰岛)

it 意大利文

it_CH 意大利文 (瑞士)

it_IT 意大利文 (意大利)

it_IT_EURO 意大利文 (意大利,Euro)

iw 希伯来文

iw_IL 希伯来文 (以色列)

ja 日文

ja_JP 日文 (日本)

ko 朝鲜文

ko_KR 朝鲜文 (南朝鲜)

lt 立陶宛文

lt_LT 立陶宛文 (立陶宛)

lv 拉托维亚文(列托)

lv_LV 拉托维亚文(列托) (拉脱维亚)

mk 马其顿文

mk_MK 马其顿文 (马其顿王国)

nl 荷兰文

nl_BE 荷兰文 (比利时)

nl_BE_EURO 荷兰文 (比利时,Euro)

nl_NL 荷兰文 (荷兰)

nl_NL_EURO 荷兰文 (荷兰,Euro)

no 挪威文

no_NO 挪威文 (挪威)

no_NO_NY 挪威文 (挪威,Nynorsk)

pl 波兰文

pl_PL 波兰文 (波兰)

pt 葡萄牙文

pt_BR 葡萄牙文 (巴西)

pt_PT 葡萄牙文 (葡萄牙)

pt_PT_EURO 葡萄牙文 (葡萄牙,Euro)

ro 罗马尼亚文

ro_RO 罗马尼亚文 (罗马尼亚)

ru 俄文

ru_RU 俄文 (俄罗斯)

sh 塞波尼斯-克罗地亚文

sh_YU 塞波尼斯-克罗地亚文 (南斯拉夫)

sk 斯洛伐克文

sk_SK 斯洛伐克文 (斯洛伐克)

sl 斯洛文尼亚文

sl_SI 斯洛文尼亚文 (斯洛文尼亚)

sq 阿尔巴尼亚文

sq_AL 阿尔巴尼亚文 (阿尔巴尼亚)

sr 塞尔维亚文

sr_YU 塞尔维亚文 (南斯拉夫)

sv 瑞典文

sv_SE 瑞典文 (瑞典)

th 泰文

th_TH 泰文 (泰国)

tr 土耳其文

tr_TR 土耳其文 (土耳其)

uk 乌克兰文

uk_UA 乌克兰文 (乌克兰)

zh 中文

zh_CN 中文 (中国)

zh_HK 中文 (香港)

zh_TW 中文 (台湾)

======System property========

-- listing properties --

java.runtime.name=Java(TM) 2 Runtime Environment, Stand...

sun.boot.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386

java.vm.version=1.3.1_04-b02

java.vm.vendor=Sun Microsystems Inc.

java.vendor.url=http://java.sun.com/

path.separator=:

java.vm.name=Java HotSpot(TM) Client VM

file.encoding.pkg=sun.io

java.vm.specification.name=Java Virtual Machine Specification

user.dir=/home/chedong/src/char_test

java.runtime.version=1.3.1_04-b02

java.awt.graphicsenv=sun.awt.X11GraphicsEnvironment

os.arch=i386

java.io.tmpdir=/tmp

line.separator=

java.vm.specification.vendor=Sun Microsystems Inc.

java.awt.fonts=

os.name=Linux

java.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386:/u...

java.specification.name=Java Platform API Specification

java.class.version=47.0

os.version=2.4.7-10

user.home=/home/chedong

user.timezone=Asia/Shanghai

java.awt.printerjob=sun.awt.motif.PSPrinterJob

file.encoding=GBK

java.specification.version=1.3

user.name=chedong

java.class.path=/home/chedong/classes

java.vm.specification.version=1.0

java.home=/usr/java/jdk1.3.1_04/jre

user.language=zh

java.specification.vendor=Sun Microsystems Inc.

java.vm.info=mixed mode

java.version=1.3.1_04

java.ext.dirs=/usr/java/jdk1.3.1_04/jre/lib/ext

sun.boot.class.path=/usr/java/jdk1.3.1_04/jre/lib/rt.jar:...

java.vendor=Sun Microsystems Inc.

file.separator=/

java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...

sun.cpu.endian=little

sun.io.unicode.encoding=UnicodeLittle

user.region=CN

sun.cpu.isalist=

Hello, it's: Tue Jul 30 11:49:36 CST 2002

======System available locales:========

en English

en_US English (United States)

ar Arabic

ar_AE Arabic (United Arab Emirates)

ar_BH Arabic (Bahrain)

ar_DZ Arabic (Algeria)

ar_EG Arabic (Egypt)

ar_IQ Arabic (Iraq)

ar_JO Arabic (Jordan)

ar_KW Arabic (Kuwait)

ar_LB Arabic (Lebanon)

ar_LY Arabic (Libya)

ar_MA Arabic (Morocco)

ar_OM Arabic (Oman)

ar_QA Arabic (Qatar)

ar_SA Arabic (Saudi Arabia)

ar_SD Arabic (Sudan)

ar_SY Arabic (Syria)

ar_TN Arabic (Tunisia)

ar_YE Arabic (Yemen)

be Byelorussian

be_BY Byelorussian (Belarus)

bg Bulgarian

bg_BG Bulgarian (Bulgaria)

ca Catalan

ca_ES Catalan (Spain)

ca_ES_EURO Catalan (Spain,Euro)

cs Czech

cs_CZ Czech (Czech Republic)

da Danish

da_DK Danish (Denmark)

de German

de_AT German (Austria)

de_AT_EURO German (Austria,Euro)

de_CH German (Switzerland)

de_DE German (Germany)

de_DE_EURO German (Germany,Euro)

de_LU German (Luxembourg)

de_LU_EURO German (Luxembourg,Euro)

el Greek

el_GR Greek (Greece)

en_AU English (Australia)

en_CA English (Canada)

en_GB English (United Kingdom)

en_IE English (Ireland)

en_IE_EURO English (Ireland,Euro)

en_NZ English (New Zealand)

en_ZA English (South Africa)

es Spanish

es_AR Spanish (Argentina)

es_BO Spanish (Bolivia)

es_CL Spanish (Chile)

es_CO Spanish (Colombia)

es_CR Spanish (Costa Rica)

es_DO Spanish (Dominican Republic)

es_EC Spanish (Ecuador)

es_ES Spanish (Spain)

es_ES_EURO Spanish (Spain,Euro)

es_GT Spanish (Guatemala)

es_HN Spanish (Honduras)

es_MX Spanish (Mexico)

es_NI Spanish (Nicaragua)

es_PA Spanish (Panama)

es_PE Spanish (Peru)

es_PR Spanish (Puerto Rico)

es_PY Spanish (Paraguay)

es_SV Spanish (El Salvador)

es_UY Spanish (Uruguay)

es_VE Spanish (Venezuela)

et Estonian

et_EE Estonian (Estonia)

fi Finnish

fi_FI Finnish (Finland)

fi_FI_EURO Finnish (Finland,Euro)

fr French

fr_BE French (Belgium)

fr_BE_EURO French (Belgium,Euro)

fr_CA French (Canada)

fr_CH French (Switzerland)

fr_FR French (France)

fr_FR_EURO French (France,Euro)

fr_LU French (Luxembourg)

fr_LU_EURO French (Luxembourg,Euro)

hr Croatian

hr_HR Croatian (Croatia)

hu Hungarian

hu_HU Hungarian (Hungary)

is Icelandic

is_IS Icelandic (Iceland)

it Italian

it_CH Italian (Switzerland)

it_IT Italian (Italy)

it_IT_EURO Italian (Italy,Euro)

iw Hebrew

iw_IL Hebrew (Israel)

ja Japanese

ja_JP Japanese (Japan)

ko 韩文

ko_KR 韩文 (大韩民国)

lt Lithuanian

lt_LT Lithuanian (Lithuania)

lv Latvian (Lettish)

lv_LV Latvian (Lettish) (Latvia)

mk Macedonian

mk_MK Macedonian (Macedonia)

nl Dutch

nl_BE Dutch (Belgium)

nl_BE_EURO Dutch (Belgium,Euro)

nl_NL Dutch (Netherlands)

nl_NL_EURO Dutch (Netherlands,Euro)

no Norwegian

no_NO Norwegian (Norway)

no_NO_NY Norwegian (Norway,Nynorsk)

pl Polish

pl_PL Polish (Poland)

pt Portuguese

pt_BR Portuguese (Brazil)

pt_PT Portuguese (Portugal)

pt_PT_EURO Portuguese (Portugal,Euro)

ro Romanian

ro_RO Romanian (Romania)

ru Russian

ru_RU Russian (Russia)

sh Serbo-Croatian

sh_YU Serbo-Croatian (Yugoslavia)

sk Slovak

sk_SK Slovak (Slovakia)

sl Slovenian

sl_SI Slovenian (Slovenia)

sq Albanian

sq_AL Albanian (Albania)

sr Serbian

sr_YU Serbian (Yugoslavia)

sv Swedish

sv_SE Swedish (Sweden)

th Thai

th_TH Thai (Thailand)

tr Turkish

tr_TR Turkish (Turkey)

uk Ukrainian

uk_UA Ukrainian (Ukraine)

zh 中文

zh_CN 中文 (中华人民共和国)

zh_HK 中文 (香港)

zh_TW 中文 (台湾)

======System property========

-- listing properties --

java.runtime.name=Java(TM) 2 Runtime Environment, Stand...

sun.boot.library.path=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_0...

java.vm.version=1.3.0_02

java.vm.vendor=Sun Microsystems Inc.

java.vendor.url=http://java.sun.com/

path.separator=;

java.vm.name=Java HotSpot(TM) Client VM

file.encoding.pkg=sun.io

java.vm.specification.name=Java Virtual Machine Specification

user.dir=D:\java\src\char_test

java.runtime.version=1.3.0_02

java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment

os.arch=x86

java.io.tmpdir=D:\TEMP\

line.separator=

java.vm.specification.vendor=Sun Microsystems Inc.

java.awt.fonts=

os.name=Windows 98

java.library.path=C:\WINDOWS;.;C:\WINDOWS\SYSTEM;C:\WIN...

java.specification.name=Java Platform API Specification

java.class.version=47.0

os.version=4.90

user.home=C:\WINDOWS

user.timezone=Asia/Shanghai

java.awt.printerjob=sun.awt.windows.WPrinterJob

file.encoding=GBK

java.specification.version=1.3

user.name=Sicci

java.class.path=d:\java\classes

java.vm.specification.version=1.0

java.home=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_02

user.language=zh

java.specification.vendor=Sun Microsystems Inc.

awt.toolkit=sun.awt.windows.WToolkit

java.vm.info=mixed mode

java.version=1.3.0_02

java.ext.dirs=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_0...

sun.boot.class.path=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_0...

java.vendor=Sun Microsystems Inc.

file.separator=\

java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...

sun.cpu.endian=little

sun.io.unicode.encoding=UnicodeLittle

user.region=CN

sun.cpu.isalist=pentium i486 i386

Hello, it's: Tue Jul 30 11:53:27 CST 2002

======System available locales:========

en English

en_US English (United States)

ar Arabic

ar_AE Arabic (United Arab Emirates)

ar_BH Arabic (Bahrain)

ar_DZ Arabic (Algeria)

ar_EG Arabic (Egypt)

ar_IQ Arabic (Iraq)

ar_JO Arabic (Jordan)

ar_KW Arabic (Kuwait)

ar_LB Arabic (Lebanon)

ar_LY Arabic (Libya)

ar_MA Arabic (Morocco)

ar_OM Arabic (Oman)

ar_QA Arabic (Qatar)

ar_SA Arabic (Saudi Arabia)

ar_SD Arabic (Sudan)

ar_SY Arabic (Syria)

ar_TN Arabic (Tunisia)

ar_YE Arabic (Yemen)

be Byelorussian

be_BY Byelorussian (Belarus)

bg Bulgarian

bg_BG Bulgarian (Bulgaria)

ca Catalan

ca_ES Catalan (Spain)

ca_ES_EURO Catalan (Spain,Euro)

cs Czech

cs_CZ Czech (Czech Republic)

da Danish

da_DK Danish (Denmark)

de German

de_AT German (Austria)

de_AT_EURO German (Austria,Euro)

de_CH German (Switzerland)

de_DE German (Germany)

de_DE_EURO German (Germany,Euro)

de_LU German (Luxembourg)

de_LU_EURO German (Luxembourg,Euro)

el Greek

el_GR Greek (Greece)

en_AU English (Australia)

en_CA English (Canada)

en_GB English (United Kingdom)

en_IE English (Ireland)

en_IE_EURO English (Ireland,Euro)

en_NZ English (New Zealand)

en_ZA English (South Africa)

es Spanish

es_AR Spanish (Argentina)

es_BO Spanish (Bolivia)

es_CL Spanish (Chile)

es_CO Spanish (Colombia)

es_CR Spanish (Costa Rica)

es_DO Spanish (Dominican Republic)

es_EC Spanish (Ecuador)

es_ES Spanish (Spain)

es_ES_EURO Spanish (Spain,Euro)

es_GT Spanish (Guatemala)

es_HN Spanish (Honduras)

es_MX Spanish (Mexico)

es_NI Spanish (Nicaragua)

es_PA Spanish (Panama)

es_PE Spanish (Peru)

es_PR Spanish (Puerto Rico)

es_PY Spanish (Paraguay)

es_SV Spanish (El Salvador)

es_UY Spanish (Uruguay)

es_VE Spanish (Venezuela)

et Estonian

et_EE Estonian (Estonia)

fi Finnish

fi_FI Finnish (Finland)

fi_FI_EURO Finnish (Finland,Euro)

fr French

fr_BE French (Belgium)

fr_BE_EURO French (Belgium,Euro)

fr_CA French (Canada)

fr_CH French (Switzerland)

fr_FR French (France)

fr_FR_EURO French (France,Euro)

fr_LU French (Luxembourg)

fr_LU_EURO French (Luxembourg,Euro)

hr Croatian

hr_HR Croatian (Croatia)

hu Hungarian

hu_HU Hungarian (Hungary)

is Icelandic

is_IS Icelandic (Iceland)

it Italian

it_CH Italian (Switzerland)

it_IT Italian (Italy)

it_IT_EURO Italian (Italy,Euro)

iw Hebrew

iw_IL Hebrew (Israel)

ja Japanese

ja_JP Japanese (Japan)

ko Korean

ko_KR Korean (South Korea)

lt Lithuanian

lt_LT Lithuanian (Lithuania)

lv Latvian (Lettish)

lv_LV Latvian (Lettish) (Latvia)

mk Macedonian

mk_MK Macedonian (Macedonia)

nl Dutch

nl_BE Dutch (Belgium)

nl_BE_EURO Dutch (Belgium,Euro)

nl_NL Dutch (Netherlands)

nl_NL_EURO Dutch (Netherlands,Euro)

no Norwegian

no_NO Norwegian (Norway)

no_NO_NY Norwegian (Norway,Nynorsk)

pl Polish

pl_PL Polish (Poland)

pt Portuguese

pt_BR Portuguese (Brazil)

pt_PT Portuguese (Portugal)

pt_PT_EURO Portuguese (Portugal,Euro)

ro Romanian

ro_RO Romanian (Romania)

ru Russian

ru_RU Russian (Russia)

sh Serbo-Croatian

sh_YU Serbo-Croatian (Yugoslavia)

sk Slovak

sk_SK Slovak (Slovakia)

sl Slovenian

sl_SI Slovenian (Slovenia)

sq Albanian

sq_AL Albanian (Albania)

sr Serbian

sr_YU Serbian (Yugoslavia)

sv Swedish

sv_SE Swedish (Sweden)

th Thai

th_TH Thai (Thailand)

tr Turkish

tr_TR Turkish (Turkey)

uk Ukrainian

uk_UA Ukrainian (Ukraine)

zh Chinese

zh_CN Chinese (China)

zh_HK Chinese (Hong Kong)

zh_TW Chinese (Taiwan)

======System property========

-- listing properties --

java.runtime.name=Java(TM) 2 Runtime Environment, Stand...

sun.boot.library.path=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_0...

java.vm.version=1.3.0_02

java.vm.vendor=Sun Microsystems Inc.

java.vendor.url=http://java.sun.com/

path.separator=;

java.vm.name=Java HotSpot(TM) Client VM

file.encoding.pkg=sun.io

java.vm.specification.name=Java Virtual Machine Specification

user.dir=D:\java\src\char_test

java.runtime.version=1.3.0_02

java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment

os.arch=x86

java.io.tmpdir=D:\TEMP\

line.separator=

java.vm.specification.vendor=Sun Microsystems Inc.

java.awt.fonts=

os.name=Windows 98

java.library.path=C:\WINDOWS;.;C:\WINDOWS\SYSTEM;C:\WIN...

java.specification.name=Java Platform API Specification

java.class.version=47.0

os.version=4.90

user.home=C:\WINDOWS

user.timezone=Asia/Shanghai

java.awt.printerjob=sun.awt.windows.WPrinterJob

file.encoding=Cp1252

java.specification.version=1.3

user.name=Sicci

java.class.path=d:\java\classes

java.vm.specification.version=1.0

java.home=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_02

user.language=en

java.specification.vendor=Sun Microsystems Inc.

awt.toolkit=sun.awt.windows.WToolkit

java.vm.info=mixed mode

java.version=1.3.0_02

java.ext.dirs=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_0...

sun.boot.class.path=C:\PROGRAM FILES\JavaSOFT\JRE\1.3.0_0...

java.vendor=Sun Microsystems Inc.

file.separator=\

java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...

sun.cpu.endian=little

sun.io.unicode.encoding=UnicodeLittle

user.region=GB

sun.cpu.isalist=pentium i486 i386

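Conclusion 1:

The JVM's default encoding is determined by the system's locale setting and does not depend on the type of operating system. When the same locale is set, the default encoding on Linux and on Windows is therefore equivalent. (For our purposes Cp1252 and ISO-8859-1 can be treated as the same single-byte Western encoding, covering only Latin characters with values up to 255; strictly speaking, the two differ in the range 0x80-0x9F.) For this reason, Test 2 below lists only the output on GNU/Linux with the locale set to en_US and to zh_CN. Running the same tests on Windows with the corresponding region and character set settings produces the same output.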

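Test 2: Converting byte streams to character streams during Java input and output

The HelloUnicode.java program shows how the string "Hello world 世界你好" (16 characters) is handled under different system default encodings. After each encoding or decoding step, it prints, for every character of the resulting string, its byte value, its short value, and the Unicode block it belongs to.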
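The source of HelloUnicode.java is not reproduced here, so the following is only a sketch of how such a per-character dump could be written. The class name CharDump and its methods are hypothetical; the line format is modelled on the output in this article.

```java
public class CharDump {
    // Formats one char: its low byte, its full 16-bit value,
    // and the Unicode block it belongs to
    static String describe(int index, char c) {
        byte low = (byte) c;   // keeps only the low 8 bits, so it may be negative
        int value = c;         // the full UTF-16 code unit, 0..65535
        return "char[" + index + "]='" + c + "'"
                + " byte=" + low + " \\u" + Integer.toHexString(low).toUpperCase()
                + " short=" + value + " \\u" + Integer.toHexString(value).toUpperCase()
                + " " + Character.UnicodeBlock.of(c);
    }

    // Prints the string, its length in chars, and one line per char
    static void dump(String s) {
        System.out.println("string=" + s + " length=" + s.length());
        for (int i = 0; i < s.length(); i++) {
            System.out.println(describe(i, s.charAt(i)));
        }
    }

    public static void main(String[] args) {
        // The test string, with the Chinese part written as Unicode escapes
        dump("Hello world \u4e16\u754c\u4f60\u597d");
    }
}
```

The cast to byte is what produces values such as -54 and the sign-extended hex form FFFFFFCA for characters in the range 128 to 255, while the int value shows the real code unit.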

The output is listed first for LANG=en_US LC_ALL=en_US and then for LANG=zh_CN LC_ALL=zh_CN.GBK.

========testing1: write hello world to files========

[test 1-1]: with system default encoding=ISO-8859-1

string=Hello world 世界你好 length=20

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='? byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT

char[13]='? byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT

char[14]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT

char[15]='? byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT

char[16]='? byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT

char[17]='? byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT

char[18]='? byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT

char[19]='? byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT

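Step 1: in the English locale the Chinese text appears correctly on screen, but what is actually printed is a sequence of "half" Chinese characters, one byte at a time. The result is written to the first file, hello.orig.html.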
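The "half character" effect can be reproduced directly. This sketch (class and method names are hypothetical, and it assumes the JVM provides the GBK charset) encodes the four Chinese characters as GBK and then decodes the bytes as ISO-8859-1, which yields one char per byte:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HalfCharDemo {
    // Decodes the GBK bytes of a string as ISO-8859-1, one char per byte
    static String misdecodeAsLatin1(String s) {
        byte[] gbkBytes = s.getBytes(Charset.forName("GBK"));
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String hello = "\u4e16\u754c\u4f60\u597d"; // four Chinese characters
        String halves = misdecodeAsLatin1(hello);
        System.out.println("real length   = " + hello.length());
        System.out.println("halved length = " + halves.length());
        for (int i = 0; i < halves.length(); i++) {
            // Every value here is in the range 0x80..0xFF, so two hex digits suffice
            System.out.println(i + ": U+00" + Integer.toHexString(halves.charAt(i)).toUpperCase());
        }
    }
}
```

The eight resulting values correspond to the LATIN_1_SUPPLEMENT entries char[12] to char[19] in test 1-1 above.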

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:

string=Hello world ???? length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS

char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS

char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS

char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

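The string is turned back into a byte stream using the system default encoding and then decoded as GB2312. Although question marks are printed here (in the current English locale the system has no way to display characters above 255, so it shows all of them as '?'), the Unicode mapping and the short values show that the characters are the correct Chinese ones.

The next step, however, writes the string to the second file, hello.gb2312.html, without specifying an encoding, so the system default ISO-8859-1 is used. As a result, what test 2-2 later reads back really are '?' characters.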
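The irreversible step can be isolated in a few lines. In this sketch (hypothetical class name), correctly decoded Chinese characters are encoded as ISO-8859-1, which cannot represent them, so Java substitutes the charset's replacement byte for each one:

```java
import java.nio.charset.StandardCharsets;

public class LossyEncodeDemo {
    // Encodes to ISO-8859-1; characters outside that charset are replaced
    static byte[] encodeAsLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeAsLatin1("\u4e16\u754c\u4f60\u597d");
        System.out.println("byte count = " + bytes.length);
        for (byte b : bytes) {
            // Each Chinese character has been replaced by a single byte
            System.out.println(b + " '" + (char) b + "'");
        }
    }
}
```

Once the bytes have been written this way, the original characters are gone; no choice of decoder on the reading side can bring them back.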

[test 1-3]: convert string to UTF8

string=Hello world 涓栫晫浣犲ソ length=24

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='? byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT

char[13]='? byte=-72 \uFFFFFFB8 short=184 \uB8 LATIN_1_SUPPLEMENT

char[14]='? byte=-106 \uFFFFFF96 short=150 \u96 LATIN_1_SUPPLEMENT

char[15]='? byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT

char[16]='? byte=-107 \uFFFFFF95 short=149 \u95 LATIN_1_SUPPLEMENT

char[17]='? byte=-116 \uFFFFFF8C short=140 \u8C LATIN_1_SUPPLEMENT

char[18]='? byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT

char[19]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT

char[20]='? byte=-96 \uFFFFFFA0 short=160 \uA0 LATIN_1_SUPPLEMENT

char[21]='? byte=-27 \uFFFFFFE5 short=229 \uE5 LATIN_1_SUPPLEMENT

char[22]='? byte=-91 \uFFFFFFA5 short=165 \uA5 LATIN_1_SUPPLEMENT

char[23]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT

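In the third test the character stream is encoded as UTF-8 and written to the third test file, hello.utf8.html. UTF-8 leaves the English text unchanged, but each of these Chinese characters takes 3 bytes instead of the 2 bytes used by GB2312, so Chinese text needs 50% more storage than in GB2312.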
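The size difference is easy to check. This sketch (hypothetical class and method names, assuming the JVM provides the GBK charset) counts the encoded bytes of the English and Chinese parts of the test string separately:

```java
import java.nio.charset.Charset;

public class EncodedSizeDemo {
    // Returns the number of bytes the string occupies in the named charset
    static int byteCount(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String english = "Hello world ";
        String chinese = "\u4e16\u754c\u4f60\u597d"; // four Chinese characters
        System.out.println("English, GBK:   " + byteCount(english, "GBK"));
        System.out.println("English, UTF-8: " + byteCount(english, "UTF-8"));
        System.out.println("Chinese, GBK:   " + byteCount(chinese, "GBK"));
        System.out.println("Chinese, UTF-8: " + byteCount(chinese, "UTF-8"));
    }
}
```

The totals, 20 bytes in GBK and 24 bytes in UTF-8, match the length=20 and length=24 values reported in tests 1-1 and 1-3 under the English locale, where every byte shows up as one char.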

========Testing2: reading and decoding from files========

[test 2-1]: read hello.orig.html: decoding with system default encoding

string=Hello world 世界你好 length=20

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='? byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT

char[13]='? byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT

char[14]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT

char[15]='? byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT

char[16]='? byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT

char[17]='? byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT

char[18]='? byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT

char[19]='? byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT

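Here the intermediate file hello.orig.html is read back using the system default encoding. Although the text is read byte by byte (half a character at a time), the bytes are restored exactly, so the displayed output is correct.

This is precisely why PHP and similar applications rarely run into character set problems: everything is handled as a byte stream from start to finish, which reproduces the input faithfully, but at the cost of losing control over characters.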
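Both halves of this trade-off can be shown in a short sketch (hypothetical names, assuming the JVM provides the GBK charset). Treating the bytes as ISO-8859-1 restores them exactly, but any character-level operation on the carrier string really operates on bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteTransparentDemo {
    // Wraps raw bytes in a String, one char per byte
    static String asByteString(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Unwraps such a String back into the original bytes
    static byte[] backToBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] original = "\u4e16\u754c\u4f60\u597d".getBytes(Charset.forName("GBK"));
        String carrier = asByteString(original);

        // The round trip is lossless, which is why the output looks right
        System.out.println("restored exactly: " + Arrays.equals(original, backToBytes(carrier)));

        // Taking "the first character" actually takes one byte,
        // which is only half of the first Chinese character
        String cut = carrier.substring(0, 1);
        System.out.println("bytes in first 'character': " + backToBytes(cut).length);
    }
}
```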

[test 2-2]: read hello.gb2312.html: decoding as GB2312

string=Hello world ???? length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

The worst case is this one: on output, these '?' really are question marks, char(63). Once data ends up like this, it is beyond recovery.

[test 2-3]: read hello.utf8.html: decoding as UTF8

string=Hello world ???? length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS

char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS

char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS

char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

Great! Although the characters are displayed as '?', they have in fact been decoded correctly, as the Unicode mapping shows.
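The reason test 2-3 succeeds in any locale is that the encoding is named explicitly on both the writing and the reading side. A minimal sketch of that pattern follows; the class and method names are hypothetical, and a temporary file stands in for hello.utf8.html.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingIO {
    // Writes text as UTF-8 regardless of the platform default encoding
    static void write(File file, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // Reads the file back, decoding it as UTF-8 regardless of the platform default
    static String read(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Writes the text to a temporary file and reads it back
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("hello.utf8", ".html");
            try {
                write(file, text);
                return read(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Hello world \u4e16\u754c\u4f60\u597d";
        String back = roundTrip(text);
        System.out.println("length=" + back.length() + " equal=" + text.equals(back));
    }
}
```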

========Testing1: write hello world to files========

[test 1-1]: with system default encoding=GBK

string=Hello world 世界你好 length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS

char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS

char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS

char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

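Note: to run the tests above in a new locale, the source program must be recompiled. The very first byte-stream-to-character-stream decoding happens when javac compiles the source file. The biggest difference between this run and the previous one is whether the four characters "世界你好" in the source file were compiled into the program as 4 Chinese characters using the Chinese encoding, or byte by byte as 8 characters (which actually correspond to 8 bytes).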
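One way to remove this dependence on the compile-time locale, offered here only as a sketch and not as what the original program does, is to spell non-ASCII literals with \uXXXX escapes. The alternative is to tell the compiler the source encoding, for example with javac -encoding GBK HelloUnicode.java. The class name below is hypothetical.

```java
public class SourceEncodingDemo {
    // The test string with its Chinese part spelled as Unicode escapes, so the
    // compiler reads it identically under any source encoding
    static final String HELLO = "Hello world \u4e16\u754c\u4f60\u597d";

    public static void main(String[] args) {
        System.out.println("length=" + HELLO.length());
        System.out.println("char[12]=U+" + Integer.toHexString(HELLO.charAt(12)).toUpperCase());
    }
}
```

Compiled this way, the string holds 16 characters in every locale, which is the length=16 case shown for the Chinese locale above.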

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:

string=Hello world 世界你好 length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS

char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS

char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS

char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

In the Chinese locale, this decoding matches the default encoding used above, so the output is the same.

[test 1-3]: convert string to UTF8

string=Hello world 涓栫晫浣犲ソ length=18

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='涓' byte=-109 \uFFFFFF93 short=28051 \u6D93 CJK_UNIFIED_IDEOGRAPHS

char[13]='栫' byte=43 \u2B short=26667 \u682B CJK_UNIFIED_IDEOGRAPHS

char[14]='晫' byte=107 \u6B short=26219 \u666B CJK_UNIFIED_IDEOGRAPHS

char[15]='浣' byte=99 \u63 short=28003 \u6D63 CJK_UNIFIED_IDEOGRAPHS

char[16]='犲' byte=-78 \uFFFFFFB2 short=29362 \u72B2 CJK_UNIFIED_IDEOGRAPHS

char[17]='ソ' byte=-67 \uFFFFFFBD short=12477 \u30BD KATAKANA

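In fact, the terminal window used for this test is itself a GBK application, so this output is simply what Unicode text encoded as UTF-8 looks like when it is decoded as GBK.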
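The effect can be reproduced without depending on the terminal or the platform default by naming both charsets explicitly. This is a minimal sketch, assuming the JVM provides the GBK charset; Utf8AsGbk and its methods are names chosen for this example only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of UTF-8 bytes being read as GBK.
class Utf8AsGbk {
    // Encode as UTF-8, then (wrongly) decode those bytes as GBK.
    static String decodeUtf8BytesAsGbk(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("GBK"));
    }

    // Reverse the mistake: recover the raw bytes via GBK, decode as UTF-8.
    static String undo(String garbled) {
        byte[] raw = garbled.getBytes(Charset.forName("GBK"));
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes for the four Chinese characters of the test string.
        String chinese = "\u4E16\u754C\u4F60\u597D";
        String garbled = decodeUtf8BytesAsGbk(chinese);
        System.out.println("utf8 bytes=" + chinese.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("garbled length=" + garbled.length());
        System.out.println("recovered=" + chinese.equals(undo(garbled)));
    }
}
```

Each Chinese character takes three bytes in UTF-8, and GBK pairs the twelve bytes up two at a time, which is why four characters turn into six in the test 1-3 output above.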

========Testing2: reading and decoding from files========

[test 2-1]: read hello.orig.html: decoding with system default encoding

string=Hello world 世界你好 length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS

char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS

char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS

char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

[test 2-2]: read hello.gb2312.html: decoding as GB2312

string=Hello world 世界你好 length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS

char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS

char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS

char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

[test 2-3]: read hello.utf8.html: decoding as UTF8

string=Hello world 世界你好 length=16

char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN

char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN

char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN

char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN

char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN

char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN

char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN

char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN

char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS

char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS

char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS

char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

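Conclusion: if the back-end data is stored as Unicode and the character set used for encoding and decoding is then specified explicitly as needed, the application is almost unaffected by the character set settings of the environment in which the front-end application runs.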
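As a sketch of what "specifying the character set explicitly" can look like for file I/O, the example below wraps the byte streams in an OutputStreamWriter and an InputStreamReader with a named charset, instead of the FileWriter and FileReader used in the listing further down, which pick up the platform default. It assumes Java 8 or later; the class name, method names and temporary file are illustrative and not part of the original program.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: file I/O that never consults file.encoding.
class ExplicitCharsetFile {
    // Characters -> bytes, always encoded as UTF-8.
    static void write(File file, String content) {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes -> characters, always decoded as UTF-8.
    static String read(File file) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Write to a temporary file and read it back.
    static String roundTrip(String content) {
        try {
            File tmp = File.createTempFile("hello", ".utf8.html");
            try {
                write(tmp, content);
                return read(tmp);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for the four Chinese characters of the test string.
        String hello = "Hello world \u4E16\u754C\u4F60\u597D";
        System.out.println("round trip equal=" + hello.equals(roundTrip(hello)));
    }
}
```

Because both directions name UTF-8, the round trip should give the same string whether the JVM was started in an English or a Chinese locale.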

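Some conclusions from experiment 2:

Every application processes text as byte stream => character stream => byte stream:

byte_stream ==[input decoding]==> unicode_char_stream ==[output encoding]==> byte_stream;

In Java, every conversion from a byte stream to a character stream (or back) involves an implicit decoding or encoding step, which uses the system default encoding unless another one is specified;

The earliest byte-stream decoding happens as soon as javac compiles the source code;

A Java char is stored as a two-byte (16-bit) Unicode value;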
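The 16-bit width of char also explains the byte and short columns in the output above: casting to byte keeps only the low 8 bits, while casting to short keeps all 16. The sketch below shows this for two characters from the output. CharWidth and describe are made-up names, and unlike the original printCharArray it masks the values before formatting them as hex, so a negative byte prints as 93 rather than the sign-extended FFFFFF93.

```java
// Hypothetical illustration of narrowing a 16-bit char.
class CharWidth {
    static String describe(char c) {
        byte b = (byte) c;   // keeps only the low 8 bits of the char
        short s = (short) c; // keeps all 16 bits of the char
        // Mask before formatting so negative values are not sign-extended.
        return "byte=" + b + " hex=" + Integer.toHexString(b & 0xFF).toUpperCase()
                + " short=" + s + " hex=" + Integer.toHexString(s & 0xFFFF).toUpperCase();
    }

    public static void main(String[] args) {
        // First character of the Chinese text in tests 1-1 and 1-2.
        System.out.println(describe('\u4E16'));
        // First garbled character from test 1-3.
        System.out.println(describe('\u6D93'));
    }
}
```

The first call returns "byte=22 hex=16 short=19990 hex=4E16" and the second returns "byte=-109 hex=93 short=28051 hex=6D93", matching the decimal values in the output above.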

HelloUnicode.java source code

/*

* Copyright (c) 2002-2003 Che, Dong Email: chedongATbigfoot.com/chedongATchedong.com

* $Id: HelloUnicode.java,v 1.3 2003/03/09 08:41:46 chedong Exp $

*/

import java.io.BufferedReader;

import java.io.File;

import java.io.FileReader;

import java.io.FileWriter;

/**

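* Purpose:

* Test how different encoding and decoding methods affect the handling of multi-byte (Chinese) text

* Input:

* A test string may be supplied on the command line

* Output:

* Test 1: process the string with different decoding methods and write it to files with different encodings

* Test 2: read the string back from the files with different decoding methods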

* @author Che, Dong

*/

class HelloUnicode {

/**

* main entrance

* @param args command line arguments

*/

public static void main(String[] args) {

String hello = "Hello world 世界你好";

//read from command line input

if (args.length > 0) {

hello = args[0];

}

try {

/*

* Experiment 1: handle the test string with the system default encoding and write it to files

*/

System.out.println(">>>>testing1: write hello world to files<<<<");

System.out.println("[test 1-1]: with system default encoding="

+ System.getProperty("file.encoding") + "\nstring=" + hello

+ "\tlength=" + hello.length());

printCharArray(hello);

writeFile("hello.orig.html", hello);

//encode the string with the platform default encoding, then decode the bytes as GB2312

hello = new String(hello.getBytes(), "GB2312");

System.out.println(

"[test 1-2]: getBytes with platform default encoding and decoding as gb2312:\nstring="

+ hello + "\tlength=" + hello.length());

writeFile("hello.gb2312.html", hello);

printCharArray(hello);

//encode the string to UTF8 bytes, decode them with the platform default encoding, and print the resulting values

hello = new String(hello.getBytes("UTF8"));

System.out.println("[test 1-3]: convert string to UTF8\nstring="

+ hello + "\tlength=" + hello.length());

writeFile("hello.utf8.html", hello);

printCharArray(hello);

/*

* Experiment 2: read the files written by experiment 1 and decode them in different ways

*/

System.out.println(

">>>>testing2: reading and decoding from files<<<<");

//first file: encoding with system default

hello = readFile("hello.orig.html");

System.out.println(

"[test 2-1]: read hello.orig.html: decoding with system default encoding\nstring="

+ hello + "\tlength=" + hello.length());

printCharArray(hello);

//second file: decoding as GB2312

hello = readFile("hello.gb2312.html");

hello = new String(hello.getBytes(), "GB2312");

System.out.println(

"[test 2-2]: read hello.gb2312.html: decoding as GB2312\nstring="

+ hello + "\tlength=" + hello.length());

printCharArray(hello);

//third file: decoding from UTF8

hello = readFile("hello.utf8.html");

hello = new String(hello.getBytes(), "UTF8");

System.out.println(

"[test 2-3]: read hello.utf8.html: decoding as UTF8\nstring="

+ hello + "\tlength=" + hello.length());

printCharArray(hello);

} catch (Exception e) {

System.out.println(e.toString());

}

}

/**

* print char array

* @param inStr input string

*/

public static void printCharArray(String inStr) {

char[] myBuffer = inStr.toCharArray();

//list each character as byte value, short value, and UnicodeBlock mapping

for (int i = 0; i < inStr.length(); i++) {

byte b = (byte) myBuffer[i];

short s = (short) myBuffer[i];

String hexB = Integer.toHexString(b).toUpperCase();

String hexS = Integer.toHexString(s).toUpperCase();

StringBuffer sb = new StringBuffer();

//print char

sb.append("char[");

sb.append(i);

sb.append("]='");

sb.append(myBuffer[i]);

sb.append("'\t");

//byte value

sb.append("byte=");

sb.append(b);

sb.append(" \\u");

sb.append(hexB);

sb.append('\t');

//short value

sb.append("short=");

sb.append(s);

sb.append(" \\u");

sb.append(hexS);

sb.append('\t');

//Unicode Block

sb.append(Character.UnicodeBlock.of(myBuffer[i]));

System.out.println(sb.toString());

}

System.out.println();

}

/**

* write content to output file

* @param fileName output file name

* @param content file content to write

*/

private static void writeFile(String fileName, String content) {

try {

File tmpFile = new File(fileName);

if (tmpFile.exists()) {

tmpFile.delete();

}

FileWriter fw = new FileWriter(fileName, true);

fw.write(content);

fw.close();

} catch (Exception e) {

System.out.println(e.toString());

}

}

/**

* read content from input file

* @param fileName input file name

* @return String file content

*/

private static String readFile(String fileName) {

try {

BufferedReader fr = new BufferedReader(new FileReader(fileName));

StringBuffer out = new StringBuffer();

String thisLine = new String();

while (thisLine != null) {

thisLine = fr.readLine();

if (thisLine != null) {

out.append(thisLine);

}

}

fr.close();

return out.toString();

} catch (Exception e) {

System.out.print(e.toString());

return null;

}

}

}

 
 
 