
The package is attached.

See also

http://quijote.blog@bbs.nju.edu.cn

Author: quijote

Title: Handling Chinese characters in Python programs (2003.7.11)

Time: Wed Jun 11 10:47:43 2003


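Just a rough start to get the discussion going.

This is material I collected and organized a while ago. It is somewhat disorganized, but fairly comprehensive.

It covers Windows, Python 2.3 and PyQt. PyGTK and Tkinter, like PyQt, all use Unicode.

I think the best approach would be a library that uses a GB18030 mapping table directly.

I have collected a GB18030 mapping table of more than 830 KB, so the two tables for both directions come to more than 1.6 MB.

Reposted from linuxforum

Subject: Just learned a trick. [re: wang_jianqiang]

Posted by: xlp223 (newbie)

Posted on: 01/13/03 09:56

On Win2000 + SP3 with Python 2.2: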

from Tkinter import *

# "mbcs" is the Windows ANSI code page codec; decode the GB byte string to Unicode for Tk
w = Button(text="中国".decode("mbcs"), font="simhei", command='exit')
w.pack()
w.mainloop()

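This method treats the symptom, not the cause.

Sometimes I confuse the mbcs (GB) form of a string with its Unicode form.

Its drawback is that, because it relies on mbcs, it only works on Windows.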
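As a hedged illustration in present-day Python 3 (an assumption; the post itself targets Python 2.2), naming the codec explicitly avoids the dependence on mbcs. The helper name has_codec is made up for this sketch; gbk is a standard-library codec in Python 3.

```python
import codecs

def has_codec(name):
    """Return True if this interpreter can find a codec called `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

# Bytes as a GB-encoded program on Chinese Windows would hold them.
raw = "中国".encode("gbk")

# An explicitly named codec decodes the same way on every platform.
text = raw.decode("gbk")

# "mbcs" is only registered on Windows, so the first value is False elsewhere.
print(has_codec("mbcs"), has_codec("gbk"), text)
```

Checking has_codec("mbcs") before relying on it is one way to keep a script from failing with LookupError outside Windows.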

One solution is to install

http://sourceforge.net/projects/python-codecs/

A SourceForge project working on additional support for Asian codecs for use with Python. They are in the early stages of development at the time of this writing -- look in their FTP area for downloadable files.

(see Python Library Reference 4.9)

It can be used after slight modification.

(

Download 4 files:

eucgb23212utf.py (182K),

utf2eucgb2321.py (182K)

( http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python-codecs/practicecodecs/ChineseCodecs/chinesecn/Attic/ ),

eucgb2321_cn.py

( http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python-codecs/practicecodecs/ChineseCodecs/Python/ ),

test.py

There was a setup.py, but I did not know how to use it, so I made the changes by hand:

1. Replace EUCGB2321_CN with gb2312 everywhere, in both file names and file contents.

2. Add these lines at the end of aliases.py:

# eucgb2321_cn codec

'gb2312' : 'gb2312',

3. In c:\python22\lib\encodings, create a new directory chinesecn and put three files in it: gb23122utf.py (182K), utf2gb2312.py (182K), and an empty __init__.py.

4. Put gb2312.py (originally named eucgb2321_cn.py?) in the encodings directory.

)

Note (2003.7):

EUCGB2321_CN is a Chinese character encoding used on Unix.

Or simply download

http://bbs1.nju.edu.cn/file/gb2312.rar

directly.
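For comparison, a minimal Python 3 sketch (an assumption; the steps above are for Python 2.2) of adding a codec alias without editing aliases.py. It uses codecs.register with a search function. The alias name demo_gb_alias and the function name are hypothetical, and the gb2312 codec is the one shipped with Python 3, not the SourceForge files.

```python
import codecs

def _demo_alias_search(name):
    """Codec search function: map the made-up name to the built-in gb2312 codec."""
    if name == "demo_gb_alias":
        return codecs.lookup("gb2312")
    return None  # let other search functions handle every other name

# Register the search function; it is consulted when the stock lookup fails.
codecs.register(_demo_alias_search)

encoded = "大家好".encode("demo_gb_alias")
print(encoded == "大家好".encode("gb2312"))
print(encoded.decode("demo_gb_alias"))
```

The registration lasts for the life of the interpreter process, so nothing under the encodings directory has to be touched.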

------------------------------------------------------------------------

Run test.py:

gbstring = "大家好"

#print gbstring

uni = unicode(gbstring, "gb2312")

gstring = uni.encode("gb2312")

print "Original gb2312 encoded string:"

print gbstring

print "Transcode to Unicode encoding:"

print repr(uni)

print "Print as a gb2312 encoded string:"

print gstring

------------------------------------------------------------

Output:

Original gb2312 encoded string:

大家好

Transcode to Unicode encoding:

u'\u5927\u5bb6\u597d'

Print as a gb2312 encoded string:

大家好

------------------------------------------------------------------------------

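The drawback of this method is that it is a bit cumbersome (unicode(gbstring, "gb2312")),

and it only handles gb2312, not gb18030 (there is no unicode<-->gb18030 table).

I have collected a GB18030 mapping table of more than 830 KB, so the two tables for both directions come to more than 1.6 MB.

The advantage is that it is very portable: it works on Windows and Linux, and with Tkinter, PyQt, PyGTK and wxPython alike.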
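The gb2312 limitation can be seen in a short Python 3 sketch (an assumption; the post uses Python 2.2 with add-on codecs, while Python 3 ships both gb2312 and gb18030). The helper can_encode and the sample character are chosen here for illustration only.

```python
def can_encode(text, codec):
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Round trip of the string used in test.py.
gb_bytes = "大家好".encode("gb2312")
uni = gb_bytes.decode("gb2312")
print(ascii(uni))

# A traditional character: absent from GB 2312, present in GB 18030.
traditional = "龍"
print(can_encode(traditional, "gb2312"))
print(can_encode(traditional, "gb18030"))
```

Since gb18030 covers all of Unicode, it is the safer choice whenever the text may go beyond the GB 2312 set.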

---------------------------------------------------------------------------

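btw,

eucgb2321, 2321? 2312? That got me confused ^_^

EUCGB2321_CN is a Chinese character encoding used on Unix.

I originally used Mr. Du Wenshan's Chinese localization package ( http://dohao.org ), but he can no longer keep it up to date, so I had to look for another way.

Advice from a Python developer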

From: Martin v. Loewis (martin@v.loewis.de)

Subject: Re: Chinese language support of Python?

Newsgroups: comp.lang.python

Date: 2002-07-07 01:01:02 PST

guidance_shanghai@yahoo.com.cn (Leon Wang) writes:

> But still can not put Chinese directly as string in source, I can not
> live with so much \u... for a whole Chinese sensence/paragraph, it's
> impossible to read and edit them

This is a known problem, and it will be addressed with PEP 263
(http://www.python.org/peps/pep-0263.html).

Meanwhile, you have the following options:

- Don't use IDLE to edit Python source code (but, say, notepad), and
  only put Chinese text into string literals.

- Set the default encoding in site.py to the encoding you want to use.

- Apply patch
  http://sourceforge.net/tracker/index.php?func=detail&aid=508973&group_id=9579&atid=309579
  which allows you to declare the source encoding for IDLE.

In either case, you cannot use Chinese in Unicode literals. Instead,
you should always use

unicode("chinese string", "chinese encoding")

For portability, and if your editors support it, I recommend to use
UTF-8 as the "chinese encoding".

Regards,

Martin

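Another example, which runs under Python 2.3a1.

.encode("gb2312") is no longer needed.

It seems Python 2.3's Unicode support really has improved a lot.

This looks like the best solution at present.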

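!!! Note: the editor must save the file as UTF-8.

Files saved this way usually begin with a byte-order mark (EF BB BF for UTF-8; FF FE is the UTF-16 little-endian mark), and they will not run under Python 2.2!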
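For reference, the leading bytes can be checked directly. In the standard codecs module the UTF-8 signature is EF BB BF, while FF FE is the UTF-16 little-endian mark. The following Python 3 sketch (an assumption; the post targets Python 2.x) uses a made-up helper name, sniff_bom, and ignores UTF-32.

```python
import codecs

def sniff_bom(data):
    """Name the byte-order mark at the start of `data`, or return None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None

# A UTF-8 file as some Windows editors write it: signature first, then text.
sample = codecs.BOM_UTF8 + "大家好".encode("utf-8")
print(sniff_bom(sample))

# The "utf-8-sig" codec strips the signature while decoding.
print(sample.decode("utf-8-sig"))
```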

Someone pointed out to me that Windows fonts can be used.

exunicode.py

# -*- coding: utf-8 -*-

from Tkinter import *

w = Button(text="大家好",font=("SIMSUN",8,'bold'), command='exit')

w.pack()

w.mainloop()

3 PEP 263: Source Code Encodings

Python source files can now be declared as being in different character set encodings. Encodings are declared by including a specially formatted comment in the first or second line of the source file. For example, a UTF-8 file can be declared with:

#!/usr/bin/env python

# -*- coding: UTF-8 -*-

Without such an encoding declaration, the default encoding used is ISO-8859-1, also known as Latin1.

The encoding declaration only affects Unicode string literals; the text in the source code will be converted to Unicode using the specified encoding. Note that Python identifiers are still restricted to ASCII characters, so you can't have variable names that use characters outside of the usual alphanumerics.
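The effect of the declaration can be tried without saving a file. This is a Python 3 sketch (an assumption; the text above describes Python 2.3, and Python 3 differs in that UTF-8 is the default and every string literal is Unicode). It relies on compile() accepting source as bytes and honoring the coding comment; the file name <gbk_demo> and the variable names are made up.

```python
# Source code held as GBK bytes, with a PEP 263 declaration on its first line.
source = '# -*- coding: gbk -*-\ngreeting = "大家好"\n'.encode("gbk")

# compile() reads the declaration and decodes the bytes as GBK before parsing.
namespace = {}
exec(compile(source, "<gbk_demo>", "exec"), namespace)

print(namespace["greeting"])
```

Without the first line the same bytes would be read as UTF-8 and rejected, which is the failure the declaration exists to prevent.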

I have just started learning PyQt on Win2k,

with Du Wenshan's Chinese MBCSP package installed, http://dohao.org

Some exercises:

>>> u"我们"

u'\xce\xd2\xc3\xc7'

>>> u'阿啊'

u'\xb0\xa2\xb0\xa1'

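# Note: these are not Unicode code points but gb2312 bytes.

1. Hanzi areas. These comprise:

a. The GB 2312 hanzi area, i.e. GBK/2: B0A1-F7FE. It contains the 6763 GB 2312 hanzi in their original order.

b. The GB 13000.1 extension hanzi area, comprising:

(1) GBK/3: 8140-A0FE. Contains 6080 CJK hanzi from GB 13000.1.

(2) GBK/4: AA40-FEA0. Contains 8160 CJK hanzi and supplementary hanzi. The CJK hanzi come first, ordered by UCS code; the supplementary hanzi (including radicals and components) follow, ordered by page and position in the Kangxi Dictionary.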
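The three areas can be turned into a small classifier. This Python 3 sketch assumes that each range reads as a lead-byte range combined with a trail-byte range (for example B0A1-F7FE means lead B0-F7, trail A1-FE) and that trail byte 7F is unused, as in the GBK layout; the function name gbk_hanzi_area is made up.

```python
def gbk_hanzi_area(pair):
    """Classify a two-byte GBK code as GBK/2, GBK/3 or GBK/4, else None."""
    if len(pair) != 2:
        return None
    lead, trail = pair[0], pair[1]
    # GBK/2: the original GB 2312 hanzi, B0A1-F7FE.
    if 0xB0 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE:
        return "GBK/2"
    # GBK/3: 8140-A0FE (trail byte 7F assumed unused).
    if 0x81 <= lead <= 0xA0 and 0x40 <= trail <= 0xFE and trail != 0x7F:
        return "GBK/3"
    # GBK/4: AA40-FEA0 (trail byte 7F assumed unused).
    if 0xAA <= lead <= 0xFE and 0x40 <= trail <= 0xA0 and trail != 0x7F:
        return "GBK/4"
    return None

print(gbk_hanzi_area("啊".encode("gbk")))
print(gbk_hanzi_area(b"\x81\x40"))
print(gbk_hanzi_area(b"\xaa\x40"))
```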

This may be due to the Simplified Chinese edition of Win2k; I guess the multilingual edition of Win2k would not have this problem.

Fortunately,

>>> s="我们"

>>> unicode(s)

u'\u6211\u4eec'
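The difference between the two results can be reproduced in Python 3 (an assumption; the sessions above are Python 2.2). A likely explanation, not stated in the post, is that each GB byte of the literal was taken as one Latin-1 character. The variable names below are illustrative.

```python
# The GB bytes of the two characters, as the console would have supplied them.
gb_bytes = "我们".encode("gbk")

# Reading each byte as one Latin-1 character gives the garbled "Unicode" string.
mojibake = gb_bytes.decode("latin-1")
print(ascii(mojibake))

# Undo the wrong step, then decode with the right codec.
repaired = mojibake.encode("latin-1").decode("gbk")
print(ascii(repaired))
```

The same encode-then-decode repair applies to any string that was decoded byte by byte with the wrong single-byte codec.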

Here is the PyQt program:

from qt import QString

s="A string that contains just ASCII characters"

u=u"\u963f\u554a - a string with a few chinese characters"

qs=QString(s)

qu=QString(u)

print str(qs)

print str(qu)

Output:

>C:\Python22\pythonw -u unicode1.py

A string that contains just ASCII characters

阿啊 - a string with a few chinese characters

>Exit code: 0

An improved approach:

from qt import QString

s="A string that contains just ASCII characters"

#u=u"\u963f\u554a - a string with a few chinese characters"

u1="我们 a string with a few chinese characters"

#u=unicode(u1)

qs=QString(s)

qu=QString(unicode("我们--a string with a few chines" ))

print str(qs)

print str(qu)

Output:

>C:\Python22\pythonw -u unicode1.py

A string that contains just ASCII characters

我们--a string with a few chines

>Exit code: 0

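Also, when you design the interface with Qt Designer, it generates a *.ui file in UTF-8 format,

which is converted into a Python program with qtuic.exe in the Python directory.

Also, Wenshan's patch seems, for some reason, to lack sys.setappdefaultencoding()?

Appendix: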

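Python Multi-Byte Character Support Package (MBCSP) 1.0

MBCSP is a multi-byte character support package for the latest Python 2.2.1. Its goal is to thoroughly solve the display problems with multi-byte characters in Python. Stock Python often displays multi-byte characters such as Chinese, Korean or Japanese incorrectly, and you frequently see strings like "\xc4\xe3\xba\xc3". This happens especially when working with databases, and it makes results inconvenient to inspect even though the operation itself is not wrong. I edited the Python 2.2.1 source files to produce MBCSP 1.0. It is fully compatible with Python 2.2.1 and strengthens its character handling.

There are two ways to install MBCSP, and both require Python 2.2.1 to be installed first. If you want to run an installer, download mbcsp100-py221.exe and simply follow its steps. The second way takes three steps:

1. Download python22.dll and replace the original file of the same name, usually located in the system/system32 folder of the Windows installation directory. After replacing it, run python and you will see an extra line at the top of the window:

"With MultiByte Character Surport Surplied by dohao.org"

This means your Python now supports multi-byte characters.

2. Download site.py and replace the file of the same name in the lib folder of the Python installation directory. This adds multi-byte character support to some applications, such as IDLE.

3. If you use IDLE often, download OutputWindow.py and IOBinding.py and replace the files of the same name in the tools\idle folder of the Python installation directory. IDLE will then display multi-byte characters correctly.

Note: after installation, display Chinese characters in Tkinter like this:

Tkinter.Label(text=unicode("中文汉字"))

The files above are for Windows. Once installation is complete, you can use multi-byte characters in the names of your variables, classes, functions and so on, and multi-byte characters from a database will display correctly. If you need files for Linux, or for Python 2.1 or earlier, please let me know and I will add them here.

New: MBCSP100-py213.zip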
