Handling GBK-encoded characters in Perl

王朝perl · author unknown · 2006-01-09

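# Return the length of a string, counted in characters

use String::Multibyte;

$gbk_str = "上大";    # GBK bytes only if this script is saved as GBK

$gbk = String::Multibyte->new('GBK');    # handler for the GBK charset

$gbk_len = $gbk->length($gbk_str);    # length in characters, not bytes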
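The snippet above gives the expected answer only when the script itself is stored as GBK. The following is a minimal sketch, assuming String::Multibyte is installed from CPAN; it uses the core Encode module to build the GBK bytes from code points, and the variable names are illustrative only. It contrasts the byte length with the character length.

```perl
use strict;
use warnings;
use Encode qw(encode);
use String::Multibyte;

# Build the GBK bytes for "上大" from code points, so the result does
# not depend on the encoding in which this script is saved.
my $gbk_str = encode('GBK', "\x{4E0A}\x{5927}");

my $gbk = String::Multibyte->new('GBK');

my $byte_len = length $gbk_str;           # CORE::length counts bytes: 4
my $char_len = $gbk->length($gbk_str);    # counts GBK characters: 2

print "bytes: $byte_len, characters: $char_len\n";
```

Each of the two characters occupies two bytes in GBK, which is why the two counts differ.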

Constructor

$mbcs = String::Multibyte->new(CHARSET)

$mbcs = String::Multibyte->new(CHARSET, VERBOSE)

CHARSET is the charset name, or more precisely the file name of the definition file (without the .pm suffix). The constructor returns an instance that tells the methods in which charset the specified strings should be handled.

CHARSET may also be a hashref; this is how to define a charset without a .pm file.

# see perlfaq6 :-)

my $martian = String::Multibyte->new({

charset => "martian",

regexp => '[A-Z][A-Z]|[^A-Z]',

});

If a true value is specified as VERBOSE, the called method (except islegal) checks its arguments and carps if any of them is not legally encoded.

Otherwise no such check is carried out. This saves a bit of time but is unsafe, though you can call the islegal method yourself when necessary.

Check Whether the String is Legal

Check whether a string is legally encoded GBK

$mbcs->islegal(LIST)

Returns a boolean indicating whether all the strings in the arguments are legally encoded in the concerned charset. Returns false if even one element of LIST is illegal.
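As an illustration of the check, here is a minimal sketch, assuming String::Multibyte is installed from CPAN. The test strings are made up for the example: one is a complete GBK string, the other is the same string cut in the middle of its second character.

```perl
use strict;
use warnings;
use Encode qw(encode);
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');

# Two complete double-byte characters (4 bytes).
my $good = encode('GBK', "\x{4E0A}\x{5927}");

# Keep only 3 bytes: the second character loses its trail byte.
my $bad = substr($good, 0, 3);

my $good_ok = $gbk->islegal($good)       ? 1 : 0;   # 1
my $bad_ok  = $gbk->islegal($bad)        ? 1 : 0;   # 0
my $both_ok = $gbk->islegal($good, $bad) ? 1 : 0;   # 0: one bad element fails the list

print "good: $good_ok, bad: $bad_ok, both: $both_ok\n";
```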

Length

$mbcs->length(STRING)

Returns the length in characters of the specified string.

Reverse

Reverse a string

$mbcs->strrev(STRING)

Returns a reversed string in characters.
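A minimal sketch of the difference between a character-wise and a byte-wise reversal, assuming String::Multibyte is installed from CPAN (the sample text and variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');
my $str = encode('GBK', "\x{4E0A}\x{5927}");    # 上大

# Character-wise reversal keeps each double-byte character intact.
my $rev = $gbk->strrev($str);

# Byte-wise reversal swaps lead and trail bytes and corrupts the text.
my $naive = reverse $str;

my $rev_is_swapped = decode('GBK', $rev) eq "\x{5927}\x{4E0A}" ? 1 : 0;   # 大上
my $naive_differs  = $naive ne $rev ? 1 : 0;

print "strrev ok: $rev_is_swapped, byte reverse differs: $naive_differs\n";
```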

Search

Forward search

$mbcs->index(STRING, SUBSTR)

$mbcs->index(STRING, SUBSTR, POSITION)

Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.

If the substring is not found, returns -1.

Reverse search

$mbcs->rindex(STRING, SUBSTR)

$mbcs->rindex(STRING, SUBSTR, POSITION)

Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position.

If the substring is not found, returns -1.
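The positions returned by index and rindex are counted in characters, not bytes. A minimal sketch, assuming String::Multibyte is installed from CPAN (the sample string is made up for the example):

```perl
use strict;
use warnings;
use Encode qw(encode);
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');

# "Perl" followed by 上 大 上: 7 characters, 10 bytes.
my $str = encode('GBK', "Perl\x{4E0A}\x{5927}\x{4E0A}");
my $sub = encode('GBK', "\x{4E0A}");            # 上

my $first     = $gbk->index($str, $sub);        # 4
my $last      = $gbk->rindex($str, $sub);       # 6
my $byte_last = rindex($str, $sub);             # 8: CORE::rindex counts bytes

my $from_5    = $gbk->index($str, $sub, 5);     # 6: first match at or after 5
my $upto_5    = $gbk->rindex($str, $sub, 5);    # 4: last match at or before 5

my $missing   = $gbk->index($str, encode('GBK', "\x{4E2D}"));   # -1: 中 is absent

print "$first $last $byte_last $from_5 $upto_5 $missing\n";
```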

$mbcs->strspn(STRING, SEARCHLIST)

Finds the position of the first character in the first string that does not belong to the set of characters given by the second string.

Returns the position of the first occurrence of any character not contained in the search list.

$mbcs->strspn("+0.12345*12", "+-.0123456789");

# returns 8.

If the first character of the string is not in the search list, returns 0.

If the string consists only of characters in the search list, the returned value equals the length of the string.

SEARCHLIST can be an ARRAYREF. e.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also ok since the character order of SEARCHLIST doesn't matter in strspn).

$mbcs->strcspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character contained in the search list.

If the specified string does not contain any character in the search list, the returned value equals the length of the string.

SEARCHLIST can be an ARRAYREF. e.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also ok since the character order of SEARCHLIST doesn't matter in strcspn).
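The two functions are complements of each other. A minimal sketch using the 'Bytes' charset, assuming String::Multibyte is installed from CPAN; the expression string is the one from the strspn example, and the other search lists are made up for illustration:

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');
my $expr  = "+0.12345*12";

# Length of the leading run made only of sign, dot and digit characters.
my $span = $bytes->strspn($expr, "+-.0123456789");    # 8

# Position of the first operator character.
my $cspan = $bytes->strcspn($expr, "*/");             # 8

# No character of the search list occurs: the result is the string length.
my $none = $bytes->strcspn($expr, "xyz");             # 11

# The span can be used to cut out the leading number.
my $number = $bytes->substr($expr, 0, $span);         # "+0.12345"

print "$span $cspan $none $number\n";
```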

Substring

Extract or replace a substring

$mbcs->substr(STRING or SCALAR REF, OFFSET)

$mbcs->substr(STRING or SCALAR REF, OFFSET, LENGTH)

$mbcs->substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)

It works like CORE::substr, but using character semantics of multibyte charset encoding.

If the REPLACEMENT as the fourth argument is specified, replaces parts of the SCALAR and returns what was there before.

You can utilize the lvalue reference, returned if a reference of scalar variable is used as the first argument.

${ $mbcs->substr(\$str,$off,$len) } = $replace;

works like

CORE::substr($str,$off,$len) = $replace;

The returned lvalue is byte-oriented, not multibyte character-oriented, so successive assignments may lead to odd results.
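A minimal sketch of both forms, assuming String::Multibyte is installed from CPAN. The sample string and the replacement character are made up for the example; the replacement has the same byte length as the character it replaces, which sidesteps the byte-oriented caveat above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');
my $str = encode('GBK', "Perl\x{4E0A}\x{5927}");    # Perl上大

# Offsets and lengths are counted in characters.
my $tail = $gbk->substr($str, 4);       # 上大
my $one  = $gbk->substr($str, 5, 1);    # 大

# Passing a reference yields an lvalue reference: assign through it
# to replace 上 (character 4) with 中 in $str itself.
${ $gbk->substr(\$str, 4, 1) } = encode('GBK', "\x{4E2D}");

my $tail_text = decode('GBK', $tail);
my $one_text  = decode('GBK', $one);
my $new_text  = decode('GBK', $str);    # Perl中大

print length($tail_text), " ", length($one_text), " ", length($new_text), "\n";
```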

Split

Split a string

$mbcs->strsplit(SEPARATOR, STRING)

$mbcs->strsplit(SEPARATOR, STRING, LIMIT)

This function emulates CORE::split, but splits on the SEPARATOR string, not by a pattern.

If not in list context, it returns only the number of fields found and does not split into the @_ array.

If empty string is specified as SEPARATOR, splits the specified string into characters.

$bytes->strsplit('', 'This is perl.', 7);

# ('T', 'h', 'i', 's', ' ', 'i', 's perl.')
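A character-aware split matters in GBK because the trail byte of a double-byte character can coincide with an ASCII separator. A minimal sketch, assuming String::Multibyte is installed from CPAN; the record is made up, and "\x81\x7C" is written as raw bytes because its trail byte 0x7C is the same byte as ASCII '|':

```perl
use strict;
use warnings;
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');

# Three fields separated by '|'; the middle field is a single
# double-byte GBK character whose second byte is also 0x7C.
my $record = "a|\x81\x7C|b";

# Byte-oriented split cuts the double-byte character in half.
my @naive = split /\|/, $record;                 # 4 fields

# Character-oriented split leaves the character intact.
my @fields = $gbk->strsplit('|', $record);       # 3 fields

# In scalar context only the number of fields is returned.
my $count = $gbk->strsplit('|', $record);        # 3

print scalar(@naive), " ", scalar(@fields), " ", $count, "\n";
```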

Character Range

Return the list of all characters within a given range of character codes

$mbcs->mkrange(CHARLIST, ALLOW_REVERSE)

Returns the character list (not in list context, as a concatenated string) gained by parsing the specified character range.

The result depends on the character order for the concerned charset. For the character order of each charset, see its definition file.

If the character order is undefined in the definition file, returns a string identical to the specified string.

A character range is specified with a hyphen ('-', but exactly speaking, $obj->{hyphen}).

The backslashed combinations '\-' and '\\' (exactly speaking, "$obj->{escape}$obj->{hyphen}" and "$obj->{escape}$obj->{escape}") are used instead of the characters '-' and '\', respectively. The hyphen at the beginning or the end of the range is also evaluated as the hyphen itself.

For example, $mbcs->mkrange('+\-0-9A-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F') and scalar $mbcs->mkrange('A-P') returns 'ABCDEFGHIJKLMNOP'.

If true value is specified as the second argument, reverse character ranges such as '9-0', 'Z-A' are allowed.

$bytes = String::Multibyte->new('Bytes');

$bytes->mkrange('p-e-r-l', 1); # ponmlkjihgfefghijklmnopqrqponml

Transliteration

Search and replace

$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)

$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)

Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.

If a reference of scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.

If the 'h' modifier is specified, returns a histogram hash in list context, or a reference to a histogram hash in scalar context.

SEARCHLIST and REPLACEMENTLIST

Character ranges (internally utilizing mkrange()) are supported.

If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of uninitialized value causes warning under -w option), the SEARCHLIST is replicated.

If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).

SEARCHLIST and REPLACEMENTLIST can be an ARRAYREF. e.g. if a charset treats "\r\n" (CRLF) as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" should be given as ["\r", "\n"]. Of course "\n\r" is also ok but the character order is different; cf. strtr($str, ["\r", "\n"], ["\n", "\r"]) that swaps "\n" and "\r".

Each element of the ARRAYREF can include character ranges (the modifiers R and r affect their evaluation as usual).

["A-C", "h-z"] is evaluated like "A-Ch-z" if charset does not include grapheme "Ch". The former prevents "C" and "h" from evaluation as "Ch" even if the charset included grapheme "Ch".

MODIFIER

c Complement the SEARCHLIST.

d Delete found but unreplaced characters.

s Squash duplicate replaced characters.

h Return a hash (or a hashref) of histogram.

R No use of character ranges.

r Allow reverse character ranges.

o Caches the conversion table internally.

If 'R' modifier is specified, '-' is not evaluated as a meta character but hyphen itself like in tr'''. Compare:

$mbcs->strtr("90 - 32 = 58", "0-9", "A-J");

# output: "JA - DC = FI"

$mbcs->strtr("90 - 32 = 58", "0-9", "A-J", "R");

# output: "JA - 32 = 58"

# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';

# '0' to 'A', '-' to '-', and '9' to 'J'.

If 'r' modifier is specified, reverse character ranges are allowed. e.g.

$mbcs->strtr($str, "0-9", "9-0", "r")

is equivalent to

$mbcs->strtr($str, "0123456789", "9876543210")

Caching the conversion table

If 'o' modifier is specified, the conversion table is cached internally. e.g.

foreach (@source_strings) {

print $mbcs->strtr($_, $from_list, $to_list, 'o');

}

will be almost as efficient as this:

$trans = $mbcs->trclosure($from_list, $to_list);

foreach (@source_strings) {

print &$trans($_);

}

You can use whichever you like.

Without 'o',

foreach (@source_strings) {

print $mbcs->strtr($_, $from_list, $to_list);

}

will be very slow since the conversion table is made whenever the function is called.

Generation of the Closure to Transliterate

Return a reference to a function that applies a transliteration rule

$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST)

$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)

Returns a closure to transliterate the specified string. The return value is a plain code reference, not a blessed object. By using this code ref, you can save yourself time as you need not specify the arguments every time.

my $trans = $mbcs->trclosure($from_list, $to_list);

print &$trans ($string); # ok to perl 5.003

print $trans->($string); # perl 5.004 or better

The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.

SEARCHLIST and REPLACEMENTLIST can be an ARRAYREF same as strtr().

CAVEAT


$[

This module assumes $[ is always equal to 0, never 1.

Grapheme manipulation

Since v. 1.01, manipulation of sequences of graphemes is to be supported.

In a grapheme-oriented manipulation, notice that the beginning and the end of a string are always on a grapheme boundary.

E.g. imagine a grapheme set where a grapheme comprises either a leading Latin capital letter followed by one or more Latin small letters, or a single byte. Such a set can be defined as below.

$gra = String::Multibyte->new({

regexp => '[A-Z][a-z]*|[\x00-\xFF]',

});

Think about $gra->index("Perl", "Pe"). As both "Perl" and "Pe" are a single grapheme, they are not equal to each other. So the result of this must be -1 (meaning no match).
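The behaviour described above can be tried with a minimal sketch, assuming String::Multibyte v1.01 or later is installed from CPAN. The charset label 'capword' and the test string "PerlScript" are arbitrary choices for this example, not part of the module.

```perl
use strict;
use warnings;
use String::Multibyte;

# Same grapheme definition as above; 'capword' is only a label.
my $gra = String::Multibyte->new({
    charset => 'capword',
    regexp  => '[A-Z][a-z]*|[\x00-\xFF]',
});

# "PerlScript" consists of two graphemes: "Perl" and "Script".
my $len = $gra->length("PerlScript");              # 2

# "Pe" and "Perl" are different single graphemes, so there is no match.
my $miss = $gra->index("Perl", "Pe");              # -1

# "Script" is the second grapheme, at position 1.
my $hit = $gra->index("PerlScript", "Script");     # 1

print "$len $miss $hit\n";
```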
