Regular Expressions

Regular Expressions, Part 1:

-----------------

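Regular expressions (REs) are often wrongly regarded as a mysterious language that only a few people understand. On the surface they do look like gibberish: if you do not know the syntax, the code is just a pile of meaningless characters. In fact, regular expressions are quite simple and easy to understand. After reading this article you will be familiar with their common syntax.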

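Support on many platforms

Regular expressions were first proposed by the mathematician Stephen Kleene in 1956, building on his earlier research into natural language. As a complete syntax, regular expressions were used for pattern matching in strings and were later adopted by the emerging field of information technology. Since then they have gone through several stages of development, and the current standard has been approved by ISO (the International Organization for Standardization) and defined by The Open Group.

A regular expression is not a dedicated programming language; it is a standard for finding and replacing text in a file or string. The standard defines two flavors: basic regular expressions (BRE) and extended regular expressions (ERE). ERE includes the functionality of BRE plus some additional concepts.

Many programs use regular expressions, including xsh, egrep, sed, vi and other programs on UNIX platforms. Many languages have adopted them as well, such as HTML and XML, although these adoptions usually cover only a subset of the full standard.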

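More common than you think

As regular expressions have been ported to cross-platform programming languages, their functionality has become more complete and their use more widespread. Web search engines use them, e-mail programs use them, and even if you are not a UNIX programmer you can use regular expressions to simplify your programs and shorten your development time.

Regular Expressions 101

Much of the regular expression syntax will look familiar even if you have never studied it before. Wildcards, for example, are one kind of RE construct, namely repetition operators. Let us first look at the most common basic syntax types of the ERE standard. To provide examples with specific uses, I will use several different programs.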

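Part 2:

----------------------

Character matching

The key to a regular expression is specifying what you want to search for and match; without that concept, REs would be useless.

Every expression contains instructions about what to look for, as shown in Table A.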

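Table A: Character-matching regular expressions

----------------

Operator: .
Meaning: Match any one character
Example: grep '.ord' sample.txt
Result: Will match "ford", "lord", "2ord", etc. in the file sample.txt.

-----------------

Operator: [ ]
Meaning: Match any one character listed between the brackets
Example: grep '[cng]ord' sample.txt
Result: Will match only "cord", "nord", and "gord"

---------------------

Operator: [^ ]
Meaning: Match any one character not listed between the brackets
Example: grep '[^cn]ord' sample.txt
Result: Will match "lord", "2ord", etc. but not "cord" or "nord"

---------------------

Operator: [a-z] (a range inside brackets)
Meaning: Match any one character in the range, or not in it when the list starts with ^
Example: grep '[a-zA-Z]ord' sample.txt
Result: Will match "aord", "bord", "Aord", "Bord", etc.
Example: grep '[^0-9]ord' sample.txt
Result: Will match "Aord", "aord", etc. but not "2ord", etc.

The patterns are quoted so that the shell does not expand the brackets as file name wildcards before grep sees them.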
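The operators in Table A behave the same way in most regex engines. As a minimal sketch, assuming Java's java.util.regex package instead of grep (the class name CharClassDemo, the helper grep and the word list are made up for illustration), the patterns can be tried like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Returns the words that contain a match for the regex,
    // similar to how grep prints the lines that contain a match.
    static List<String> grep(String regex, String... words) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (p.matcher(w).find()) {
                hits.add(w);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] words = {"ford", "cord", "nord", "2ord", "ord"};
        System.out.println(grep(".ord", words));      // any one character before "ord"
        System.out.println(grep("[cng]ord", words));  // only c, n or g before "ord"
        System.out.println(grep("[^0-9]ord", words)); // any character except a digit
    }
}
```

Note that the bare word "ord" is never reported: each of these operators has to consume exactly one character.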

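Repetition operators

Repetition operators, or quantifiers, describe how many times a particular element should be matched. They are often combined with the character-matching syntax to find strings of several characters; see Table B.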

Table B: Regular expression repetition operators

----------------

Operator: ?
Meaning: Match the preceding element zero or one time
Example: egrep ".?erd" sample.txt
Result: Will match "berd", "herd", etc. and "erd"

------------------

Operator: *
Meaning: Match the preceding element zero or more times
Example: egrep "n.*rd" sample.txt
Result: Will match "nerd", "nrd", "neard", etc.

-------------------

Operator: +
Meaning: Match the preceding element one or more times
Example: egrep "[n]+erd" sample.txt
Result: Will match "nerd", "nnerd", etc., but not "erd"

--------------------

Operator: {n}
Meaning: Match the preceding element exactly n times
Example: egrep "[a-z]{2}erd" sample.txt
Result: Will match "cherd", "blerd", etc. but not "nerd" or "erd". The matched text is always two letters plus "erd", so in "buzzerd" only "zzerd" is matched.

------------------------

Operator: {n,}
Meaning: Match the preceding element at least n times
Example: egrep ".{2,}erd" sample.txt
Result: Will match "cherd" and "buzzerd", but not "nerd"

------------------------

Operator: {n,N}
Meaning: Match the preceding element at least n times, but not more than N times
Example: egrep "n[e]{1,2}rd" sample.txt
Result: Will match "nerd" and "neerd"
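The quantifiers in Table B can also be checked programmatically. The following is only a sketch, assuming Java's java.util.regex and whole-word matching with Pattern.matches (stricter than egrep, which looks for a match anywhere in the line); the class name QuantifierDemo and the helper whole are hypothetical:

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True only when the entire word matches the regex.
    static boolean whole(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        System.out.println(whole(".?erd", "erd"));            // true: ? allows zero occurrences
        System.out.println(whole("n.*rd", "neard"));          // true: .* matches "ea"
        System.out.println(whole("n+erd", "erd"));            // false: + needs at least one n
        System.out.println(whole("[a-z]{2}erd", "cherd"));    // true: exactly two letters
        System.out.println(whole("[a-z]{2}erd", "buzzerd"));  // false: four letters before "erd"
        System.out.println(whole(".{2,}erd", "buzzerd"));     // true: at least two characters
        System.out.println(whole("ne{1,2}rd", "neeerd"));     // false: three e's exceed the limit
    }
}
```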

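Part 3:

----------------

Anchors

An anchor specifies where in the text a pattern must match, as shown in Table C. Anchors make it easier to find common combinations of characters. For the examples I use the vi line editor command :s, which stands for substitute. The basic syntax of this command is:

s/pattern_to_match/pattern_to_substitute/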

Table C: Regular expression anchors

-------------

Operator: ^
Meaning: Match at the beginning of a line
Example: s/^/blah /
Result: Inserts "blah " at the beginning of the line

---------------

Operator: $
Meaning: Match at the end of a line
Example: s/$/ blah/
Result: Inserts " blah" at the end of the line

---------------

Operator: \<
Meaning: Match at the beginning of a word
Example: s/\</blah/
Result: Inserts "blah" at the beginning of the word
Example: egrep "\<blah" sample.txt
Result: Matches "blahfield", etc.

------------------

Operator: \>
Meaning: Match at the end of a word
Example: s/\>/blah/
Result: Inserts "blah" at the end of the word
Example: egrep "blah\>" sample.txt
Result: Matches "soupblah", etc.

---------------

Operator: \b
Meaning: Match at the beginning or end of a word
Example: egrep "\bblah" sample.txt
Result: Matches "blahcake" (word boundary before "blah")
Example: egrep "blah\b" sample.txt
Result: Matches "countblah" (word boundary after "blah")

-----------------

Operator: \B
Meaning: Match in the middle of a word
Example: egrep "\Bblah" sample.txt
Result: Matches "sublahper", etc.
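For comparison, here is a rough sketch of the same anchors in Java's java.util.regex (an assumption of this example, since the table uses vi and egrep). Java supports ^, $, \b and \B but has no \< or \> anchors; there \< simply matches a literal "<". The class name AnchorDemo and the helper found are made up:

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // True when the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // ^ and $ match positions, not characters, so replacing them inserts text.
        System.out.println("line".replaceAll("^", "blah "));  // blah line
        System.out.println("line".replaceAll("$", " blah"));  // line blah

        System.out.println(found("\\bblah", "blahcake"));   // true: boundary before "blah"
        System.out.println(found("\\bblah", "countblah"));  // false: "blah" is not at a word start
        System.out.println(found("blah\\b", "countblah"));  // true: boundary after "blah"
        System.out.println(found("\\Bblah", "sublahper"));  // true: "blah" starts inside a word
        System.out.println(found("\\Bblah", "blahcake"));   // false: "blah" starts the word
    }
}
```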

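Alternation

Another handy feature of REs is the alternation (or interval) operator. In effect it works as an OR statement and is written with the | symbol. The following statement returns the occurrences of "nerd" and "merd" in the file sample.txt:

egrep "(n|m)erd" sample.txt

Alternation is very powerful, especially when you are searching a file for different spellings, but in this case you can get the same result with the following example:

egrep "[nm]erd" sample.txt

The real usefulness of alternation shows when you combine it with the more advanced features of REs.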
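To illustrate the difference, the sketch below (assuming Java's java.util.regex; the class name AlternationDemo and the sample words are hypothetical) shows that a bracket list can only choose between single characters, while alternation can choose between whole strings:

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // True only when the entire word matches the regex.
    static boolean whole(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        // For single characters the two forms are interchangeable.
        System.out.println(whole("(n|m)erd", "merd"));  // true
        System.out.println(whole("[nm]erd", "merd"));   // true

        // Alternation between multi-character strings has no bracket equivalent.
        System.out.println(whole("(nerd|geek)s?", "geeks"));  // true
        System.out.println(whole("(nerd|geek)s?", "nerk"));   // false
    }
}
```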

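Part 4:

----------------

Some reserved characters

The last major feature of REs is reserved characters (also called special characters). For example, suppose you want to find the literal strings "ne*rd" and "ni*rd". The pattern "n[ei]*rd" matches "neeeeerd" and "nieieierd", but those are not the strings you are looking for. Because '*' (the asterisk) is a reserved character, you must escape it with a backslash, like this: "n[ei]\*rd". The other reserved characters include: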

^ (caret)

. (period)

[ (left bracket)

$ (dollar sign)

( (left parenthesis)

) (right parenthesis)

| (pipe)

* (asterisk)

+ (plus symbol)

? (question mark)

{ (left curly bracket, or left brace)

\ (backslash)

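Once you include these characters in your search patterns, REs undoubtedly become very hard to read. For example, the following PHP eregi call is difficult to read.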

eregi("^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$",$sendto)

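As you can see, the intent of the code is hard to grasp. But if you skip over the reserved characters, you will often misread what the code means.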
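The effect of escaping can be seen in a small sketch. It assumes Java's java.util.regex rather than PHP, and the class name EscapeDemo is hypothetical; Pattern.quote is the standard Java method for treating a whole string as literal text. In Java source the backslash itself must be doubled inside a string literal:

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // True only when the entire text matches the regex.
    static boolean whole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Unescaped, * is a repetition operator.
        System.out.println(whole("n[ei]*rd", "neeeeerd"));   // true
        System.out.println(whole("n[ei]*rd", "ne*rd"));      // false

        // Escaped, \* matches a literal asterisk.
        System.out.println(whole("n[ei]\\*rd", "ne*rd"));    // true
        System.out.println(whole("n[ei]\\*rd", "neeeeerd")); // false

        // Pattern.quote escapes every reserved character in a string at once.
        System.out.println(whole(Pattern.quote("ne*rd"), "ne*rd"));  // true
    }
}
```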

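Summary

In this article we have lifted the veil of mystery from regular expressions and listed the common syntax of the ERE standard. For the complete description of the rules from The Open Group, see: Regular Expressions. You are welcome to post your questions or comments in the discussion area there.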

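Another article

----------------------------------------

Regular Expressions and the Java Programming Language

-----------------------------------------

Classes and methods

The following classes match character sequences against patterns specified by regular expressions.

The Pattern class

An instance of the Pattern class represents a regular expression specified in string form, in a syntax similar to that used by Perl.

A regular expression specified as a string must first be compiled into an instance of the Pattern class. The resulting pattern is then used to create a Matcher object that matches arbitrary character sequences against the regular expression. Many matchers can share the same pattern because it is stateless.

The compile method compiles the given regular expression into a pattern, and the matcher method then creates a matcher that will match the given input against this pattern. The pattern method returns the regular expression from which the pattern was compiled.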
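As a brief illustration of compile, matcher and pattern, the sketch below compiles one pattern and shares it between several matchers. The class name CompileDemo, the helper matchAll and the sample regex a*b are assumptions made for this example:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompileDemo {
    // Creates one matcher per input from the same compiled pattern
    // and records whether each whole input matches.
    static boolean[] matchAll(Pattern p, String... inputs) {
        boolean[] results = new boolean[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            Matcher m = p.matcher(inputs[i]);
            results[i] = m.matches();
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("a*b");
        // pattern() returns the source string the pattern was compiled from.
        System.out.println(p.pattern());
        System.out.println(Arrays.toString(matchAll(p, "aaaab", "b", "xyz")));
    }
}
```

Compiling once and reusing the Pattern avoids re-parsing the expression for every input.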

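The split method is a convenience method that splits the given input sequence around matches of this pattern. The following example demonstrates it:

/*
 * Uses split to break up an input string that is
 * delimited by commas and/or whitespace.
 */
import java.util.regex.*;

public class Splitter {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match breaks
        Pattern p = Pattern.compile("[,\\s]+");
        // Split input with the pattern
        String[] result =
            p.split("one,two, three four , five");
        // Print each token on its own line
        for (int i = 0; i < result.length; i++)
            System.out.println(result[i]);
    }
}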

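The Matcher class

An instance of the Matcher class is used to match character sequences against a given pattern. Input is provided to matchers through the CharSequence interface in order to support matching against characters from a wide variety of input sources.

A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:

The matches method attempts to match the entire input sequence against the pattern.

The lookingAt method attempts to match the input sequence against the pattern, starting at the beginning.

The find method scans the input sequence looking for the next subsequence that matches the pattern.

Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.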
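The difference between the three operations is easiest to see side by side. This is a minimal sketch; the class name MatchKindsDemo, the helper describe and the sample inputs are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchKindsDemo {
    // Runs matches, lookingAt and find on fresh matchers and reports the three results.
    static String describe(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        boolean matches = p.matcher(input).matches();      // whole input must match
        boolean lookingAt = p.matcher(input).lookingAt();  // a prefix must match
        boolean find = p.matcher(input).find();            // any subsequence may match
        return matches + " " + lookingAt + " " + find;
    }

    public static void main(String[] args) {
        System.out.println(describe("cat", "cat"));      // true true true
        System.out.println(describe("cat", "catalog"));  // false true true
        System.out.println(describe("cat", "one cat"));  // false false true

        // After a successful match, the matcher's state says what matched and where.
        Matcher m = Pattern.compile("cat").matcher("one cat");
        if (m.find()) {
            System.out.println(m.group() + " at " + m.start() + "-" + m.end());  // cat at 4-7
        }
    }
}
```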

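This class also defines methods for replacing matched sequences with new strings whose contents can, if desired, be computed from the match result.

The appendReplacement method first appends all characters in the input from the current position up to the next match, and then appends the replacement value. The appendTail method appends the portion of the input that follows the last match, up to the end.

For instance, when cat is replaced with dog in the string blahcatblahcatblah, the first appendReplacement appends blahdog. The second appendReplacement appends blahdog, and appendTail then appends blah, resulting in blahdogblahdogblah. See the example Simple word replacement.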

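The CharSequence interface

The CharSequence interface provides uniform, read-only access to many different types of character sequences. You supply the data to be searched from different sources. String, StringBuffer and CharBuffer implement CharSequence, so they are easy sources of data to search through. If none of the available sources suits you, you can write your own input source by implementing the CharSequence interface.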
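Because a matcher only needs a CharSequence, the same code can search several kinds of input. A small sketch follows; the class name CharSequenceDemo and the helper count are hypothetical, and StringBuilder is used here as one more standard CharSequence implementation:

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharSequenceDemo {
    // Counts matches in any CharSequence, whatever its concrete type.
    static int count(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(count(digits, "a1b22"));                   // String input
        System.out.println(count(digits, new StringBuilder("333")));  // StringBuilder input
        System.out.println(count(digits, CharBuffer.wrap("4 5 6")));  // CharBuffer input
    }
}
```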

Regex example scenarios

The following code samples demonstrate the use of the java.util.regex package in a variety of common situations:

Simple word replacement

/*
 * This code writes "one dog, two dogs in the yard"
 * to the standard-output stream:
 */
import java.util.regex.*;

public class Replacement {
    public static void main(String[] args)
                         throws Exception {
        // Create a pattern to match cat
        Pattern p = Pattern.compile("cat");
        // Create a matcher with an input string
        Matcher m = p.matcher("one cat," +
                       " two cats in the yard");
        StringBuffer sb = new StringBuffer();
        boolean result = m.find();
        // Loop through and create a new String
        // with the replacements
        while(result) {
            m.appendReplacement(sb, "dog");
            result = m.find();
        }
        // Add the last segment of input to
        // the new String
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}

E-mail validation

The following code is an example of how you can check whether some characters form an e-mail address. It is not a complete e-mail validation program that handles every possible case, but it can be added to as needed.

/*
 * Checks for invalid characters
 * in email addresses
 */
import java.util.regex.*;

public class EmailValidation {
    public static void main(String[] args)
                                 throws Exception {
        String input = "@sun.com";
        //Checks for email addresses starting with
        //inappropriate symbols like dots or @ signs.
        Pattern p = Pattern.compile("^\\.|^\\@");
        Matcher m = p.matcher(input);
        if (m.find())
            System.err.println("Email addresses don't start" +
                               " with dots or @ signs.");
        //Checks for email addresses that start with
        //www. and prints a message if it does.
        p = Pattern.compile("^www\\.");
        m = p.matcher(input);
        if (m.find()) {
            System.out.println("Email addresses don't start" +
                    " with \"www.\", only web pages do.");
        }
        //Strips every run of characters that is not allowed
        //in this simplified check.
        p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
        m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        boolean result = m.find();
        boolean deletedIllegalChars = false;
        while(result) {
            deletedIllegalChars = true;
            m.appendReplacement(sb, "");
            result = m.find();
        }
        // Add the last segment of input to the new String
        m.appendTail(sb);
        input = sb.toString();
        if (deletedIllegalChars) {
            System.out.println("It contained incorrect characters" +
                               " , such as spaces or commas.");
        }
    }
}

Removing control characters from a file

/* This class removes control characters from a named
 * file.
 */
import java.util.regex.*;
import java.io.*;

public class Control {
    public static void main(String[] args)
                                 throws Exception {
        //Create a file object with the file name
        //in the argument:
        File fin = new File("fileName1");
        File fout = new File("fileName2");
        //Open an input and output stream
        FileInputStream fis =
                          new FileInputStream(fin);
        FileOutputStream fos =
                        new FileOutputStream(fout);
        BufferedReader in = new BufferedReader(
                       new InputStreamReader(fis));
        BufferedWriter out = new BufferedWriter(
                      new OutputStreamWriter(fos));
        // The pattern matches control characters
        // (\p{Cntrl} is the POSIX control-character class)
        Pattern p = Pattern.compile("\\p{Cntrl}");
        Matcher m = p.matcher("");
        String aLine = null;
        while((aLine = in.readLine()) != null) {
            m.reset(aLine);
            //Replaces control characters with an empty
            //string.
            String result = m.replaceAll("");
            out.write(result);
            out.newLine();
        }
        in.close();
        out.close();
    }
}

File searching

/*
 * Prints out the comments found in a .java file.
 */
import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;

public class CharBufferExample {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match comments
        Pattern p =
            Pattern.compile("//.*$", Pattern.MULTILINE);
        // Get a Channel for the source file
        File f = new File("Replacement.java");
        FileInputStream fis = new FileInputStream(f);
        FileChannel fc = fis.getChannel();
        // Get a CharBuffer from the source file
        // (read-only memory mapping of the whole file)
        ByteBuffer bb =
            fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        Charset cs = Charset.forName("ISO-8859-1");
        CharsetDecoder cd = cs.newDecoder();
        CharBuffer cb = cd.decode(bb);
        // Run some matches
        Matcher m = p.matcher(cb);
        while (m.find())
            System.out.println("Found comment: "+m.group());
        fis.close();
    }
}

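Conclusion

Pattern matching in the Java programming language is now as flexible as in many other programming languages. Regular expressions can be used in applications to ensure that data is correctly formatted before it is entered into a database or sent to another part of the application, and they can also be used for a wide variety of administrative tasks. In short, you can use regular expressions anywhere in your Java programming that calls for pattern matching.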

Regular Expressions in JDK 1.4

written by william chen(06/19/2002)

--------------------------------------------------------------------------------

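What are regular expressions?

They are a special kind of expression used to search and replace within files and strings.

Because many system settings on unix are stored in text files, network administrators and programmers often need to search and replace,

so a special kind of command called the regular expression was developed.

We can very simply use "s/

For this reason jdk1.4 provides a regular expression package for everyone to use.

Users of versions earlier than jdk1.4 can obtain a package with the same functionality from http://jakarta.apache.org/oro

The string of symbols listed just now, " s/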

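Regular expression syntax supported by j2sdk1.4

"." matches any single character

Pattern    Source string    Matched string
.          ab               a
..         abc              ab

"+" matches one or more occurrences of the preceding element

"*" matches zero or more occurrences of the preceding element

Pattern    Source string    Matched string
.+         ab               ab
.*         abc              abc

"( )" grouping

Pattern    Source string    Matched string
(ab)*      aabab            abab

Character classes

Pattern          Source string    Matched string
[a-dA-D0-9]*     abczA0           abc, A0
[^a-d]*          abe0             e0
[a-d]*           abcdefgh         abcd

Shorthand classes

\d  same as [0-9]               a digit
\D  same as [^0-9]              a non-digit
\s  same as [ \t\n\x0B\f\r]     a whitespace character
\S  same as [^ \t\n\x0B\f\r]    a non-whitespace character
\w  same as [a-zA-Z_0-9]        a letter, digit or underscore
\W  same as [^a-zA-Z_0-9]       anything other than a letter, digit or underscore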
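A quick way to see what each shorthand class covers is to replace every matching character with a marker. This is only a sketch; the class name ShorthandDemo, the helper mask and the sample strings are assumptions of the example:

```java
public class ShorthandDemo {
    // Replaces every character matched by the regex with '#'.
    static String mask(String regex, String input) {
        return input.replaceAll(regex, "#");
    }

    public static void main(String[] args) {
        System.out.println(mask("\\d", "a1b22"));    // a#b##
        System.out.println(mask("\\D", "a1b22"));    // #1#22
        System.out.println(mask("\\s", "a b\tc"));   // a#b#c
        System.out.println(mask("\\w", "x_1-y"));    // ###-#
        System.out.println(mask("\\W", "x_1-y"));    // x_1#y
    }
}
```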

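Beginning and end of each line

^ matches the beginning of each line

$ matches the end of each line

In java.util.regex, ^ and $ match at every line only when the pattern is compiled with the Pattern.MULTILINE flag; by default they match at the beginning and end of the whole input.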
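The following sketch shows how the Pattern.MULTILINE flag changes what ^ and $ match in Java. The class name LineAnchorDemo, the helper count and the three-line sample text are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineAnchorDemo {
    // Counts how many times the pattern matches in the text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // Default mode: ^ and $ refer to the whole input, which is not a single word.
        System.out.println(count(Pattern.compile("^\\w+$"), text));                     // 0
        // MULTILINE mode: ^ and $ match at the start and end of every line.
        System.out.println(count(Pattern.compile("^\\w+$", Pattern.MULTILINE), text));  // 3
    }
}
```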

--------------------------------------------------------------------------------

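Classes related to regular expressions in java.util.regex

Pattern: the class that represents a compiled regular expression

Matcher: the class that matches a Pattern against input and holds the result

PatternSyntaxException: the exception thrown when a regular expression fails to compile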
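PatternSyntaxException is unchecked, so it is only seen if the code catches it or the program fails. A minimal sketch of validating a pattern before use follows; the class name SyntaxCheckDemo and the helper check are hypothetical:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxCheckDemo {
    // Returns null when the regex compiles, otherwise a description of the syntax error.
    static String check(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("[a-d]*"));  // valid pattern: prints null
        System.out.println(check("(ab"));     // unbalanced parenthesis: prints an error description
    }
}
```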

Example 1: replace every "<" character in a string with "lt;"

import java.io.*;
import java.util.regex.*;

public class Replace01 {
    /**
     * Replaces every "<" character in a string with "lt;"
     */
    public static void replace01() {
        // BufferedReader lets us read line-by-line
        Reader r = new InputStreamReader(System.in);
        BufferedReader br = new BufferedReader(r);
        Pattern pattern = Pattern.compile("<"); // match every '<' in the string
        try {
            while (true) {
                String line = br.readLine();
                // Null line means input is exhausted
                if (line == null)
                    break;
                Matcher a = pattern.matcher(line);
                while (a.find()) {
                    System.out.println("Matched character: " + a.group());
                }
                System.out.println(a.replaceAll("lt;")); // replace every match with lt;
            }
        } catch (Exception ex) { ex.printStackTrace(); }
    }

    public static void main(String[] args) {
        replace01();
    }
}

Example 2:

import java.io.*;
import java.util.regex.*;

public class Search01 {
    /**
     * Works like StringTokenizer:
     * splits a string on "," and reports the longest token
     */
    public static void search01() {
        // BufferedReader lets us read line-by-line
        Reader r = new InputStreamReader(System.in);
        BufferedReader br = new BufferedReader(r);
        Pattern pattern = Pattern.compile(",\\s*"); // match each "," and any whitespace after it
        try {
            while (true) {
                String line = br.readLine();
                // Null line means input is exhausted; check before splitting
                if (line == null)
                    break;
                String[] words = pattern.split(line);
                // -1 means we haven't found a word yet
                int longest = -1;
                int longestLength = 0;
                for (int i = 0; i < words.length; i++) {
                    System.out.println("Token: " + words[i]);
                    if (words[i].length() > longestLength) {
                        longest = i;
                        longestLength = words[i].length();
                    }
                }
                // An empty line yields no non-empty token
                if (longest >= 0)
                    System.out.println("Longest token: " + words[longest]);
            }
        } catch (Exception ex) { ex.printStackTrace(); }
    }

    public static void main(String[] args) {
        search01();
    }
}

--------------------------------------------------------------------------------

Other regular expression syntax

/^\s* # skip whitespace at the start of each line

(M(s|r|rs)\.) # matches Ms., Mrs., and Mr. (titles)
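The fragment above uses Perl-style inline comments. As a rough sketch of how a commented pattern of this kind could be written in Java, the example below uses the Pattern.COMMENTS flag, which ignores whitespace in the pattern and treats # as the start of a comment. Only the two lines shown above are reproduced; the class name TitleDemo, the helper title and the sample inputs are assumptions of this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleDemo {
    // COMMENTS lets the pattern carry its own documentation;
    // MULTILINE makes ^ match at the start of every line.
    static final Pattern TITLE = Pattern.compile(
        "^\\s*            # skip whitespace at the start of the line\n"
      + "(M(s|r|rs)\\.)   # Ms., Mr. or Mrs. (title)\n",
        Pattern.COMMENTS | Pattern.MULTILINE);

    // Returns the title found at the start of a line, or null if there is none.
    static String title(String text) {
        Matcher m = TITLE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(title("  Mrs. Smith"));  // Mrs.
        System.out.println(title("Mr. Jones"));     // Mr.
        System.out.println(title("Dr. Who"));       // null
    }
}
```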