分享
 
 
 

edit distance 编辑距离

王朝other·作者佚名  2006-02-01
窄屏简体版  字體: |||超大  

Refrence : Dynamic Programming Algorithm (DPA) for Edit-Distance

编辑距离

关于两个字符串s1,s2的差别,可以通过计算他们的最小编辑距离来决定。

所谓的编辑距离: 让s1和s2变成相同字符串需要下面操作的最小次数。

1. 把某个字符ch1变成ch2

2. 删除某个字符

3. 插入某个字符

例如 s1 = “12433” 和s2=”1233”;

则可以通过在s2中间插入4得到12433与s1一致。

即 d(s1,s2) = 1 (进行了一次插入操作)

编辑距离的性质

计算两个字符串s1+ch1, s2+ch2的编辑距离有这样的性质:

1. d(s1,””) = d(“”,s1) = |s1| d(“ch1”,”ch2”) = ch1 == ch2 ? 0 : 1;

2. d(s1+ch1,s2+ch2) = min( d(s1,s2)+ ch1==ch2 ? 0 : 1 ,

d(s1+ch1,s2),

d(s1,s2+ch2) );

第一个性质是显然的。

第二个性质: 由于我们定义的三个操作来作为编辑距离的一种衡量方法。

于是对ch1,ch2可能的操作只有

1. 把ch1变成ch2

2. s1+ch1后删除ch1 d = (1+d(s1,s2+ch2))

3. s1+ch1后插入ch2 d = (1 + d(s1+ch1,s2))

对于2和3的操作可以等价于:

_2. s2+ch2后添加ch1 d=(1+d(s1,s2+ch2))

_3. s2+ch2后删除ch2 d=(1+d(s1+ch1,s2))

因此可以得到计算编辑距离的性质2。

复杂度分析

从上面性质2可以看出计算过程呈现这样的一种结构(假设各个层用当前计算的串长度标记,并假设两个串长度都为 n )

可以看到,该问题的复杂度为指数级别 3 的 n 次方,对于较长的串,时间上是无法让人忍受的。

分析: 在上面的结构中,我们发现多次出现了 (n-1,n-1), (n-1,n-2)……。换句话说该结构具有重叠子问题。再加上前面性质2所具有的最优子结构。符合动态规划算法基本要素。因此可以使用动态规划算法把复杂度降低到多项式级别。

动态规划求解

首先为了避免重复计算子问题,添加两个辅助数组。

一. 保存子问题结果。

M[ |s1| ,|s2| ] , 其中M[ i , j ] 表示子串 s1(0->i) 与 s2(0->j) 的编辑距离

二. 保存字符之间的编辑距离.

E[ |s1|, |s2| ] , 其中 E[ i, j ] = s[i] = s[j] ? 0 : 1

三. 新的计算表达式

根据性质1得到

M[ 0,0] = 0;

M[ s1i, 0 ] = |s1i|;

M[ 0, s2j ] = |s2j|;

根据性质2得到

M[ i, j ] = min( m[i-1,j-1] + E[ i, j ] ,

m[i, j-1] ,

m[i-1, j] );

复杂度

从新的计算式看出,计算过程为

i=1 -> |s1|

j=1 -> |s2|

M[i][j] = ……

因此复杂度为 O( |s1| * |s2| ) ,如果假设他们的长度都为n,则复杂度为 O(n^2)

Reference: Dynamic Programming Algorithm (DPA) for Edit-DistanceThe words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `spot' by the deletion of the `p', or equivalently, `spot' can be changed into `sport' by the insertion of `p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

change a letter, insert a letter or delete a letter

The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:

d('', '') = 0

d(s, '') = d('', s) = |s| -- i.e. length of s

d(s1+ch1, s2+ch2)

= min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,

d(s1+ch1, s2) + 1,

d(s1, s2+ch2) + 1 )

The first two rules above are obviously true, so it is only necessary consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.

The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.

Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.

A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:

m[i,j] = d(s1[1..i], s2[1..j])

m[0, 0] = 0

m[i, 0] = i, i=1..|s1|

m[0, j] = j, j=1..|s2|

m[i,j] = min( m[i-1,j-1]

+ if s1[i]=s2[j] then 0 else 1 fi,

m[i-1, j] + 1,

m[i, j-1] + 1 ), i=1..|s1|, j=1..|s2|

m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n2), much better than exponential!

The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `spot' by the deletion of the `p', or equivalently, `spot' can be changed into `sport' by the insertion of `p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

change a letter, insert a letter or delete a letter

The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:

d('', '') = 0

d(s, '') = d('', s) = |s| -- i.e. length of s

d(s1+ch1, s2+ch2)

= min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,

d(s1+ch1, s2) + 1,

d(s1, s2+ch2) + 1 )

The first two rules above are obviously true, so it is only necessary consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.

The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.

Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.

A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:

m[i,j] = d(s1[1..i], s2[1..j])

m[0, 0] = 0

m[i, 0] = i, i=1..|s1|

m[0, j] = j, j=1..|s2|

m[i,j] = min( m[i-1,j-1]

+ if s1[i]=s2[j] then 0 else 1 fi,

m[i-1, j] + 1,

m[i, j-1] + 1 ), i=1..|s1|, j=1..|s2|

m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n2), much better than exponential!

 
 
 
免责声明:本文为网络用户发布,其观点仅代表作者个人观点,与本站无关,本站仅提供信息存储服务。文中陈述内容未经本站证实,其真实性、完整性、及时性本站不作任何保证或承诺,请读者仅作参考,并请自行核实相关内容。
2023年上半年GDP全球前十五强
 百态   2023-10-24
美众议院议长启动对拜登的弹劾调查
 百态   2023-09-13
上海、济南、武汉等多地出现不明坠落物
 探索   2023-09-06
印度或要将国名改为“巴拉特”
 百态   2023-09-06
男子为女友送行,买票不登机被捕
 百态   2023-08-20
手机地震预警功能怎么开?
 干货   2023-08-06
女子4年卖2套房花700多万做美容:不但没变美脸,面部还出现变形
 百态   2023-08-04
住户一楼被水淹 还冲来8头猪
 百态   2023-07-31
女子体内爬出大量瓜子状活虫
 百态   2023-07-25
地球连续35年收到神秘规律性信号,网友:不要回答!
 探索   2023-07-21
全球镓价格本周大涨27%
 探索   2023-07-09
钱都流向了那些不缺钱的人,苦都留给了能吃苦的人
 探索   2023-07-02
倩女手游刀客魅者强控制(强混乱强眩晕强睡眠)和对应控制抗性的关系
 百态   2020-08-20
美国5月9日最新疫情:美国确诊人数突破131万
 百态   2020-05-09
荷兰政府宣布将集体辞职
 干货   2020-04-30
倩女幽魂手游师徒任务情义春秋猜成语答案逍遥观:鹏程万里
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案神机营:射石饮羽
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案昆仑山:拔刀相助
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案天工阁:鬼斧神工
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案丝路古道:单枪匹马
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案镇郊荒野:与虎谋皮
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案镇郊荒野:李代桃僵
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案镇郊荒野:指鹿为马
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案金陵:小鸟依人
 干货   2019-11-12
倩女幽魂手游师徒任务情义春秋猜成语答案金陵:千金买邻
 干货   2019-11-12
 
推荐阅读
 
 
 
>>返回首頁<<
 
靜靜地坐在廢墟上,四周的荒凉一望無際,忽然覺得,淒涼也很美
© 2005- 王朝網路 版權所有