String Matching问题

发表于 2022-05-22 更新于 2022-05-31 阅读次数： Artalk：

简介：

对于给定的两个字符串 S1 和 S2 将2个字符串进行匹配解决：S2是否是S1 的子字符串？S2和S1 如何进行“对齐”使得2个字符串最接近？等问题

最直接以及常见的使用场景例如：查询 DNA上是否含有某个基因，2个DNA是否有相似的片段（可能是相似的基因）

完美匹配问题（Matching）

问题：检测长度为n 的字符串T 上是否含有长度为m的字符串P

Input: T, P

Output: 首个matching的index

Sample Input:

1 2	abcabcabdabba abcabd

Sample Output:

匹配问题采用Knuth-Morris-Pratt KMP算法。

Multivating brute-force algorithm:

for i = 1:n:
	for j=1:m
		if s1[i+j] !=s[j]:
     break
  end
  print(i)
  end
end

这必然会造成重复计算，如何理解？我们首先拆分暴力算法的思路：

1）在S1位置上循环i = 0...n

2）在每个i位置上开始循环比对 T与P

3）如果我们重复到j发现不重复了然后在i+1位置上从头开始尝试是否匹配。

第3步就是我们复杂度高的来源，我们不需要从i到i+1的时候从头把j从0开始重新计算:

那么从j=？重新开始呢？假设在T[i]处已经匹配了p位，那么只需要找到一个k使得 P[1:k]为 P[p-k:p]也即"后缀是前缀"

KMP算法分两步：

1）构建一个索引：given任意位置，找出最大的"后缀是前缀"

2）用这个索引进行逐个的检索

code:

# 构建索引next[m]
void compute_next(string P,int next[]){
    int m = P.length();
    next[0] = 0;   // initialization
    int k = 0; // number of character matched
    int q = 1;
    while(q<m){
        while(k>0 and P[k] != P[q]){
            k = next[k-1];
        }
        if(P[k] == P[q]) k=k+1;
        next[q] = k;
        
        q++;
    }
}

#given next[m] 进行T上search P
int KMP_search(string T,string P ){
    int m = P.length();
    int n = T.length();
    
    int next[m];
    compute_next(P, next);
    int q = 0;
    
    for(int i=0;i<n;i++){
        while(q>0 and  P[q]!= T[i]){
            q = next[q-1];
        }
        if(P[q] == T[i]){
            q++;
        }
        if(q == m){
            return i-m+1;
        }
    }
    return -1;
}

对齐问题(Alignment)

想想这样一个问题，有2个string A 和B，把B改为A可以有以下几个操作：

◦ Replace a letter with another letter 改 ◦ Insert a letter 删 ◦ Delete a letter 插入

所以需要找到一个对齐方式例如：

使得修改的cost最小，为此我们首先定义每一个对齐之后修改的“得分”：

每处的得分为:

$\delta(delete),\delta(modify),\delta(insert)$

使用dp算法解决问题：

Recurrence: For $\mathrm{i}>0, \mathrm{j}>0$

V(i, j)=\max \left\{\begin{array}{cl} V(i-1, j-1)+\delta(S[i], T[j]) & \text { Match/mismatch } \\ V(i-1, j)+\delta(S[i],) & \text { Delete } \\ V(i, j-1)+\delta(\ldots T[j]) & \text { Insert } \end{array}\right.