Edit distance in Java: Edit distance with a scoring matrix

2020-08-18 08:00:52  Views: 258  Source: Internet

Tags: distance, Java, distances, Edit, edit, int, length, word


Earlier, we defined edit distance as the minimal number of insertions, deletions and substitutions required to transform one string into another. But the metric can also be formulated in another way. We may assign a cost to each operation and say that the edit distance is the minimal total cost of a sequence of operations that converts one string into the other.

For example, we may say that each of the described operations costs 1. In this case, there is no difference between the two formulations. But sometimes it is convenient to assign costs in another way.
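To make the cost-based formulation concrete, here is a minimal sketch of the usual dynamic-programming recurrence with one fixed cost per operation type. The class name `WeightedEditDistance` and the parameters `insCost`, `delCost` and `subCost` are illustrative assumptions, not part of the original task.

```java
final class WeightedEditDistance {
    // Hypothetical helper: edit distance with one fixed cost per operation type.
    static int distance(String a, String b, int insCost, int delCost, int subCost) {
        // d[i][j] = minimal cost of turning the first i symbols of a
        // into the first j symbols of b.
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            d[i][0] = d[i - 1][0] + delCost; // delete every symbol of a
        }
        for (int j = 1; j <= b.length(); j++) {
            d[0][j] = d[0][j - 1] + insCost; // insert every symbol of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                // Equal symbols are kept for free; different ones cost subCost.
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : subCost;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                        Math.min(d[i - 1][j] + delCost, d[i][j - 1] + insCost));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Unit costs reproduce the "number of operations" definition.
        System.out.println(distance("kitten", "sitting", 1, 1, 1)); // 3
        // A substitution priced like a deletion plus an insertion.
        System.out.println(distance("kitten", "sitting", 1, 1, 2)); // 5
    }
}
```

With all costs equal to 1 the result is simply the number of operations, so both formulations agree. Changing a single cost already changes the answer, which is what the scoring-matrix variant exploits.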

Assume we are working on a system for correcting spelling mistakes. Our algorithm is the following: we get a user's request, find the most similar word in a database of correct words using the edit distance metric, and use that word instead of the initial one.

Suppose we get the request "flaq". In this case, there are at least two words at edit distance 1 from the initial string: "flaw" and "flat". So, which one should we use? By edit distance alone, there is no difference. But on the other hand, the word "flaw" is more similar to "flaq", because the letters "q" and "w" are closer on a keyboard than "q" and "t", so it is more likely that the user wanted to write "flaw" and not "flat".

To process such cases correctly, one may use a so-called scoring matrix. A scoring matrix is a table $m$ where $m[s_1][s_2]$ is the cost of substituting a symbol $s_1$ with a symbol $s_2$. For example, to solve the previous problem, we can use a matrix that assigns lower costs to symbols that are close on a keyboard and higher costs to symbols that are far from each other.
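One possible way to derive such costs is sketched below. It is only an illustration under loud assumptions: the class `KeyboardCost` is hypothetical, it covers just the top QWERTY letter row, and it defines the cost as the horizontal distance between two keys. A real scoring matrix would need all rows and probably a different distance measure.

```java
final class KeyboardCost {
    // Assumption: only the top QWERTY letter row is modeled.
    private static final String TOP_ROW = "qwertyuiop";

    // Hypothetical cost: number of keys between the two symbols on that row.
    static int substitutionCost(char from, char to) {
        int i = TOP_ROW.indexOf(from);
        int j = TOP_ROW.indexOf(to);
        if (i < 0 || j < 0) {
            throw new IllegalArgumentException("symbol outside the illustrative key row");
        }
        return Math.abs(i - j);
    }

    public static void main(String[] args) {
        System.out.println(substitutionCost('q', 'w')); // 1
        System.out.println(substitutionCost('q', 't')); // 4
    }
}
```

Under this toy cost, replacing "q" with "w" is cheaper than replacing "q" with "t", so "flaw" would be preferred over "flat" for the request "flaq".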

So, your task here is to implement a simple system for correction of spelling mistakes. For convenience, we will use a shortened version of the alphabet.

Input: The first line contains a string $s$, a user's request. The second line contains an integer $k$, the size of the database. Each of the next $k$ lines contains a string, a correct word. Each string consists only of the letters a, s, d, b, n, m.

Output: The first line should contain the edit distance $d_E(s, t)$, where $t$ is the word with the minimal edit distance to $s$ among all words in the database. The second line should contain the word $t$ itself. If there are several words with the minimal edit distance, print the one that occurs first in the database.

Consider the cost of an insertion and of a deletion to be equal to 1. To calculate the cost of a substitution, use the following scoring matrix:

  a s d b n m
a 0 1 2 5 6 7
s 1 0 1 5 6 7
d 2 1 0 5 6 7
b 5 6 7 0 1 2
n 5 6 7 1 0 1
m 5 6 7 2 1 0
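A small lookup sketch may help when reading the table. Note that the table is not symmetric (for example, the entry in row a, column n is 6, while the entry in row n, column a is 5), so the order of the indices matters. The sketch assumes, following the definition $m[s_1][s_2]$, that the row is the symbol being replaced and the column is the replacement; the names `ScoringMatrixLookup`, `LETTERS` and `MATRIX` are illustrative.

```java
final class ScoringMatrixLookup {
    // Column and row order of the table: a s d b n m.
    static final String LETTERS = "asdbnm";

    static final int[][] MATRIX = {
        {0, 1, 2, 5, 6, 7}, // a
        {1, 0, 1, 5, 6, 7}, // s
        {2, 1, 0, 5, 6, 7}, // d
        {5, 6, 7, 0, 1, 2}, // b
        {5, 6, 7, 1, 0, 1}, // n
        {5, 6, 7, 2, 1, 0}  // m
    };

    // Assumption: row = symbol being replaced, column = replacement symbol.
    static int substitutionCost(char from, char to) {
        int row = LETTERS.indexOf(from);
        int col = LETTERS.indexOf(to);
        if (row < 0 || col < 0) {
            throw new IllegalArgumentException("symbol outside the alphabet " + LETTERS);
        }
        return MATRIX[row][col];
    }

    public static void main(String[] args) {
        System.out.println(substitutionCost('a', 's')); // 1
        System.out.println(substitutionCost('a', 'n')); // 6
        System.out.println(substitutionCost('n', 'a')); // 5
    }
}
```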

Sample Input 1:

aad
3
mad
sad
bad

Sample Output 1:

1
sad

Sample Input 2:

asa
3
ama
aba
ada

Sample Output 2:

1
ada

Solution:

import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        // Read the request, the database size and the database itself.
        String s = scanner.next();
        int k = scanner.nextInt();
        String[] database = new String[k];

        for (int i = 0; i < k; i++) {
            database[i] = scanner.next();
        }

        // The position of a letter in this string is its row/column index in the matrix.
        String letters = "asdbnm";

        // scoringMatrix[x][y] is the cost of substituting letters[x] with letters[y].
        int[][] scoringMatrix = {
            {0, 1, 2, 5, 6, 7},
            {1, 0, 1, 5, 6, 7},
            {2, 1, 0, 5, 6, 7},
            {5, 6, 7, 0, 1, 2},
            {5, 6, 7, 1, 0, 1},
            {5, 6, 7, 2, 1, 0}
        };

        int minDistance = Integer.MAX_VALUE;
        String result = null;

        for (String t : database) {
            // distances[i][j] is the minimal cost of turning the first i symbols of s
            // into the first j symbols of t.
            int[][] distances = new int[s.length() + 1][t.length() + 1];
            for (int i = 0; i <= s.length(); i++) {
                distances[i][0] = i; // i deletions, each of cost 1
            }
            for (int j = 0; j <= t.length(); j++) {
                distances[0][j] = j; // j insertions, each of cost 1
            }
            for (int i = 1; i <= s.length(); i++) {
                for (int j = 1; j <= t.length(); j++) {
                    int insCost = distances[i][j - 1] + 1;
                    int delCost = distances[i - 1][j] + 1;
                    // The matrix has 0 on its diagonal, so equal symbols cost nothing.
                    int match = scoringMatrix[letters.indexOf(s.charAt(i - 1))][letters.indexOf(t.charAt(j - 1))];
                    int subCost = distances[i - 1][j - 1] + match;
                    distances[i][j] = Math.min(Math.min(insCost, delCost), subCost);
                }
            }
            // The strict comparison keeps the first word when several share the minimum.
            if (distances[s.length()][t.length()] < minDistance) {
                minDistance = distances[s.length()][t.length()];
                result = t;
            }
        }

        System.out.println(minDistance);
        System.out.println(result);
    }
}
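To check the approach against the two samples without typing input, the same recurrence can be wrapped in methods. This is only a sketch: the names `SpellingCorrector`, `distance` and `closest` are assumptions and do not appear in the original solution. One detail worth noticing: because a deletion followed by an insertion costs 2 in total, a matrix entry above 2 is never the cheapest choice, so the distance between "aad" and "mad" is 2, not 7.

```java
final class SpellingCorrector {
    private static final String LETTERS = "asdbnm";

    private static final int[][] MATRIX = {
        {0, 1, 2, 5, 6, 7},
        {1, 0, 1, 5, 6, 7},
        {2, 1, 0, 5, 6, 7},
        {5, 6, 7, 0, 1, 2},
        {5, 6, 7, 1, 0, 1},
        {5, 6, 7, 2, 1, 0}
    };

    // Edit distance from s to t: insertions and deletions cost 1,
    // substitutions cost MATRIX[symbol of s][symbol of t].
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= t.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int row = LETTERS.indexOf(s.charAt(i - 1));
                int col = LETTERS.indexOf(t.charAt(j - 1));
                int sub = d[i - 1][j - 1] + MATRIX[row][col];
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j], d[i][j - 1]) + 1);
            }
        }
        return d[s.length()][t.length()];
    }

    // Returns the first database word with the minimal distance to s,
    // or null if the database is empty.
    static String closest(String s, String... database) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String t : database) {
            int current = distance(s, t);
            if (current < bestDistance) {
                bestDistance = current;
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(distance("aad", "mad"));             // 2
        System.out.println(closest("aad", "mad", "sad", "bad")); // sad
        System.out.println(closest("asa", "ama", "aba", "ada")); // ada
    }
}
```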

Source: https://www.cnblogs.com/longlong6296/p/13521288.html