Levenshtein distance

Given two strings, return its Levenshtein distance, i.e. the minimum number of character changes needed to mutate the first string in the second one.

You can have a look at this Levenshtein demo to get a better idea of what is expected. Notice that the demo is more complicated than the required solution. We are interested only in the "plain" string type, and we always use weight 1 for indel (insertion/deletion) and substitution operations. Moreover, we don't consider swap.

The Levenshtein distance is often used to show how Dynamic Programming works in a more complicated case than the classic make change problem. Here I present my Java solution based on the Wagner–Fischer algorithm, that asks us to build a bidimensional table generating its values starting from the most basic cases up to the required solution.

The idea of the algorithm is to put on one table axis the first string and the second one on the other one, paying attention to give an extra position at the beginning of both string. So, for instance, if we want to calculate the distance between "one" and "bone", we'll 4 rows and 5 columns (or the other way round).

Then we have to populate the tables. The top-left cell is the easiest one, it represent the case I want to pass from an empty string to another empty one. No change is needed, so I put there a zero.

To understand what I should put in the first row I can think to the case I want to calculate the distance from an empty string to another one size n. Obviously, I would increase by one the value in each cell, representing the cost of inserting a new character.

Similarly, the first column will follow the same structure. Here the increment would represent the deletion of a character.

The generic case is a bit more complicated, since we have to evaluate four different cases. Let's think to the cell (1,1). To decide which value assign to it we need to compare the first elements of both string. If it is the same, we don't have to increase the distance, we just keep the value calculate for the top left case, namely cell (0,0). Otherwise, we need to consider all the three possible editing that would let us arrive there, inserting a new character, deleting or substituting the existing one. Since we want to find the minimum distance, we'll choose the smallest among these values.

Remembering how we filled the first row and column, we should see how to get the value for inserting and deleting, it is just a matter of increasing the value contained on the cell on the left and on the top, respectively. For substitution we have to refer to the top-left cell, as we have done for the equality case.

Once we have grasped the sense of it, building the table is a piece of cake, and what we have to do is just returning the value in the bottom-right cell.

For testing purpose, and for let me better understand how things work, instead of implementing the solution with a single static method, I used a class with a constructor, that does all the dirty job, and a couple of public methods, one that returns the calculated Levenshtein distance and another one that checks the table as generated by the constructor. You could see the full Java code on GitHub, both for the Levenshtein class and its associated test cases, here I publish and comment only the ctor:
private int[][] matrix; // 1

public Levenshtein(String from, String to) { // 2
    matrix = new int[from.length() + 1][to.length() + 1]; // 3
    for (int i = 1; i < matrix[0].length; ++i) { // 4
        matrix[0][i] = i;
    }
    for (int j = 1; j < matrix.length; ++j) {
        matrix[j][0] = j;
    }

    for (int i = 0; i < from.length(); i++) { // 5
        for (int j = 0; j < to.length(); j++) {
            if (from.charAt(i) == to.charAt(j)) { // 6
                matrix[i + 1][j + 1] = matrix[i][j];
            } else { // 7
                int insert = matrix[i][j + 1] + 1;
                int delete = matrix[i + 1][j] + 1;
                int substitute = matrix[i][j] + 1;
                matrix[i + 1][j + 1] = Math.min(Math.min(insert, delete), substitute);
            }
        }
    }
}
1. Here is where I am going to store all the partial solutions and the actual one.
2. I think the first string associated to the rows in matrix, the second one to the columns.
3. The matrix has one row plus one for each character in the first string and a column plus one for each character in the second one.
4. Initialize the zeroth row and column as discussed above. No need to initialize matrix[0][0], the default value 0 is exactly what we want.
5. Loop on all the characters in both strings.
6. It's a match! Put in the relative cell in the matrix the same value stored in the top left cell.
7. Otherwise calculate how much is going to cost to put there the right value by insertion, deletion, substitution, and put there the minimum value among these three ones.

Once built the matrix, the solution appears on the bottom right cell in the matrix.

No comments:

Post a Comment