Levenshtein Distance Spelling Correction

Sat 18 January 2014

As I started looking into spelling correction, I began to stumble upon several interesting essays on the subject. Here is what I've gathered about Levenshtein Distance spelling correction.

But first, we should learn something about Levenshtein Distance (or edit distance).
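
To make that concrete, here is a minimal sketch (my own, not taken from any of the posts below) of the standard dynamic-programming computation of the Levenshtein distance between two strings:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    # prev[j] holds the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```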

Peter Norvig's algorithm

The origin of Levenshtein Distance Spelling Correction can be traced to Peter Norvig's famous essay, How to Write a Spelling Corrector.

In Peter Norvig's algorithm, the dictionary words are stored in a Python dictionary (similar to a hash table) for fast lookup. He then takes the target word (which is probably misspelled) and generates all of its spelling mutations within Levenshtein distance 2. For every mutation C that actually appears in the dictionary, he calculates the probability that C was the intended word given the typed word W, and picks the candidate C with the maximum probability.
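
The essay contains the full program; the sketch below condenses the idea as described above (names such as WORDS, edits1, known and correct roughly follow the essay, and big.txt stands for whatever plain-text corpus is available):

```python
import re
from collections import Counter

# Assumption: 'big.txt' is a plain-text corpus; WORDS maps each word to its count.
WORDS = Counter(re.findall(r'[a-z]+', open('big.txt').read().lower()))

def edits1(word):
    """All strings at edit distance 1 from word."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the mutations that are real dictionary words."""
    return {w for w in words if w in WORDS}

def correct(word):
    """Prefer the word itself, then distance-1, then distance-2 candidates,
    and among those pick the most frequent one (a proxy for probability)."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORDS[w])
```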

Other Algorithms

Steve Hanov's post on the Levenshtein Distance approach takes the target word, searches the whole dictionary, and finds the dictionary word with the smallest Levenshtein distance.
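
In code, that brute-force scan is just a minimum over the whole dictionary (reusing the levenshtein function from the first sketch):

```python
def naive_correct(target, dictionary):
    """Scan every dictionary word and keep the one with the smallest edit distance."""
    return min(dictionary, key=lambda word: levenshtein(target, word))
```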

What's interesting is how the author improves it. He figured that by storing the dictionary words in a trie, a great deal of the cost of calculating Levenshtein distances could be saved. This works because of the prefix structure of the trie and the row-by-row nature of the Levenshtein distance calculation, which together allow intermediate results to be reused across all words sharing a prefix. Further improvements include compressing the trie into a deterministic acyclic finite state automaton (DAFSA), also known as a directed acyclic word graph (DAWG), and using a hash table to minimize memory usage.
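
Below is my own reconstruction of that trie walk, not the author's code: every trie edge extends one row of the DP table, so all words sharing a prefix share the rows computed for that prefix, and a whole subtree is pruned as soon as every cell in the current row exceeds the allowed cost.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set to the full word on terminal nodes

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def trie_search(root, target, max_cost):
    """Find (word, distance) pairs within max_cost edits of target."""
    first_row = list(range(len(target) + 1))
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, target, first_row, results, max_cost)
    return results

def _walk(node, ch, target, prev_row, results, max_cost):
    # Extend the DP table by one row for the character on this trie edge.
    curr_row = [prev_row[0] + 1]
    for col in range(1, len(target) + 1):
        curr_row.append(min(curr_row[col - 1] + 1,        # insertion
                            prev_row[col] + 1,            # deletion
                            prev_row[col - 1]
                            + (0 if target[col - 1] == ch else 1)))  # substitution
    if node.word is not None and curr_row[-1] <= max_cost:
        results.append((node.word, curr_row[-1]))
    # Prune: if every cell already exceeds max_cost, nothing below can match.
    if min(curr_row) <= max_cost:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, target, curr_row, results, max_cost)
```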

The BK-Tree is a data structure that improves dictionary search, reducing the search time from a full O(n) scan to roughly O(log n). Compared with the whole-dictionary search mentioned above, the BK-Tree only needs a partial search: it prunes away subtrees that cannot possibly contain a close enough match.
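
A minimal BK-tree sketch (again reusing the levenshtein function from the first snippet) could look like this; the triangle inequality is what justifies discarding subtrees whose edge labels fall outside [d - max_dist, d + max_dist]:

```python
class BKTree:
    """BK-tree keyed by edit distance; reuses levenshtein() from the first sketch."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None  # each node is a (word, {distance: child_node}) pair

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]        # walk down the edge labeled d
            else:
                node[1][d] = (word, {})  # attach a new child under that label
                return

    def query(self, word, max_dist):
        """Return all stored (word, distance) pairs within max_dist of word."""
        results, stack = [], ([self.root] if self.root else [])
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= max_dist:
                results.append((node_word, d))
            # Triangle inequality: only edges in [d - max_dist, d + max_dist]
            # can lead to a word within max_dist of the query.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree()
for w in ["book", "books", "boo", "cake", "cape"]:
    tree.add(w)
print(tree.query("bo", 1))  # [('boo', 1)] -- 'book' is at distance 2
```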

Here is an interesting reddit discussion about this topic.

There is also a post about Lucene, which features the Levenshtein Automaton.

Category: Algorithm Tagged: Algorithm



USACO Section 1.1 Broken Necklace

Sat 30 November 2013

Let's describe the problem briefly. Given a string representing a necklace composed of red, blue, and white beads, we want to find the maximum number of beads we can collect if we can choose to break the necklace at a certain point. By "maximum number of beads we can collect", I mean that starting from the break point, we collect beads of the same color from one end until we reach a bead of a different color, and then do the same from the other end.
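
One straightforward way to solve it (my own sketch, not necessarily the approach in the post) is to double the string to handle the circular wrap-around and try every break point, assuming the usual USACO rule that a white bead may be counted as either color:

```python
def run_length(beads):
    """Count leading beads of one color, treating 'w' as a wildcard."""
    color, count = 'w', 0
    for ch in beads:
        if color == 'w':
            color = ch
        if ch != 'w' and ch != color:
            break
        count += 1
    return count

def max_beads(necklace):
    n = len(necklace)
    doubled = necklace + necklace  # doubling simulates the circular necklace
    best = 0
    for brk in range(n):
        segment = doubled[brk:brk + n]
        # Collect forward from the break point and backward from the bead before it.
        collected = run_length(segment) + run_length(segment[::-1])
        best = max(best, min(collected, n))  # never count a bead twice
    return best

print(max_beads("wwwbbrwrbrbrrbrbrwrwwrbwrwrrb"))  # 11 for the USACO sample necklace
```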

Category: Algorithm Tagged: C USACO Algorithm

