Levenshtein Distance Spelling Correction
Sat 18 January 2014
As I started looking into spelling correction, I stumbled on several interesting essays about it. Here is what I've gathered about Levenshtein distance spelling correction.
But first, we should cover the basics of Levenshtein distance (also called edit distance): the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another.
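As a concrete reference, here is a minimal sketch of the classic Wagner-Fischer dynamic-programming computation (the function name and example words are just illustrative):

```python
def levenshtein(a, b):
    """Wagner-Fischer edit distance, keeping only one DP row at a time."""
    prev = list(range(len(b) + 1))            # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```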
Peter Norvig's Algorithm
The origin of Levenshtein Distance Spelling Correction can be traced to Peter Norvig's famous essay, How to Write a Spelling Corrector.
In Peter Norvig's algorithm, the known words are stored in a Python dictionary (a hash table) for fast lookup, together with their corpus frequencies. Given a target word W (which is probably misspelled), he generates all spelling mutations within Levenshtein distance 2 of W. For every candidate C that also appears in the dictionary, he estimates the probability of C given the original word W and selects the C with the maximum probability.
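Here is a condensed sketch of that approach, lightly adapted from the essay (big.txt is the training corpus the essay uses; treat the exact file and names as illustrative):

```python
import re
from collections import Counter

# Word-frequency model built from a text corpus.
WORDS = Counter(re.findall(r'[a-z]+', open('big.txt').read().lower()))

def edits1(word):
    """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    """Prefer known words at distance 0, then 1, then 2; break ties by frequency."""
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])
```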
Other Algorithms
Steve Hanov's post on the Levenshtein distance approach takes the target word, searches the whole dictionary, and returns the dictionary word with the smallest Levenshtein distance.
What's interesting is how the author improves on this. He figured that by storing the dictionary words in a trie, a great deal of the cost of calculating Levenshtein distance can be saved: words that share a prefix also share the corresponding rows of the dynamic-programming table, so those rows are computed once per prefix instead of once per word (see the sketch below). Further improvements include compressing the trie into a deterministic acyclic finite state automaton (DAFSA), also known as a directed acyclic word graph (DAWG), combined with a hash table to minimize memory usage.
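Here is a sketch in the spirit of that post, assuming a plain trie (the class and function names are my own). Each trie edge extends the previous DP row by one character, so shared prefixes are computed only once, and branches whose row minimum already exceeds the cost budget are pruned:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None            # full word stored at terminal nodes

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(root, word, max_cost):
    """Return (word, distance) pairs within max_cost of word."""
    results = []
    first_row = list(range(len(word) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, word, first_row, results, max_cost)
    return results

def _walk(node, ch, word, prev_row, results, max_cost):
    # Extend the parent's DP row by one trie character.
    curr_row = [prev_row[0] + 1]
    for j in range(1, len(word) + 1):
        cost = 0 if word[j - 1] == ch else 1
        curr_row.append(min(curr_row[j - 1] + 1,
                            prev_row[j] + 1,
                            prev_row[j - 1] + cost))
    if node.word is not None and curr_row[-1] <= max_cost:
        results.append((node.word, curr_row[-1]))
    if min(curr_row) <= max_cost:   # prune: costs never decrease down a branch
        for next_ch, child in node.children.items():
            _walk(child, next_ch, word, curr_row, results, max_cost)
```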
The BK-Tree is a data structure for speeding up the dictionary search, commonly cited as reducing the search time from O(n) toward O(log n), although the actual savings depend on the distance threshold. Compared to the full scan above, a BK-Tree query uses the triangle inequality to skip entire subtrees, so it only has to examine a fraction of the dictionary.
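A minimal BK-Tree sketch, assuming the levenshtein function from the first snippet as the metric (the class and method names are my own):

```python
class BKTree:
    """Metric tree keyed by edit distance; the triangle inequality lets a
    query skip subtrees that cannot contain a close-enough word."""
    def __init__(self, distance_fn):
        self.distance = distance_fn
        self.root = None            # (word, {edge_distance: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]   # descend along the matching edge
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, max_cost):
        """All (distance, word) pairs within max_cost of word."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= max_cost:
                results.append((d, node_word))
            # Only edges labeled within [d - max_cost, d + max_cost]
            # can lead to matches, by the triangle inequality.
            for edge, child in children.items():
                if d - max_cost <= edge <= d + max_cost:
                    stack.append(child)
        return results

tree = BKTree(levenshtein)
for w in ["hello", "help", "hell", "shell", "helot"]:
    tree.add(w)
print(tree.query("helo", 1))  # matches at distance 1: hello, help, hell, helot
```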
Here is an interesting Reddit discussion about this topic.
There is also a post about Lucene, whose fuzzy search is built on a Levenshtein automaton.