Game Development Reference
(t) = “Apples”
LD(s, t) = 1 because one insertion is required to transform s into t.
(s) = “Apple”
(t) = “Ape”
LD(s, t) = 2 because one substitution and deletion are required to transform s
This algorithm has been used in DNA analysis, plagiarism detection, speech recog-
nition, document versioning systems, and spell checking. Levenshtein distance is
also known as a generalization of Hamming distance, where only substitutions are
handled and both strings are of equal length.
One of the biggest drawbacks to the Levenshtein distance is the expensive memory
requirements of the algorithm. While the traditional matrix-based approach is
extremely precise, it is only practical to use on small datasets. As soon as the algo-
rithm is used on a large dataset, the exorbitant amount of memory and comparisons
quickly rules out the traditional Levenshtein implementation for our purposes.
For example, 50,000 bytes compared to a similar dataset will result in around 2.5
billion comparisons, and roughly 250 MB of memory.
Using Levenshtein distance would not completely satisfy our requirements, although
the algorithm definitely influenced the solution I'm presenting in this chapter. The
full source code will not be discussed in this chapter in order to make the text easier
to follow, but the Companion Web site has the full source with comments. The
only methods discussed in this chapter are extracted from the core logic of the dif-
ferencing engine. Please refer to the source code for a stronger understanding of
the implementation itself.
Generating a Difference List
The main processing component of the differencing engine is the logic that com-
pares two source buffers and outputs a sequence of transformations that can
transform source data into target data. The following methods in this section com-
pose the majority of the central logic of the engine provided with this chapter.
The following method is roughly the starting point for building the difference list.
A PatchOperation object is passed in that holds the source and target data buffers.
This method initializes a few public properties and fires the PatchOperation off to the
ProcessRange method. Later on, the PatchOperation is passed into BuildDifferencesList
that finalizes all the difference states into the transformation sequence.