J. Carlos Roldán / Diffchecker: comparing sequences

Sep 17, 2017 / code

Diffchecker: comparing sequences

If you've ever programmed using a repository, you've likely used a diff tool to compare old and new code, identifying which fragments have changed and which are common to both.

Diff checking example

This algorithm might seem simple until we ponder its functionality. For instance, comparing the words schwarzenegger and chuarcheneger, we get the following result:

  BEFORE:  s c h w a r z . e n e g g e r
  AFTER:   . c h u a r c h e n e g . e r
  CHANGES: - = = / = = / + = = = = - = =

In the change, we have removed 2 letters, modified 2 letters, and added one letter. This analysis, not so simple to perform with our natural intelligence, doesn't have a trivial algorithmic solution either.

Longest Common Subsequence (LCS)

In algorithmics, this problem is known as LCS. The goal of LCS is to find the longest subsequence common to two sequences.

As diff tools only compare two strings at a time, there is a solution using a divide and conquer algorithm, which reduces the problem into smaller ones. Thus, the problem can be defined as follows:

def LCS(X, Y):
    if len(X) == 0 or len(Y) == 0:
        return []
    if X[-1] == Y[-1]:
        return LCS(X[:-1], Y[:-1]) + [X[-1]]
    else:
        return longest(LCS(X, Y[:-1]), LCS(X[:-1], Y))

Where longest is a function that simply returns the longer string:

def longest(X, Y):
 return X if len(X) > len(Y) else Y

Having the LCS algorithm (which you can try in Python), it's easy to determine what has been added, deleted, or modified when comparing two strings. However, it's a very slow algorithm. Just by comparing slightly longer strings (like arn. schwarzenegger and ar. chuarcheneger), the execution time skyrockets.

Therefore, using a memory to store results is necessary to avoid computing them more than once. Try it for yourself:

Version without memory: About 7.5 secs.
Version with memory: Instantaneous even for strings 20 times longer.

This is the basis of this algorithm, although certain nuances need detailing for it to function correctly; converting the recursive algorithm into an iterative one to avoid exceeding the recursion limit, using data structures that allow more efficient computation, or tokenizing the sequences to consider that each element can be a line of code or a word instead.

Longest Common Subsequence (LCS)

Join the discussion