Solving the tree-to-tree edit problem

Challenge

How do we support written negotiation? Imagine Alice makes a lot of modifications in a document and then asks Bob to review them. To do a sound review, Bob needs to understand each individual change and agree or disagree on it. For minor changes, a classical "track changes" feature achieves this. However, when Alice re-organizes whole sections inside a deeply nested document and also edits them, "track change" is not enough. Bob loses the ability to review individual changes.

But without an understanding of the changes, Bob and Alice cannot agree.

An algorithm to enable negotiation

The challenge is then: Build an algorithm that presents all changes between two nested documents as a list of separate, intuitive changes which can be individually reviewed.

Approach

A comparison of two structured documents should result in a series of insert, delete, update, and move operations which, if all are accepted, transform one document (Bobs original version) into the other (Alices proposed version).

Make large document changes human-understandable

The algorithm needs to be smart: If Alice deletes a sentence and then re-writes the same sentence somewhere else, it's not useful for Bob to see that one sentence was deleted and a new one was created.

Instead, Bob should see that the sentence was moved - maybe with minor changes. In summary: we are looking for the most human-understandable explanation of the changes.

Literature research

In the literature, this problem is known under the name "Tree-to-tree" correction problem:

A tree here is a mathematical concept: lines connected with dots called nodes, and no loops.
Just like real trees, "no loops" means that starting on any branch, a walking bug has exactly one way off the tree.
Nested documents are mathematical trees, where the nodes are text, and the lines show how sections are nested under other sections.

To find the difference between documents, we then need to solve two problems:

How to recognize nodes that are the same, even if modified and moved?
How to present the transformation of nodes between the documents as a list of individually negotiable changes?

When are nodes the same?

In other words, the key question is: When are two nodes across different document versions the same? To answer this, we might check: Do the nodes look similar? Do they share ancestors? Are they located at similar positions in a tree? Do they share the same children?

Once we know the answer, we need to find a series of transformations that change the document from the original arrangement of nodes to a newly proposed one.

There is diverse scientific literature on this topic. There are no ready-made solutions, different algorithms such as Change Distiller or 3DM emphasize different aspects of a solution.

A pragmatic solution first, then build on it

We started out with a pragmatic solution to these questions, continuously refining the approach and adding more sophisticated functionality as we iterated a product with customers.

The current solution incorporates multiple approaches in both a cooperative and a competitive manner to find the best solution in a wide range of scenarios. Additionally, a three-way comparison gives us information about the document history, necessary for presenting changes in an even more intuitive way. The resulting changes are processed as a special "edit script" containing the individual changes. The document with the overlaid proposals remains editable.

Result

We are currently opening up TreeDiff as a programming API that your software product can consume.

Learn more