This idea is so badass! It takes Simple Tree Matching \cite{journals/spe/Yang91}, extends it to work with HTML, and then recursively searches an unseen document to align it with previously seen examples. The left side of the figure below gives an overview of the *shift* problem, and the right side shows the alignment.
http://i.imgur.com/b8EzP42.png
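
To make the building block concrete, here is a minimal sketch of plain Simple Tree Matching as it is usually described: two trees match only if their roots carry the same label, and the children are then aligned in order with dynamic programming. It covers only the basic algorithm, not the HTML extension or the recursive search from the paper. The `Node` class and the function name are hypothetical, introduced just for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    # Roots with different labels cannot be matched at all.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # score[i][j] = best matching between the first i children of a
    # and the first j children of b.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j],  # skip child i of a
                score[i][j - 1],  # skip child j of b
                score[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    # +1 accounts for the matched roots themselves.
    return score[m][n] + 1


# Toy trees standing in for a previously seen page and an unseen one.
seen = Node("html", [Node("body", [Node("div", [Node("p"), Node("p")]), Node("span")])])
unseen = Node("html", [Node("body", [Node("div", [Node("p")]), Node("span"), Node("a")])])
print(simple_tree_matching(seen, unseen))  # 5
```

In the toy example the matched nodes are `html`, `body`, `div`, one `p`, and `span`. The extra `p` and the new `a` stay unmatched, which is the kind of structural difference the alignment has to tolerate.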