i’ve spent the last week trying to get my brain wrapped around the issues
involved with performing XML comparisons and how they might be solved. i also
decided early on in my reading and discussions that i’d better attack this
problem in stages, otherwise i’d have nothing to show for it for a very long
time. in other words, this is a non-trivial problem. at least for me <g>
so what makes this a hard problem? after all, there’s a ton of “difference”
programs out there, like diff, windiff, tkdiff, etc. why not take the two
XML-documents to be compared and run one of these programs over them and see
what it says? it will certainly tell us something, why can’t we stop there?
the short answer is that while XML is a textual format, it is more then “just
text”. it has structure too, and in order to provide a useful diff this
structure must be taken into account
the rest of this post assumes some familiarity with XML, at least at the
high-level. if you’ve never seen XML before (ie, you’ve spent the last 5 years
under a rock <g>), i’d recommend taking a quick look at
Norman Walsh’s
A Technical Introduction
to XML
let’s take a look at some real XML and see what some differences are, and
aren’t. consider the following two documents:
<foo>
<bar/>
<baz/>
</foo>
<foo>
<bar></bar>
<baz></baz>
</foo>
from a “pure text” point-of-view, these documents are different. in particular,
lines 2 and 3 are different. but from an XML point-of-view, these documents can
be considered identical. they both have a “root” element foo, which in
turn has two child elements, bar and baz. the fact that in the
first document the elements bar and baz are atomic, and in the second
they aren’t, doesn’t change the fact that they are both childless, entityless
tags. hence, the two documents are considered equal. simple, eh? maybe, but
things get complicated quick
next let’s look at the same two documents, but this time the second will now be
different (both in a text and XML way):
<foo>
<bar/>
<baz/>
</foo>
<foo>
<bar a="x"></bar>
<baz></baz>
</foo>
now things have gotten interesting. in the second document, the element
bar now has an attribute a whose value is “x”. so from an XML
point-of-view, these documents are different. “ok, so they’re different, what’s
so hard about that?”. well, the hard part is what do you display? diff
utilities not only tell you if the documents are different, they tell you
how they are different. they use different mechanisms to indicate where
and what type of differences there are. let’s take a look at some choices,
easiest to hardest:
- physical text-diff
this is the easiest to understand (and implement), but it’s the least useful
for XML. in the latter case above, both the bar and baz elements
would be shown as deleted and then inserted. this is due to the line-oriented
viewpoint of a text-diff
- logical XML-diff
now we’re doing better. the baz element would not be shown as
having changed, while the bar element would. but not the entire element
since the only change is that an attribute has been added. so rather then
having the line with the bar element seeming to be deleted and then
added, you’d want to see just the added attribute appearing
you can see an example of this on Microsoft’s
GotDotNet site where they have an online
XML Diff and Patch
tool. if you feed the two documents we’re discussing to their service, they
will perform a logical XML-diff, and display just the added attribute
high-lighted in yellow
but there’s still a problem with this, which is that while you’re now viewing
the difference between the two documents, you’re not seeing either
document. you’re just seeing the change relative to the logical structure of
the documents. why would this be a problem? well, suppose you had changed an
XML document that was being tracked under a source-code control system like
cvs. after seeing the differences, you
might want to merge just some of the changes that have occurred between the
two documents. in the logical view you don’t even know what line the
change occurred on. in a small, simple XML document this wouldn’t be that big a
deal, but what about something more involved, like a
WSDL or a
Schema file? trying to take the
changes you’re interested in and “back” then into one of the files might prove
to be not only tedious, but error-prone as well
btw, please note that i’m not trying to pick on Microsoft’s “XML Diff and
Patch” tool. it’s actually very good. i’m just trying to point out the merits of
the various types of diff strategies, and point to samples where they
exist on the web. as tools go, XML Diff and Patch is quite complete
- physical XML-diff
this one has the potential to be very useful. it would understand not just what
had changed between the two documents from the perspective of XML, but it would
also understand how to relate any discovered differences back to the physical
layout of the original document. so you’d see the original document in the
display, and any changes to be displayed
(red things being deleted,
yellow things being added) would
appear in the right place. the tricky part is how to perform a logical XML-diff
while still somehow preserving the information needed to relate difference back
correctly
having laid out the problem space somewhat, i’d like nothing better then then
to tell you i’ve got the whole thing wrapped up, here’s an implementation, now
go play with it… NOT! unfortunately this problem is kinda tricky, and i’m an
incrementalist be nature. i like to start off with the small, easy parts of a
problem, solve those to get familiar with the issues, and then build momentum
towards the hard parts (or, at least, the parts that seem hard to me)
so i decided to attack this whole area in the order i laid out above. i wrote a
physical text-diff online service
that you can play with right now, or you can
download and install it
yourself on any compatible servlet container. i’ve only run it on WLS, but it
should run elsewhere
if you know how to install a WAR (since that’s how it’s packaged) on your
servlet container, you’re all set. check your servers documentation. for WLS,
the following command (all on one line) will do the trick, assuming you’ve got
the install i documented in my last blog
post. otherwise, make the
appropriate changes for your installation
/bea/jdk141_03/bin/java -cp /bea/weblogic81/server/lib/weblogic.jar
weblogic.Deployer -adminurl http://localhost:80 -username weblogic
-password weblogic -name diff -deploy diff.war
btw, this WAR includes all the source, so you’re welcome to check it out,
modify it, etc. please let me know if you find any bugs, especially if you fix
them <g>
now let’s walk through what it does and how it works. first, you file upload
two documents. in order to perform file upload, i’m using some
Java classes written by
Jason Hunter. it’s a
very useful set of utilities that help the developer with an aspect of servlet
programming that for some reason has never been addressed by the J2EE standard
itself
let’s take a look at the file upload process in more detail. on the client, you
want a form tag whose encoding is “multipart/form-data” and method is
“post”. here’s the relevant part of the HTML
<form ENCTYPE="multipart/form-data" ACTION="phys_text_diff/" METHOD="POST">
<input name="rig" type="file" size="50"/>
<input name="rox" type="file" size="50"/>
<input value="compare" type="submit"/>
</form>
i’ve stripped out any of the formatting, so you can see just the form and the
file upload controls. i use a table for formatting, and that’s pretty gruesome
to look at. you can see the two file upload controls, the submit button and the
form tag wrapping them all
now let’s look at the server side. as you can see from the client part, these
files are arriving as part of an HTML-POST so we’ve got to have our code inside
the doPost(…) servlet method. i could also have done this inside a
JSP, but i got started writing a straight servlet and never looked back. since
that’s not really the point of this exercise, it doesn’t seem to big a deal,
but i might change that later. what do you think?
protected void doPost(HttpServletRequest req, HttpServletResponse resp)
throws java.io.IOException
{
MultipartParser mp = new MultipartParser(req, 1024 * 400);
Part part;
...
while ((part = mp.readNextPart()) != null)
{
if (part.isFile())
{
// save file contents while they're available
FilePart filePart = (FilePart) part;
...
}
}
...
}
again for clarity, i’ve stripped away some of the boilerplate in order to focus
on the details. in this case, you can see a
com.oreilly.servlet.multipart.MultipartParser being created (you have to
hand in the HttpServletRequest and the max file size you want to handle), and
then it loops over each com.oreilly.servlet.multipart.Part testing each
in turn to see if any is a “file”, and if they are, casts the Part to a
com.oreilly.servlet.multipart.FilePart and start pulling out the
contents of each file. note: make sure you drain the contents of each
Part as it’s returned. once you get the next Part, all the contents of the
previous Part are lost. so if you didn’t copy them off, they’re gone. Jason’s
doc is pretty clear on this point, but i still screwed this up in my initial
implementation
if you’re interested in other code samples and information on file-uploads,
here’s something you might want to take a
look at
now that we’ve got the files up on the server, we need to compare them and
report the differences. not surprisingly, this turns out to be a “hard”
problem. for example, let’s imagine a file 10,000 lines long, where each line
is unique with respect to all other lines. then let’s suppose you copy the file
and in the copy, delete the second line. when you compare these two files, you
want to be told that there’s one line missing in the second file. but a valid
(and useless) answer would be that 9,999 lines had been deleted, and a
different 9,998 lines had been added. in a sense, both of these answers is
correct, but only one is useful. we want that one <g>
to understand this better, i spent some time googling around. it turns out that
this problem, comparing long sequences of data to find the useful
differences comes up over and over again, in very diverse settings. for
example, research is being done to compare very long biosequences, in an area
known as Computational Biology. here’s a
paper i found on the
subject
the interesting thing about many of the papers i found is that when you
back-track the references from almost any of them, you either directly or
indirectly arrive at a paper published in 1976 by J. W. Hunt and M. D. McIlroy,
An Algorithm for Differential File Comparison. it’s pretty readable and
lays out the problem quite nicely. in a nutshell, it describes the problem as
taking two files as input and then computing the minimum number of changes
required to transform one file into the other. if this sounds familiar, this
paper was generated to document the research that went into the original
implementation of diff
while i found this information very interesting, i wasn’t looking forward to
turning this (or any of the other) papers directly into working code. i googled
around for implementations i could use, and found
GNU Diff for Java. it’s a port of
GNU Diff into Java. it’s all represented by the class bmsi.util.Diff.
it’s provided as is, and it’s license is GPL. it works, and pretty much does
exactly what i wanted. here’s how i used it
String[] rigLines = (String[]) rigList.toArray(new String[0]);
String[] roxLines = (String[]) roxList.toArray(new String[0]);
Diff diff = new Diff(rigLines, roxLines);
Diff.change change = diff.diff(Diff.forwardScript);
here you see a bmsi.util.Diff being created while taking as input two
String arrays, each array representing a file, and each element of the array
being a line from that file. next, the diff(…) method is called with an
argument indicating that the results should be returned in “forward” order,
from the top of the file to the bottom. the results themselves arrive in the
form of a linked list of bmsi.util.Diff.change objects
btw, looking at the HTML and Java, you’ll notice that i
use the word rig and rox as an identifier or prefix. some of you
will recognize this as a somewhat subtle science-fiction reference. in any case, rig is
short for “original”, and rox is short for “copy”
finally, it’s time to format the results.
int rigLine = 0;
int roxLine = 0;
while (change != null)
{
if (change.deleted != 0) // DELETED
{
// deleted - number of lines in rig
// line0 - first line deleted in rig
// line1 - line before delete in rox
if (rigLine < change.line0)
{
int clearLines = change.line0 - rigLine;
printDiffLines(pw, rigLine, roxLine, rigLines, clearLines, 0);
rigLine += clearLines;
roxLine += clearLines;
}
printDiffLines(pw, rigLine, roxLine, rigLines, change.deleted, -1);
rigLine += change.deleted;
}
if (change.inserted != 0) // INSERTED
{
// inserted - number of lines in rox
// line0 - line before insert in rig
// line1 - first line inserted in rox
if (roxLine < change.line1)
{
int clearLines = change.line1 - roxLine;
printDiffLines(pw, rigLine, roxLine, rigLines, clearLines, 0);
rigLine += clearLines;
roxLine += clearLines;
}
printDiffLines(pw, rigLine, roxLine, roxLines, change.inserted, 1);
roxLine += change.inserted;
}
change = change.link;
}
if (rigLine < rigLines.length || roxLine < roxLines.length)
printDiffLines(pw, rigLine, roxLine, rigLines, rigLines.length - rigLine, 0);
so what’s going on here? it’s not too complicated, but it took me a while to
get this just right (famous last words <g>). first, the code loops over each
item in the change object until there are no more changes to process. the first
tricky point i had to deal with is that while we’re dealing with a list of
changes, i still need to display all the lines in both files. in other words,
if the first change occurs on line 5, i’ve still gotta print lines 1 through 4
the two variables rigLine and roxLine are there to deal with
this. you can see these used inside both the “DELETED” and “INSERTED” sections
of the while loop. the if in both those sections check to see if
there are unprinted lines before the current change. if there are, they are
printed. finally, after the loop terminates, if there are any lines left they
are printed as well
at the bottom of the two sections, the lines of the change itself are printed.
and of course, anytime lines are printed, rigLine and roxLine are
incremented
well, that’s pretty much it. as you can tell, there’s still lots to do. i’ve
got lots more info about other products and libraries that take a crack at this
problem, and others besides. for example, how about supporting merge?
there seem to be several approaches to this. and how about schema? so far i
haven’t really seen anyone take on that problem. still looking. finally, what
about having this available as a web-service? and if you have any ideas,
comments or suggestions, please don’t keep them to yourself. let me know
any thoughts on the diff problem in general? or XML-diff in particular?