Introducing Immanix: a Java library to process XML using parser combinators

Processing XML files is a PITA. There. I said it. Processing XML files using the Java standard XML API, i.e. JAXP is even worse. There are some nicer APIs out there that alleviate this pain, like XPATH, XStream, JAXB, Castor, etc.

But still, they all fail in some point or another, ranging from being verbose and tedious to write to not being able to handle large XML streams.
The latter is a killer for many of the higher level approaches out there. Try using XPATH or any mapping library on a file weighing more than a couple of megabytes.

To handle such cases, we are usually left with StAX. Don’t get me wrong: StAX is not that bad an API, and I’d take it any day instead of JAXP even for small files. It still is a very low level API and parsing the simplest of files requires an impressive amount of code. Also, handling state with StAX is a painful exercice. You usually end up building a full-fledged state machine to do it.

It’s in dealing with such cases that I thought that there must be a better way to do this. And that’s how immanix was born.

So what is immanix ?

As said in the title, immanix is a tiny library to process XML using a parser combinators-like approach.
Parser combinators is a nifty concept that can be found in most functional languages, like Haskell and Scala. They allow building complex parsers to parse text in some grammar by combining smaller and much simpler parsers.

Immanix uses that same approach by providing a handful of atomic parsers, called matchers, and a couple more combinators that take some other matchers and yield a more complex matcher.

A matcher operates on the XML stream and returns a result and the stream in a new position. The result can be a failure in which case the stream is left in its original state or position (no events are consumed) or a success, in which case the stream is advanced to a new position (some events were consumed). The success result also yields some data. What that data is depends on the matcher.

Processing XML with immanix can be thought of as gradually combining matchers until you end up with an uber matcher that can parse the whole XML and execute user specified treatments such as writing into a database along the way.

Immanix is opensource software, available on github and is released under the MIT License.

Atomic matchers

Immanix is built on top of StAX, and thus it is able to handle arbitrarly long XML streams.
Immanix doesn’t hide its roots and hence, the atomic matchers roughly correspond to StAX events:

  • StartDocumentMatcher : Matches the start document event. Returns the raw StAX event as a result if it succeeds.
  • EndDocumentMatcher : Matches compatible mates together. Just kidding, as its name implies, matches the end document event. Returns the raw StAX event as a result if it succeeds.
  • StartElementMatcher : Matches the start element event. Returns the raw StAX event as a result if it succeeds.
  • NamedStartElementMatcher : Matches the start of an element with a supplied name event. Returns the raw StAX event as a result if it succeeds.
  • EndElementMatcher : Matches the end of an element event. Returns the raw StAX event as a result if it succeeds.
  • NamedEndElementMatcher : Matches the end of an element with a supplied name event. Returns the raw StAX event as a result if it succeeds.
  • CharsMatcher : Matches the chars event. Returns the chars (as a String) as a result if it succeeds.

Combinators

The atomic matchers by themselves are not that useful. It’s by combining them in various ways using combinators that powerful abstractions can be built.

The Succ combinator

This is one of the most basic combinators. Given two matchers A and B, it returns a new matchers that only succeeds if A succeeds then B succeeds, in which case A and B’s yielded data are combined into a pair and returned. If any of the delegate matchers fail, the resulting matcher fails and no events get consumed from the stream.

StaxMatcher<Tuple2<B, A>> aThenB = new SuccMatcher<StartElement, String>(a, b);

It is possible to use SuccMatcher to combine more than two matchers. For instance, it is possible to combine 3 matchers A, B and C by combining A and B into AB, and then combining the latter with C. Still, the resulting code can quickly get ugly and verbose, especially with the type signatures of the result’s type, which in the A, B and C example above will end up looking like this:

StaxMatcher<Tuple2<A, B>> aThenB = new SuccMatcher<A, B>(a, b);

StaxMatcher<Tuple2<Tuple2<A, B>, C>> aThenBThenC = new SuccMatcher<Tuple2<A, B>, C>(aThenB, c);

To alleviate this, immanix provides other versions of SuccMatcher that combine 3, 4, and more (up to 10) matchers and returns triplets, quadrants, etc. instead of pairs of pairs …

Using the Succ3Matcher for example, the A, B and C example can be reduced to this one liner:

StaxMatcher<Tuple3<A, B, C>> aThenBThenC = new Succ3Matcher<A, B, C>(a, b, c);

Here is a more concrete example: building a matcher that parses the following XML snippet:

<name>some name</name>

Matching this elements can be thought of matching the start of an element named name, then matching a chars event and finally the end of an element named name. Here how this can be expressed with immanix:

NamedStartElementMatcher<StartElement> s = new NamedStartElementMatcher("name");
CharsMatcher<String> n = new CharsMatcher();
NamedEndElementMatcher<EndElement> e = new NamedEndElementMatcher("name");

Succ3Matcher<Tuple3<StartElement, String, EndElement>> snippetMatcher = new Succ3Matcher<StartElement, String, EndElemen>(s, n, e);

Having clear and self explanatory class names is a good practice in general, but in this case, it results in verbose code. Besides, Java’s lack of type inference makes things worse as the type parameters have to be typed twice in the declaration and the instantiation.
To alleviate this pain, immanix provides factory methods for creating matchers. By statically importing the methods of immanix.StaxMatcher, the previous code snippet can be written :

StaxMatcher<StartElement> s = start("name");
StaxMatcher<String> n = chars();
StaxMatcher<EndElement> e = end("name");

Succ3Matcher<Tuple3<StartElement, String, EndElement>> snippetMatcher = new Succ3Matcher<StartElement, String, EndElemen>(s, n, e);

For the constructing the SuccMatchers, I deemed it more natural not to make it a call to a static method that takes the intermediate matchers but rather an instance method on the first matcher of the chain:

Succ3Matcher<Tuple3<StartElement, String, EndElement>> snippetMatcher = s.then(n, e);

Which is the same as :

Succ3Matcher<Tuple3<StartElement, String, EndElement>> snippetMatcher = s.then(n).then(e);

In practice, one rarely needs all the intermediate results of the underlying matchers when combining them.
For instance, in the previous example, we won’t need the StartElement and EndElement events as only the element’s textual content will be useful for the processing.
For such use cases, immanix provides special versions of SuccMatcher that only keeps either the first or the last intermediate matchers’ result.
One might ask what if the second matcher’s result is the one needed in a chain of three or more matchers. But as will be shown below, this can be still be achieved using only the two special combinators first and last.

The matcher that only keeps its first intermediate matcher’s result is called LeftSuccMatcher. To construct one, an instance method thenl is available in all matchers.
The matcher that only keeps its last intermediate matcher’s result is called RightSuccMatcher. To construct one, an instance method thenr is available in all matchers.

To put this into practice, and using the same example of the name element, its matcher can be written as follows:

StaxMatcher<String> snippetMatcher = s.thenr(n).thenl(e);

See what I did there ? I called thenr on the first matcher passing it the second matcher as an arguments, which yields a matcher that only keeps the second matcher’s result (String). I then called thenl on the resulting matcher passing it the third matcher as an argument, which yields a matcher that only keeps the first result, again, the String.

The Mapper

The Mapper matcher works by wrapping another matcher and working on its result, whether to perform some processing or just to transform it.

Since Mapper involves user specified code, and since Java won’t have lambdas until the end of the world (2012), it is presented as an abstract class that you have to extends to define what should be done with the underlying matcher’s result.
Continuing with the same example, say we’d like to trim the name, in case it starts and/or ends with white spaces or line breaks used for formatting. Here’s how it can be done using the Mapper matcher:

StaxMatcher<String> n = new Mapper<String, String>(chars()) {
    @Override
    public String process(String data) {
        return data.trim();
    }
};

Disjunction

XML files format may be inconsistent, i.e. change in the same file. At a certain position, we might expect an element named A or B, or even C.
To handle such cases, immanix provides a disjunction matcher called OrMatcher taht takes two matchers A and B and returns a matcher that tries A, if it succeeds it returs its result, otherwise it delegates to B and returs its result, be it a failure or a success.

The negation matcher

Thats an odd one. It takes a matcher and returns its inverse, i.e. a matcher that succeeds if its delegate fails (and returns null) and fails otherwise. It might be useful in conjunction with some repetition combinator to express things like “expect N elements not named A” for example.

This matcher can be constructed by creating a new instance of NotMatcher or by calling the instance method not() available on all matchers.

Optionality

To indicate that a matcher is optional, i.e. doesn’t cause an error if it doesn’t match, simply call the instance method optional() on it. This method is available on all matchers and returns a new matcher.

As a side note, in immanix, there isn’t a specific matcher implementation to handle the optionality. It is rather implemented by reusing another matcher, AtMostMatcher.

Repetition

Another class of combinators provided by immanix has to do with repetition. Processing XML files usually involves iterating over a list of some sort of data. To be able to process such files, the basic unit matcher must be augmented to handle multiple items.

  • NTimesMatcher : Augments a matcher so that it expects it to match N times. Aggregates the intermediate results into a java.util.List. Succeeds if and only if the intermediate matcher succeeds N times, N+1 is a failure. This matcher can be constructed by creating a new instance of NTimesMatcher or by calling the instance method nTimes() available on all matchers.
  • AtLeastMatcher : Augments a matcher so that it expects it to succeed at least N times. Throws away intermediate results so as to support large (possibly infinite) numbers of repetitions. As such, intermediate results must be processed on a one to one basis using a Mapper. This matcher can be constructed by creating a new instance of AtLeastMatcher or by calling the instance method atLeast() available on all matchers.
  • AtLeastButFiniteMatcher : Augments a matcher so that it expects it to succeed at most N times. In contrast to AtLeastMatcher, this matcher aggregates intermediate results into a java.util.List and as such must only be used when you know that the number of repetitions will be a reasonable number. Otherwise, you might end up with an OutOfMemoryError.. This matcher can be constructed by creating a new instance of AtLeastButFiniteMatcher or by calling the instance method atLeastButFinite() available on all matchers.
  • AtMostMatcher : Augments a matcher so that it expects it to succeed at most N times. Aggregates intermediate results into a java.util.List. 0 repetitions is a success. This matcher can be constructed by creating a new instance of AtMostMatcher or by calling the instance method atMost() available on all matchers.

Example:

StaxMatcher<String> name = start("name").thenr(chars()).thenl(end("name"));
StaxMatcher<List<String>> names = name.atLeast(1);

Skipping matchers

Sometimes it might be useful to skip some parts of an XML stream. It will be painful to have to build a matcher that describes these parts just for the sake of being able to build the uber matcher. That’s why immanix provides a skip matcher that advances in the events stream until a delegate matcher succeeds. This matcher is called UntilMatcher and its constructor takes the delegate matcher as an argument.
There’s a catch though. immanix’s matchers support backtracking, i.e. storing the events they consume so that they can restore the XML stream to its original state (position) in case of a failure. This can cause a problem with the UntilMatcher as an arbitrarily large number of events can be consumed and stored before the underlying matcher succeeds, which might sky rocket the memory consumption of the program and possibly killing it with an OutOfMemoryError.
As such, UntilMatcher doesn’t support backtracking and won’t restore the stream to its original state if its delegate matcher fails.
But don’t give up yet. If you really need the backtracking feature, there is another version that supports it, at a cost. Its called the UntilButFiniteMatcher. As with UntilMatcher, its constructor takes the delegate matcher as an argument, plus a number N, the give up barrier: If after reading N events the delegate matcher didn’t succeed yet, the whole matcher will fail.

There must be a catch

Immanix is still in a rough shape. I’ve successfully used it on 2 occasions to parse moderately complex XML streams (one of which wheighed 900MB) but that doesn’t mean it is perfect nor bug free.
In my opinion, immanix’s Achilles’ heel right now is error reporting. It is no fun when the uber matcher constructed by combining a couple dozen intermediate matchers fails and only prints a short error message stating that a failure occured somewhere in the matchers graph.
I’m planning to fix this eventually by maintaining something like Java’s stacktraces.

For the brave who made it to here …

I salute you. A follow-up post that I’ll publish shortly will put all this together by using a concrete example. It is not as bad as it looks, and immanix does indeed make processing XML not a punishment and still being able to handle arbitrarily large streams.

Leave a comment