Tuesday, March 20, 2018

Rich Reducers

Writing parsers is not for humans. Humans define grammars, and feed those to parsers. However, I was determined to try my hand at making a parser for XQuery in XQuery as a human, because I saw that the language is actually quite consistent. The same is true for XML and S-expressions, for example, and for those languages there already exist hand-built parsers (for example in in JavaScript, see https://github.com/isaacs/sax-js and https://github.com/fwg/s-expression respectively).

After all, how hard can it be? Well pretty damn hard, as I've worked on it on and off for two years. Yet only recently I came up with three basic utility functions that help to understand how a parser actually works. These functions are variations of the widely known reduce (or fold) function.

Looking ahead

What you often need to do when parsing a string of text – character by character – is to look at what the next character is. For example, a colon has several different meanings in XQuery, so the parser needs some context to decide if a colon is part of a qualified name, a comment or an assignment. This can only be decided by "peeking" at the next character. A "peek" method is often available under this name in a parser in some object-oriented environment, but not in a strictly functional language like XQuery. So, time to introduce a function that actually does what I need: reduce-ahead.

The function I've created in XQuery could obviously also be written in other languages. Let's start by recapping what the regular reduce function does. Given some list-like structure (e.g. an array, a sequence, or a stream), and a starting value, apply a function to each item in the list, and return the accumulated result. Example:

  1. A list: (1,2,3,4)
  2. reduce with starting value: 0
  3. A function that adds each value to the accumulative result:
    (accumulative,currentItem) => accumulative + currentItem

Now each call will return an intermediate result for each item in the list:


iterationaccumulativecurrentItem
101
212
333
464

After the last function call, the result will of course be 10.

My reduce-ahead function simply adds an extra parameter to the accumulating function, that contains the next item in the list. With this function we could return another value depending on which item comes next in the list. The following function doesn't add when the next item equals 4:
(accumulative,currentItem,nextItem) => nextItem == 4 ? accumulative : accumulative + currentItem


iterationaccumulativecurrentItemnextItem
1012
2123
3334
434

Now the outcome will be 7.

This functionality is easily extrapolated to a parser, where it can look ahead whenever a character should be disambiguated. Other variants in the same spirit are reduce-behind, that provides the previous item in the list, and reduce-around, that provides both the next and previous item.

Especially when reducing streams this is quite useful, because you can't simply (read: efficiently) inspect a stream by using an index. Apart from hand-made parsers other use cases for this powerful principle may pop up in the future.

JavaScript implementation: https://gist.github.com/wshager/df8e0c67281068cd5d01f30452df7287

No comments:

Post a Comment