Sunday, April 23, 2017

JSON, the final frontier: handling semi-structured data the JavaScript way

One of the challenges in JavaScript remains that there's no convenient way to traverse JSON. Subsetting or modifying complex hierarchies usually requires some sort of custom recursive strategy, which tends to lead to heavy mental exercise and many errors. Although JSON has pushed XML out of the web stack, there's still some XML legacy to be dealt with. So why not scour the pile of remaining junk for some nifty scrap parts to reuse? Let's travel back to Tatooine...

XPath seems to have had an important role in traversing the DOM, the abstract Document Object Model resulting from interpreting XML documents. It certainly had some useful features that CSS, for instance, lacks, like traversing back up a tree. Unfortunately, the W3C recommendation of XPath is not suitable for JSON.

You might argue that the W3C deemed XPath unfit for JSON, but perhaps it was never given a fair chance. There has been an initiative to create an XPath-like DSL for JSON called JSONPath, but it has probably been collected by a sandcrawler. In this post I would like to explore both the possibility of XPath traversal for JSON and an implementation in JavaScript.

Let's introduce some data:

<contrib-group>
  <contrib contrib-type="author">
    <name name-style="western">
      <surname>McCrohan</surname>
      <given-names>John</given-names>
    </name>
    <aff>Center for Devices and Radiological Health</aff>
  </contrib>
</contrib-group>

By Jabba's beard, that's XML! Haven't seen that on the web since 2013! Yep, and the XPath to retrieve the author's given-names from these ancient runes would have been

//given-names

Wow. Try doing that in JavaScript! Now, let's inspect what really happened here. The double slashes actually shortcut the expression straight down to all elements named "given-names", in the order that they appear in the document. This is called an axis step, and expanded the expression looks roughly like this (strictly speaking, // abbreviates /descendant-or-self::node()/, but here the result is the same):

/descendant::given-names

Furthermore, instead of the string "given-names", the engine finds a curse word called a QName, which is a "qualified name" in a certain "namespace". Ugh. Never mind, it just means that the engine searches for a node of type element, 'cause those are denoted by a QName. So it will test for that type:

/descendant::element(given-names)

Is there a way to write this expression for JSON? Let's see and have a go at modeling the same data. First, we can get rid of the group-level, because, hey, we have arrays. The root doesn't need a name, so we'll just write an object, and since arrays could contain multiple contributors, let's refer to it as "contribs".

{"contribs":[]}


Secondly, we have to find a place to put that abomination called an attribute.

{"contribs":[
  {"type":"author"}
]}


Easy! Now the rest.

{"contribs":[
  {
    "type": "author",
    "name": {
      "style": "western",
      "surname": "McCrohan",
      "given-names": "John"
    },
    "aff": "Center for Devices and Radiological Health"
  }
]}
Here's a straightforward mapping, glossing over some details I'll address later on. How to find the same data with our simple expression? Can we just have the expression address the key "given-names" in our "contrib" object? Not quite. While modeling the data as JSON we actually shifted the tree one level. Instead of mapping the "given-names" element to an object, we actually mapped it directly to a string!

Object values are not always objects. And in XML, something called text nodes existed, but they didn't have QNames! Let's just pretend that the string value is a text node and that we can just select any node that satisfies our name test. In JSON, a node can be anything: objects, arrays, strings, numbers, booleans and null. Note also that object keys need not be QNames: any string will do, including one that looks like a number.

/descendant::node(given-names)
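
To make the idea concrete, here is a minimal sketch of what such a descendant step could do on plain JSON. The helper name `descendants` is made up for this illustration, and it only does a name test on object keys, nothing more:

```javascript
// Hypothetical helper: collect every value stored under `name`, at any depth,
// in document order (a rough analogue of /descendant::node(name)).
function descendants(node, name, found = []) {
  // Strings, numbers, booleans and null have nothing below them.
  if (node === null || typeof node !== "object") return found;
  for (const [key, value] of Object.entries(node)) {
    if (key === name) found.push(value);
    descendants(value, name, found); // keep descending into objects and arrays
  }
  return found;
}

const contribData = {
  contribs: [{
    type: "author",
    name: { style: "western", surname: "McCrohan", "given-names": "John" },
    aff: "Center for Devices and Radiological Health"
  }]
};

const givenNames = descendants(contribData, "given-names");
console.log(givenNames); // [ 'John' ]
```

Note that the result is a list, not a single value: like the XPath original, the step returns everything that matches.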

We just ruined the beautiful old data model for our selfish purposes... but created a cool new spaceship while doing so. At least now we can take it all the way without trying to be correct. Instead of putting our "given-names" expression in hyperdrive to select all descendants, we could've also been more precise and navigated the tree one step at a time. Since the engine then doesn't have to visit every node in a document, this can give a huge performance boost. For the XML fragment this would be:

/contrib-group/contrib/name/given-names

In JSON you have to be aware that items in an array don't have names, so we have to select an item by its index, in this case the item at index 0. However, the XML people were afraid that starting to count at zero would certainly confuse some Ewoks, so they decided to start counting at one. Since their wisdom was unfathomable, let's just go with it:

/contribs/1/name/given-names

Or, expanded:

/child::node(contribs)/child::node(1)/child::node(name)/child::node(given-names)
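
As a rough sketch of those child steps in plain JavaScript, assuming a made-up helper called `childPath` that treats numeric steps as 1-based array positions:

```javascript
// Hypothetical helper: follow a list of child steps from the root.
// Numeric steps are 1-based (XPath style) when the context is an array.
function childPath(doc, steps) {
  let context = doc;
  for (const step of steps) {
    // A leaf has no children, so the path cannot continue.
    if (context === null || typeof context !== "object") return undefined;
    context = Array.isArray(context) && typeof step === "number"
      ? context[step - 1] // translate XPath position to JavaScript index
      : context[step];
  }
  return context;
}

const contribData = {
  contribs: [{
    type: "author",
    name: { style: "western", surname: "McCrohan", "given-names": "John" },
    aff: "Center for Devices and Radiological Health"
  }]
};

// /contribs/1/name/given-names
const givenName = childPath(contribData, ["contribs", 1, "name", "given-names"]);
console.log(givenName); // John
```

Only one branch of the tree is visited here, which is where the performance gain over the descendant search comes from.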

Finally, I'd like to look at some more arcane XPath stuff that I would call fat filters. In JavaScript we'd have to either create a loop to filter or use the built-in filter method on arrays. But in XPath we get that for free. In addition we get a reference to the context, in the form of a dot, but also some context-dependent functions, for example "position", which refers to the position of the item we're filtering. If we had a nodeset that contains multiple contribs, and all we wanted were those of type "author", we could write the following for XML:

/contrib[@contrib-type = "author"]

Expanded this looks more or less like

/child::element(contrib)
+
filter(equals(./attribute::attribute(contrib-type),"author"))

Very straightforward. Like any filter, the engine will simply test each node in the current selection by calling a function that returns either true or false, leaving out the ones that return false. However, in the case of JSON the context we selected above would refer to the array, not an item in it. Luckily the old XML gods have foreseen this and provided a wildcard to select any node for the current axis step. It's like a little bright star on the horizon.

Normally we associate wildcards with name globbing, and not so much with a range of numbers. But droids are smart, right? So let's just have the engine figure out what to do when it encounters a wildcard. In case it encounters an array, it will simply convert the array to a sequence of objects. A sequence looks and feels somewhat like an array, but is actually a monadic type... Very dark and ancient magic indeed, not for the faint of heart and certainly outside of the scope of this galaxy. Suffice it to say a sequence can contain zero, one or more items, and everything is a sequence. Anyway, we can now filter anything:
/contribs/*[type = "author"]

Or
/child::node(contribs)/child::node(*)
+
filter(equals(./child::node(type),"author"))
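
For comparison, this is roughly what the wildcard plus filter boils down to in plain JavaScript. The helper `anyChild` is a made-up name, and the second contributor in the sample data is invented just to give the filter something to throw away:

```javascript
// Hypothetical wildcard step: all children of the context node,
// whether they are array items or object values.
function anyChild(context) {
  if (context === null || typeof context !== "object") return [];
  return Array.isArray(context) ? [...context] : Object.values(context);
}

// Sample data; the second contributor is invented for this example.
const contribData = {
  contribs: [
    { type: "author", name: { surname: "McCrohan", "given-names": "John" } },
    { type: "editor", name: { surname: "Doe", "given-names": "Jane" } }
  ]
};

// /contribs/*[type = "author"]
const authors = anyChild(contribData.contribs)
  .filter((context) => context.type === "author");

console.log(authors.map((a) => a.name.surname)); // [ 'McCrohan' ]
```

The predicate receives each item as its context, which is the role the dot plays in the XPath version.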

Phew, point made! Just makes you wonder: how the hell could anyone come up with something so alien as JSONPath? We may never know. They were probably from a planet destroyed by a death star. Mercenaries perhaps, who cares. On to the last phase, implementing it into your brain.

To give the functionality an appropriate name I'm just going to go with "select", since we're selecting something. A handy helper will be a seq function, a factory making Sequences. We're also going to need a way to handle sequences, since that's what the select function will return. So instead of using the built-in JavaScript string comparison, we need a function "equals" that will compare a sequence with something else. I'm proud to present the "select" function:
function select(jsonDocument, ...paths) => Sequence

Where paths is an array of "steps" through the tree, each performed in turn on a smaller selection. Next we'll need to have some axis functions, like "child" or "descendant" above, but there are actually 13 axes in total. Let a step be some combination of an axis and a node type test. Both will be identified by our engine to perform the necessary operations on the current context. The first two examples from our exploration would translate to JavaScript like this:
select(contribData, seq(descendant(),node("given-names")))

select(contribData, seq(child(),node("contribs")), seq(child(),node(1)), 
  seq(child(),node("name")), seq(child(),node("given-names")))

For the filter example, we could just filter the result from a selection by writing something like:
filter(
  select(contribData, seq(child(),node("contribs")), seq(child(),node("*"))), 
  function(context) {
    return equals(select(context,seq(child(),node("type"))), "author");
  }
)

However, with XPath it's possible to just continue making the selection narrower, so it would be nice if we could provide the "select" function with some mechanism that doesn't filter the entire result right away, but processes the current subset once the step is encountered. For this we could make filter a function with just the filtering part, and not the input yet.
select(contribData, seq(child(),node("contribs")), seq(child(),node("*")), 
  filter(function(context) {
    return equals(select(context,seq(child(),node("type"))), "author");
  })
)
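
To show that the pieces fit together, here is one possible minimal sketch of these functions. It is not a real engine: a sequence is modelled as a plain array, only the child and descendant axes exist, and the internals (such as the `entriesOf` helper and the `{ predicate }` wrapper) are assumptions made for this illustration.

```javascript
// Assumption: a sequence is just an array here.
const seq = (...items) => items;

// Hypothetical helper: the [key, value] pairs directly below a node.
// Array items get 1-based positions as keys, XPath style.
const entriesOf = (n) =>
  n === null || typeof n !== "object" ? []
    : Array.isArray(n) ? n.map((value, i) => [i + 1, value])
    : Object.entries(n);

// Axes: functions from a context node to candidate [key, value] pairs.
const child = () => (n) => entriesOf(n);
const descendant = () => function walk(n) {
  return entriesOf(n).flatMap(([k, v]) => [[k, v], ...walk(v)]);
};

// Node test: "*" matches anything, otherwise compare against the key.
const node = (test) => ([key]) => test === "*" || String(key) === String(test);

// Filter step: wrap the predicate so select can recognise it.
const filter = (predicate) => ({ predicate });

// True if any item in the sequence equals the given value.
const equals = (sequence, value) => sequence.some((item) => item === value);

function select(doc, ...paths) {
  let context = seq(doc);
  for (const path of paths) {
    if (path.predicate) {
      // Filter step: narrow the current selection.
      context = context.filter((item) => path.predicate(item));
    } else {
      // Axis step: expand along the axis, then apply the node test.
      const [axis, test] = path;
      context = context.flatMap((item) =>
        axis(item).filter(test).map(([, value]) => value));
    }
  }
  return context;
}

const contribData = {
  contribs: [{
    type: "author",
    name: { style: "western", surname: "McCrohan", "given-names": "John" },
    aff: "Center for Devices and Radiological Health"
  }]
};

const authors = select(contribData,
  seq(child(), node("contribs")), seq(child(), node("*")),
  filter((context) => equals(select(context, seq(child(), node("type"))), "author")));

console.log(authors.length); // 1
```

Because the filter is just another step, further steps can follow it and will only see what survived the filter.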

We could also create a filter method on Sequences, which is fine but less in the spirit of our XML ancestors. Furthermore, there's a lot more going on with types than I can describe here, as the XPath datamodel uses the entire XML Schema typeset. So keeping this a bit more open-ended may well prevent trouble later.

I promised to come back to some details with the test data I glossed over. I didn't mention that the XML fragment is part of JATS, the journal article tag suite standard for publishers. In this standard, believe it or not, the "aff" element, standing for affiliation of a contributor, is allowed to appear both under the level of the contrib-group and under a contrib! Woah, for real? That means our mapping to JSON was rather incomplete, not to mention plain wrong, because we changed the contrib-group to a contribs array, and since arrays aren't objects, we can't put the "aff" key under the "contribs".

Ok, we had been warned... but how can we fix it?! Well, arrays can't have strings as indices, but objects can certainly have keys that look like numbers (in JSON they still have to be written as quoted strings). So, we could turn the array into an object and use numbers as keys:
{"contribs":{
  "1":{
    "type": "author",
    "name": {
      "style": "western",
      "surname": "McCrohan",
      "given-names": "John"
    }
  },
  "aff": "Center for Devices and Radiological Health"
}}
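
A quick check of how JavaScript treats such keys, as a small sketch: a bare number is accepted as a key in an object literal, but it is stored as a string, and it comes out quoted when serialized to JSON.

```javascript
const contribs = {
  1: { type: "author" }, // a bare number is allowed in a JavaScript literal...
  aff: "Center for Devices and Radiological Health"
};

// ...but the key is really the string "1".
console.log(Object.keys(contribs)); // [ '1', 'aff' ]

// Both lookups reach the same entry, so a step like node(1) can still find it.
console.log(contribs[1] === contribs["1"]); // true

// Serialized as JSON, the key is a quoted string.
const json = JSON.stringify({ 1: "a" });
console.log(json); // {"1":"a"}
```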

Perfect! Everything still works. One more minor thing: the fact that the "aff" element could appear at two levels suggests that besides the "contrib" element, the "contrib-group" element was repeatable as well, so you could group contributors with the same affiliation. This means our model is incorrect at that level now too. Blast! How clever, these XML folks! Well, no way around it anymore, we just have to accept the fact that XML was more flexible than JSON in this respect. Perhaps we could just introduce a minor convention to model elements in JSON, so we could still use the XPath engine we just created.

For starters, we could use some reserved keys to model QNames, attributes and child nodes. Let's use dollar signs, because they're on your keyboard (I think), aren't allowed in XML and are already reserved elsewhere in JSON land:
{
  "$qname": "contrib-group",
  "$children": [{
    "$qname": "contrib",
    "$attrs": {"contrib-type": "author"},
    "$children":[{
      "$qname":"name",
      "$attrs": { "name-style": "western"},
      "$children": [{
        "$qname": "surname",
        "$children": ["McCrohan"]
      },{
        "$qname":"given-names",
        "$children": ["John"]
      }]
    }]
  }]
}
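
Walking this element-style model needs a slightly different traversal, since names now live in "$qname" rather than in object keys. A minimal sketch, with the helper names `descendantElements` and `textOf` invented for this illustration:

```javascript
// Hypothetical helper: find all descendant elements with a given $qname,
// in document order (a rough analogue of /descendant::element(qname)).
function descendantElements(element, qname, found = []) {
  for (const child of element.$children || []) {
    if (child !== null && typeof child === "object") {
      if (child.$qname === qname) found.push(child);
      descendantElements(child, qname, found);
    }
  }
  return found;
}

// Hypothetical helper: concatenate the string children (the "text nodes").
const textOf = (element) =>
  (element.$children || []).filter((c) => typeof c === "string").join("");

const doc = {
  $qname: "contrib-group",
  $children: [{
    $qname: "contrib",
    $attrs: { "contrib-type": "author" },
    $children: [{
      $qname: "name",
      $attrs: { "name-style": "western" },
      $children: [
        { $qname: "surname", $children: ["McCrohan"] },
        { $qname: "given-names", $children: ["John"] }
      ]
    }]
  }]
};

// //given-names
const names = descendantElements(doc, "given-names").map(textOf);
console.log(names); // [ 'John' ]

// //contrib[@contrib-type = "author"]
const authorContribs = descendantElements(doc, "contrib")
  .filter((c) => (c.$attrs || {})["contrib-type"] === "author");
console.log(authorContribs.length); // 1
```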

Oh dear... This example is already quite terse, but it clearly shows how much more memory was needed back when XML was the preferred format. And all this just so that some information could be repeated without creating redundancy... Hey, wait a minute, we don't have much redundancy nowadays anyway. And even before XML this wasn't a problem, when we already had the same tabular data we still use today. Because redundancy is minimized by creating relationships between pieces of data we want to reuse, right?

So why did the JATS folks do it like this? Perhaps because they thought someone would be editing this by hand, and could introduce minor discrepancies when repeating information. Editing XML by hand... cruel times indeed. Or perhaps it was done because they already thought about the presentation of the data, and it would be unexpected for a reader to find the same information repeated, and so they forgot about separation of concerns. We will never know. It seems modeling data has always been a challenge, and nobody gets it right the first time.

But that's for another chapter in this very, very, very, very, very, very, very, very long saga.
