Wednesday, April 18, 2018

Trie Demo

This demo I made for a job application. Never heard from the guy again, but at least I have a nice demo.

https://codepen.io/wshager/full/zWBzMV/

It says "directed acyclic word graph", which I think is the correct term, but I just found out on Wikipedia that this is indeed a trie. The difference is illustrated quite precisely, but I think it's really the same thing. I didn't use an existing algorithm, but I just made this up myself. I didn't study computer science (and I'm mostly glad I didn't). If you did, let me know if it's any good.

Tuesday, March 20, 2018

Rich Reducers

Writing parsers is not for humans. Humans define grammars, and feed those to parsers. However, I was determined to try my hand at making a parser for XQuery in XQuery as a human, because I saw that the language is actually quite consistent. The same is true for XML and S-expressions, for example, and for those languages there already exist hand-built parsers (for example in in JavaScript, see https://github.com/isaacs/sax-js and https://github.com/fwg/s-expression respectively).

After all, how hard can it be? Well pretty damn hard, as I've worked on it on and off for two years. Yet only recently I came up with three basic utility functions that help to understand how a parser actually works. These functions are variations of the widely known reduce (or fold) function.

Looking ahead

What you often need to do when parsing a string of text – character by character – is to look at what the next character is. For example, a colon has several different meanings in XQuery, so the parser needs some context to decide if a colon is part of a qualified name, a comment or an assignment. This can only be decided by "peeking" at the next character. A "peek" method is often available under this name in a parser in some object-oriented environment, but not in a strictly functional language like XQuery. So, time to introduce a function that actually does what I need: reduce-ahead.

The function I've created in XQuery could obviously also be written in other languages. Let's start by recapping what the regular reduce function does. Given some list-like structure (e.g. an array, a sequence, or a stream), and a starting value, apply a function to each item in the list, and return the accumulated result. Example:

  1. A list: (1,2,3,4)
  2. reduce with starting value: 0
  3. A function that adds each value to the accumulative result:
    (accumulative,currentItem) => accumulative + currentItem

Now each call will return an intermediate result for each item in the list:


iterationaccumulativecurrentItem
101
212
333
464

After the last function call, the result will of course be 10.

My reduce-ahead function simply adds an extra parameter to the accumulating function, that contains the next item in the list. With this function we could return another value depending on which item comes next in the list. The following function doesn't add when the next item equals 4:
(accumulative,currentItem,nextItem) => nextItem == 4 ? accumulative : accumulative + currentItem


iterationaccumulativecurrentItemnextItem
1012
2123
3334
434

Now the outcome will be 7.

This functionality is easily extrapolated to a parser, where it can look ahead whenever a character should be disambiguated. Other variants in the same spirit are reduce-behind, that provides the previous item in the list, and reduce-around, that provides both the next and previous item.

Especially when reducing streams this is quite useful, because you can't simply (read: efficiently) inspect a stream by using an index. Apart from hand-made parsers other use cases for this powerful principle may pop up in the future.

JavaScript implementation: https://gist.github.com/wshager/df8e0c67281068cd5d01f30452df7287

Friday, March 16, 2018

Pleidooi voor een meer formele benadering van programmeren

Programmeren is zowel een wetenschap als een gereedschap. De "man van de praktijk" loopt gaandeweg tegen problemen aan die formeel allang verkend zijn. Deze zaken zijn veelal uitgewerkt in ingewikkelde wetenschappelijke publicaties. Dat niet iedereen die leest is begrijpelijk, maar veel kennis is ook geïmplementeerd in functionele talen, als Haskell, Scala en recente LISP/ML varianten. Om inzicht te krijgen in fundamentele concepten is het verstandig om contact te houden met de ontwikkelingen in dergelijke talen. Ook in Javascript begint deze kennis door te dringen, al is dit geen zuivere functionele taal...

Kennis van zuiver functioneel programmeren opdoen kan door zelf een taal te leren of door bronnen te raadplegen op het WWW, zoals diverse blogs en tutorials (er zijn ook veel videos) of hulp te vragen van experts op StackOverflow. Het doel is niet om zo'n taal te gebruiken, maar om concepten te doorgronden die eraan ten grondslag liggen. Dit geldt overigens ook voor XQuery, een taal met een veel minder steile leercurve dan bovengenoemde. Maar XQuery heeft wel een aantal beperkingen, die blijven voortbestaan door het isolement waarin de taal zich bevindt, namelijk in de impopulaire wereld van XML. Echter is de taal niet in isolement ontstaan, en er zijn ook hierin verborgen juweeltjes die je kunt vinden door simpelweg met de taal te werken.

De verborgen formele oorsprong van XQuery zit in zaken als:

  • de monadische aard van sequences: alle data is veilig “ingepakt” zonder dat je daarover hoeft na te denken
  • for-comprehensions: ingepakte data zijn intuitief uit te spreiden en zo te verwerken
  • parallelle computatie: sequences lenen zich uitstekend voor impliciete luie evaluatie en het afhandelen van asynchrone processen

Ook Michael Kay van Saxon, de bekendste en meest gebruikte implementatie van XQuery (en XSLT), heeft toegegeven dat een XQuery implementatie zonder enige vorm van luie evaluatie slecht zal performen. Een voorbeeld van een implementatie zonder luie evaluatie is eXist-db: er is geen verschil tussen "complexe-computatie()" en "complexe-computatie()[1]". De functie wordt in zijn geheel uitgevoerd, voordat alleen het eerste item in de sequence wordt uitgevraagd. Kay maakt zelf overigens ook fouten, volgens mij door de theorie niet goed genoeg te bestuderen (hij noemt zichzelf beslist praktisch). Er is wel een formele specificatie van XQuery, maar die is weinig leerzaam in dit opzicht:
https://www.w3.org/TR/xquery-semantics/

De voordelen van XQuery zijn tegelijkertijd ook een nadeel. Het maakt gebruik van formele concepten, die zijn weggestopt voor de "eindgebruiker". Dit kan enerzijds gebruiksgemak vergroten, maar zorgt er anderzijds voor dat je je eigenlijk nauwelijks ontwikkelt als programmeur. Ik denk dat het beter is om zaken op de één of andere manier expliciet te maken of ergens uit te leggen. Maar ik denk nog meer dat het alleen mogelijk is om concepten te doorgronden door ze zelf te bouwen. Dat is niet mogelijk in een taal die is geïmplementeerd als black box. We moeten daar dus af en toe uit zien te komen.

Ik zie in JavaScript een zeer toegankelijke taal, die door velen wordt omarmt en waarin het mogelijk is om allerlei concepten uit te werken, voor zover ze nog niet zijn toegevoegd aan de taal zelf. Ik zou dit graag met anderen samen doen die JavaScript beheersen en openstaan voor een meer formele benadering, maar tegelijk realiseer ik mij dat iedereen hierin zijn eigen ontwikkelingsweg heeft te gaan. Ik denk echter dat het op dit moment voor iedereen die van software zijn beroep heeft gemaakt, noodzakelijk is om de volgende zaken te verkennen:

  • Reactive Extensions for Javascript / Observables / Streams
  • Immutable datastructuren
  • Een vorm van type-aanduiding
  • Code-generatie op basis reeds bekende, overzichtelijke tools, zoals formulieren, spreadsheets of gestructureerde editors, vooral in plaats van de ouderwetse database / CMS / CRM


Ik denk dat we zelf begrip moeten opbrengen voor deze zaken, anders raakt de kennis uit ons blikveld. Niet iedereen zal op hetzelfde moment en met dezelfde controle de kennis aan kunnen wenden, maar waar het nu om draait in software is precies bekend. Net als in zelfverdediging zijn er maar een paar dingen die je moet weten om jezelf enigszins te kunnen verdedigen. Je kunt zeggen: "ik bouw software, maar ik hoef niet alles zelf te begrijpen". Je kunt je uiteraard aan de confrontatie onttrekken... totdat er toch iets op je pad komt. Maar dan heb je in ieder geval de noodzakelijke gereedschappen om daarmee om te gaan.


"RxJS is een hamer"

Sunday, May 14, 2017

Forms as state containers Part 4: form generation

This is a small series on HTML forms. The previous post was Forms as state containers Part 3: validation.

With form generation we leave the realm of functional form controls. In addition to barebones client logic, we have to focus on the User. Many programmers tend to forget that. However, the devil is in the details. You may just want to add a simple form to capture an email address for a newsletter subscription, or an entire multi-lingual webshop with checkout system, it's all the same, really. No matter the use case, the accessibility, look & feel and ease of use of the app are just as important as the validity of the data. Unfortunately, this is were many applications fail, as we've all experienced at some point.

There's not one way to solve the "forms problem", as there's an insurmountable gap between user, designer and programmer. In my opinion, the ultimate goal is: enough flexibility, with the ability for designers to rapidly create working prototypes. But how to cater to this, without resorting to some wysiwyg editor that just spits out semi-structured garble?

Clearly, designers shouldn't be bugged about the stuff developers want, like data types and program logic. So, hereby I distance myself from things like JSON Schema-based form generation, as it's too far removed from the common use case. Instead, I propose to include specialized pieces of markup that provide a sensible starting point for designers to work with, as I will outline below.

Instead of just using the built-in HTML form controls, I find it more useful to explain a form in terms of structure. It shouldn't be hard to explain to a non-programmer what is a text value, a fieldset or a repeated item. Like mentioned earlier, the link is also an important structure for generating forms. Finally, a structure that isn't even available in JSON Schema is an element, even though it's the kind of structures we're building!

With structural types it's possible to generate forms as well as validate them, without having to build an entirely new system. These structures can be coerced to real data types later on, as is exactly the way form controls like input work anyway. Why not use elements like input and other HTML directly, you ask? Well, because we're actually programming! But we don't want the designers to know that ;-) Seriously, though, using HTML is fine, but to make the distinction is vital. Allow me to explain.

Once there was a programming language written purely in markup... Whoah, really?! Yep, there's a language that actually required writing markup by hand. It's called XSLT. One of the problems with XSLT, however, is that it requires tons of knowledge on how to use it. In fact, it's so complex, that although a language never needed a user interface more badly, it can't ever be built (not really, anyway). 

I want to create something simple, a small subset that's just for the purpose of creating forms programmatically, but written in markup, so it can be processed like... well, however you normally process HTML (you know, by hand?). Also, XSLT is XML, which is considered to be too strict for the web, so nowadays we have HTML5... which is exactly the same thing. Oh well, nobody seems to have another answer to XML yet, but that's a discussion for another time. First, let me present "XSLT in HTML".

<fieldset name="personal-info">
  <x name="firstname" />
  <x name="lastname" />
  <x name="email-address" />
  <x list="someid" name="hobbies[]" />
  <link href="/datalist#someid" rel="import" />
</fieldset>


Wait, isn't it kinda like the "schema" in my previous post? It is, but I replaced input with x, because it denotes a "text value" structure. The input is only one control that provides a text value, but there's also select (or, rather, selectOne), textarea, and maybe there's more. Instead, I use x, which is short for "text value". The above piece of markup can be inserted in an HTML document, and when processed with my specialized software, it can simply be expanded into a form that has some basic layout. Additional elements like labels are generated on the fly, just like you would with any "web component".

While the designer can decide upon the appearance of a form, the developer can add data types and constraints that are required for validation. I think a designer is helped with the ability to select the appearance from a template. For example, an enumerated control takes some preset options from a list, but designers shouldn't have to concern themselves with how those options arrive in the form. They shouldn't have to decide between a select or a radio button group: when there are few options, a radio button group will suffice, when there are too many options to view at once, another solution is obvious.

When a more fine-grained approach is required, the designer should be able to style the constituent parts of a form component. In that case, the expansion of the component's content could be further specified, instead of relying on the default. 

Since our "program" is just markup, we have a lot of flexibility. In fact, I'd argue that this will one day be part of HTML, as a standard way of creating logic that is simpler than javascript, which has become a general purpose language. And while that may make programmers happy, it's too complex for the common use case.

You may not be very impressed by any of this, but just wait for the next and last post in this series, and I promise you will be.

Saturday, April 29, 2017

Forms as state containers Part 3: validation

This is a small series on HTML forms. The previous post was Forms as state containers Part 2: managing complex data.

A form is a schema

Since HTML5 there are a lot more attributes avaiblable on form components for expressing constraints. This means forms can nowadays be used as a basic schema for validating data. However, HTML5 form components are a motley crew, so we need some guiding principle to actually make sense of it all. JSON Schema provides a coherent set of constraints for JSON data. And because JSON is just javascript, JSON Schema should be readily applicable to HTML forms as well. You may not know or like JSON Schema, or you may not want to learn it because it isn't a standard (yet), so in this post I'll try to patch up things by merging JSON Schema into the HTML forms standard.

To start it off, some very useful keywords from JSON Schema actually translate directly to HTML5 form constraints. For this purpose I'm considering the version 0.4 draft.


JSON SchemaHTML5 form constraintPurpose
maximummaxUpper limit for a numeric instance
minimumminLower limit for a numeric instance
maxLengthmaxlengthMaximum length of characters in a string1
patternpatternRegular expression pattern to match against
requiredrequiredThe value must be present2
patternpatternRegular expression pattern to match against
enumselect / optgroup / datalistA list of possible values3

Notes:
1. I'm not sure if surrogate pairs are counted as 1 or 2 characters in none, one or both specifications.
2. There's a slight difference, as the keyword in JSON Schema expects an array of properties that are required in an object. Also, a form component can be present, but not filled in, while a property can only be either present or absent. We can however treat both specs as equal in most cases.
3. In HTML an enumerated value is expressed as an option element. In JSON Schema the array can contain items of any type, including null. The HTML option element can only contain a text string.

Types please

The HTML form constraints API says that the data type for all form components is a text string. That's pretty poor, since it's very likely that data submitted by a form will eventually land in another part of your application, or some kind of data storage, and we're going to need more specific type information. When we have it, we can catch some errors in advance, and in other cases we can automatically convert a value to another type. A natural fit for adding type information to a form component would be in a data attribute called type. Data-type... Wow.

Expressing primitive types is very straightforward:

  • input type="text" data-type="string"
  • input type="number" data-type="number" (through type conversion)
  • input type="checkbox" data-type="boolean" (through the checked attribute)

Things get a bit more complicated for objects and arrays. JSON Schema provides the keywords properties for objects, and items for arrays, which can be nested to express any kind of hierarchical data. However, since the form is the schema, we don't always need to express properties or items, as they will naturally appear from the nesting of form components.


For the sake of being complete, let's at least provide a list of properties that can be expected to appear in an object, or rather, a subform. For arrays it should suffice to supply the minItems and maxItems keywords, which express the expected number of items in an array. Optionally we can supply the uniqueItems keyword that can be used to inform that all items in the array must be unique, but identical values should be rare enough in forms.

Type = Format?

Another inconsistency in HTML forms is that the type attribute actually denotes a format according to JSON Schema. It would be better if this pivotal attribute be renamed to format, while the original would be used to express the actual data type of a value, but since the inconsistency is historical we're stuck with it. When we read format instead of type for input elements, we're pretty close to JSON Schema already.

JSON Schema doesn't limit our imagination considering what formats JSON data can have, so we can use that to our advantage when creating custom form components. So instead of thinking up some custom tag name for a functional web component, you could also just name it input and use the type attribute to semantically express its functionality.

Imports

JSON Schema allows importing schema instances by way of JSON Reference. However, as I've demonstrated in the previous post in this series, form parts may just as easily be imported once the HTML imports specification has become widely adopted.

Unobtrusive javascript!

Now that we've perfectly aligned our constraint attributes with JSON Schema, it's time to start validating our form. We have to write some javascript to do this, but we can finally dance and cheer, because the script is unobtrusive. That's because validation is just an added bonus, and the form will work fine without it.

As I've underlined in part 1 in this series, I consider it good practice to bind events to the top form as much as possible, and use both the type of the event, and the component name or matches(some-css-selector) on the event target, to distinguish relevant situations. This is no different for validation. You can decide if you want to validate only on submission, by inspecting the event type for submit. Or you could inspect the target for a specific form component to validate when a change event is encountered.

For things like styling and custom validation messages, we've got a lot of that built into the browser nowadays. I leave the implementation of custom validation as an exercise for the reader for now, but expect some github-hosted form power from my hand in the near future.

Next up: Forms as state containers Part 4: form generation

On unobtrusive javascript

This is a short rant that was originally part of my HTML forms write-up, but that I took out, because it became too long.

I haven't been writing unobtrusive javascript for at least a decade, and perhaps that was a mistake, but it was also inevitable. In my own defense: it seems HTML is actually quite incoherent... Yup, I just wrote that. I mean, you have tags that denote text levels, similar to those used in text processing software, like paragraph, header, blockquote and (later on) the HTML5 semantic tag set. You also have tags that denote some concrete representational blocks that can be considered part of the document flow, like img and table, or more generic tags that can be used to manipulate document flow, like div or span. The hyperlink is the VIP of the web, and a different beast altogether. Then there's the form, with its controls, which isn't representational at all, but primarily functional.

I think the distinction between representational and functional elements in HTML has been underestimated, and led to the explosion of (very obtrusive) javascript libraries to create class-based widgets. At least they were generic enough, and controllable through a well-described API. The standard that eventually evolved from this, namely web components, is no different. Although it may fill the gap between built-in markup and custom, javascript-enriched bundles of functionality, it's still mainly this functionality we're after, not text layout. Calling a class-based object an HTML element doesn't really change anything.

Finally, there's the argument that we should use meaningful (or semantic) tags. But there's nothing semantic about HTML itself. The semantic tag set that was added to HTML5 denotes text layout, which may be meaningful to browsers and text processing software, but not necessarily to humans. We should take people with disabilities into account when writing web applications, and for that it's helpful to specify the parts of document flow. But it's a formality that can also be controlled through application code. In the worst case, thinking semantically the HTML5 way impairs programmatic control and generics, simply making it harder to automate the process of creating web applications.

Ironically, semantic tagging is no longer the domain of markup, and moves in the direction of data. Yes, we're back where we started: markup and data are again separated. JSON-LD is an initiative that's quite new, but it enables consumers of public web API's to retrieve information in an automated fashion. This means we only have to decide what to automate: or we write programs that generate separate representations for humans and machines, or we write programs that generate a mix of meaning and functionality in HTML, sprinkled with some unobtrusive javascript.

So I'm like, yeah, whatever man, whatever.

Forms as state containers Part 2: managing complex data

This is a small series on HTML forms. The previous post was Forms as state containers Part 1: forms as a single source of truth

JSON?

The value of a form can be seen as a key/value map, a plain object in JSON. From this follows that the value of a subform naturally becomes an object in the parent form. In HTML5 subforms can be expressed as a fieldset element. However, the javascript interface of a fieldset is not a natural fit as a subform, since it only groups elements in the parent form: the elements remain accessible in the parent form. Wrapping the fieldset interface slightly in a thin javascript layer, we can treat fieldsets as a true subform, and we can address its "value" property as an object. In the unlikely event that the form doesn't have access to javascript, we may have have another solution, as we'll see later on.

Repeating values obviously fit naturally to javascript arrays. The "value" property of, for example, a multi-select gets (or sets) the array value of the form component. This is all quite obvious, you might think, and with the advent of JSON you just send your form data as a JSON document to the server with an Ajax request. However, this does not adhere to the age-old adagium of unobtrusive javascript. And although graceful degradation and adherence to standards seems a thing of the past, we may still need to send complex form data over the wire using the built-in mechanisms. Unfortunately, this is not so trivial.

The internet mime type of form data traditionally is by default application/x-www-form-urlencoded, the only other flavor is multipart/form-data. Only these formats are available if you want to send form data to the server using the built-in HTTP methods (AKA "verbs") GET or POST that are available on the HTML form element. There is a solution, however, and it comes from an unlikely place...

Look ma, no JSON!

Originally, PHP was a form processing language for the web, and was even briefly named FI (Form Interpreter). From early on it had a way of handling hierarchical form data on the server. When a field name in a URL-encoded piece of data ends with a matching pair of square brackets, the value is interpreted as an array:

myarray[]=1;myarray[]=2;myarray[]=3

In the receiving PHP script, myarray will contain the following array:

[1,2,3]

If the order is important, indices may be specified:

myarray[2]=1;myarray[1]=2;myarray[0]=3

Translates to:

[3,2,1]

You can even encode key/value maps in post data you send to PHP, by supplying a string instead of an integer within the enclosing square brackets:

mymap[a]=1;mymap[b]=2;mymap[c]=3

On the server, mymap will contain:

{"a": 1, "b": 2, "c": 3}

You've guessed it, you can't encode key/value maps with integers as keys, but that doesn't matter, because you also can't use integers as names for form components.

HTML imports

We successfully resolved the issue of sending complex data to the server, but we didn't even have a use case yet! Let's take the largest challenge I can currently think of head-on: creating an Excel-like tabular data grid for editing rows and columns with arbitrary information. From a form perspective, we have an array of repeatable, yet identical, subform instances, where the values in each subform can be edited by the user. For the moment, I'll ignore properties for individual rows or columns, like custom formatting or data types.

Can we create such a beast using just HTML forms? Well, we could start by introducing a fieldset element, and stating that it should be interpreted as an array:

<form name="datagrid">
  <fieldset name="row[]">
    <!-- this will contain subform components -->
  </fieldset>
</form>

Column names are initially sequential letters starting with A. So for starters, let's just take A to G to create the default subform that will be imported:

<fieldset name="row[]">
  <input name="A" type="text" />
  <input name="B" type="text" />
  <input name="C" type="text" />
  <input name="D" type="text" />
  <input name="E" type="text" />
  <input name="F" type="text" />
  <input name="G" type="text" />
</fieldset>


Eventually we will want to repeat both rows and columns, so we can add them more generically, but to just to drive our point home, I present to you my pure HTML form for editing tabular data:



I didn't even have to use tables. So, how can we repeat subforms without all this copy-pasting I just did? Well, we could just write a simple javascript to repeat rows and columns, but it wouldn't be pure HTML anymore. What if I could at least import the fieldset, as a snippet of code, and insert it into the main form without the aid of custom scripts? At first I thought HTML imports could provide a standard way of doing this, but alas... HTML imports aren't about importing HTML at all! They're for bundling functionality, which requires a lot of javascript. Yikes! However, the idea isn't bad, so I'll just use it as I think it should work. To cut it short, the above snippet is packed as separate form piece, and can be considered the default row.

Once the pieces are put together, we obviously need a way to store actual data, or retrieve what we've already stored before, and populate the form with it. We need some formal way of expressing that we want to retrieve multiple rows, which in terms of data means an array. I usually use the following convention: if the URL contains a query, the result will be an array, and if it doesn't, the URL should contain an identifier, in which case the result will be a key/value map.

So, to insert the data located at /row, we create a query to request an array from the server, and import in into the form:

/row?limit=100

The above URL will just return a hundred rows of data. We could use JSON, as the common conviction is that markup contains a lot more data. But since our intent is to insert subform instances, we should still use plain old HTML, as it's much more predictable. And how much bigger is properly compressed markup really?

Below is a working prototype, provided the links are resolved directly at their current location. In case there's no data saved yet, there's only the empty default row. Once the form is submitted, the data is stored, and in that case there will be one row of data, and again an empty row at the bottom.

<form>
  <link href="/header.html" rel="import"></link>
  <link href="/rows?limit=100" rel="import"></link>
  <link href="/default-row.html" rel="import"></link>
</form>


This is just a very basic prototype, that can be enhanced in many ways, for example with sorting and filtering, operations on single cells, you name it. But before you start hacking right away using your favorite javascript framework of the month, consider starting from traditional HTML forms.

Next up: Forms as state containers Part 3: validation