Friday, October 24, 2014

JSON-M

A Simple JSON Encoding for MUMPS Data


  This post describes a simple data encoding for MUMPS that I'm calling JSON-M. As the name implies, it is based on JSON, a very popular, lightweight data encoding derived from JavaScript objects. JSON was created by Douglas Crockford and stands for JavaScript Object Notation. It is a string representation of JavaScript objects, with stricter formatting rules that suit its use as a universal data interchange format. See http://json.org/ for further details about JSON.

  The M in JSON-M stands for MUMPS, which in turn stands for Massachusetts General Hospital Utility Multi-Programming System. M is also an ANSI standard alternative name for MUMPS. M is a little-known, little-understood language, used heavily in health care and banking software. See http://en.wikipedia.org/wiki/MUMPS for further details about M.

  Now that we know what JSON-M stands for, let's get into the reasons why it is needed. There is really only one data structure in M. It is known as the array, but it is really a tree, and it is most often implemented as some kind of B-Tree. It is also what's known as a sparse data structure, in that it does not require every node to contain data, nor even to exist. One of the most powerful features of M is its full language integration with a persistent storage facility, or database, known as globals. Globals are structured exactly the same as M's in-memory arrays, and they are a first-class primitive of the language. By simply prepending a caret character (^) to an array name, you operate on the database on disk rather than on the symbol table in memory. And since the two are the exact same data structure, the language provides for easy merging of a tree or sub-tree between local (in-memory) arrays and global (on-disk) arrays, as well as local-to-local and global-to-global merging. This tree-like data structure is incredibly powerful, in part because of how dynamic it is. The language doesn't enforce any kind of template or schema, nor any other kind of predetermined structure, on the data. The structure is created on the fly and shaped as needed. Though this kind of dynamic data structure is very flexible, it comes at the cost of having to build your own data dictionary schemes, and having to write and maintain your own indexes, since the language has no built-in database management system.
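
  As a rough illustration of the structure described above, here is a minimal Python sketch (not M code). It models an array as a dictionary keyed by subscript tuples. The names local_array, global_array, set_node, and merge are hypothetical, and merge only loosely imitates what M's MERGE command does; it says nothing about how a real M implementation stores data.

```python
# Hypothetical model, not how any M implementation stores data:
# each node is keyed by its tuple of subscripts, and () is the root node.
local_array = {}    # stands in for an in-memory (local) array
global_array = {}   # stands in for an on-disk global such as ^nodem


def set_node(tree, subscripts, value):
    """Store data at one node; no parent or sibling node has to exist."""
    tree[tuple(subscripts)] = value


def merge(target, target_subs, source, source_subs):
    """Copy the node at source_subs, and every descendant, under target_subs."""
    src, dst = tuple(source_subs), tuple(target_subs)
    for subs, value in list(source.items()):
        if subs[:len(src)] == src:
            target[dst + subs[len(src):]] = value


set_node(local_array, (), "List of colors")
set_node(local_array, ("color", 3), "red")   # sparse: ("color",) holds no data
merge(global_array, ("backup",), local_array, ())
print(global_array)
```

  Note that the root node holds data and has a descendant at the same time, and that nothing was ever stored at ("color",) or at ("color", 1).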

  Your first thought might be, "Why not just use JSON to encode M data?" That is exactly what most people who work with M data in other environments do. However, there is a problem with that. Used in the obvious way, JSON is not rich enough to represent all of the structural elements of an M global. Specifically, it has no obvious way to encode data at a node while also storing the children of that node. JSON is structured as an object containing names, called properties, and their associated values. Data is stored as the value of a property, but child objects are also stored as the value of a property. Objects can be arbitrarily nested, but a property cannot hold two different values at once, one an object representing the child nodes and the other the data stored directly at that node. JSON also supports arrays, which can be nested in objects or in other arrays. Remember that M arrays are structured as a tree, that data is stored at nodes in the tree, each identified by zero or more subscripts, and that the entire structure of a global is acyclic. So any node can have both children that branch off of it and data stored at it. Every part of the language lets you work with, and represent, that potentially complex structure easily, and the way M represents its tree-like data avoids the problem that JSON has.

  Here is an example of some data in M, as displayed by the zwrite command in GT.M:

    ^nodem="List of colors"
    ^nodem("color")=10
    ^nodem("color",1)="black"
    ^nodem("color",2)="white"
    ^nodem("color",3)="red"
    ^nodem("color",4)="green"
    ^nodem("color",5)="orange"
    ^nodem("color",6)="purple"
    ^nodem("color",7)="pink"
    ^nodem("color",8)="yellow"
    ^nodem("color",9)="blue"
    ^nodem("color",10)="brown"

  And here is the same data, encoded using one form of JSON:

    {
      "color": [
        "black",
        "white",
        "red",
        "green",
        "orange",
        "purple",
        "pink",
        "yellow",
        "blue",
        "brown" 
      ]
    }

  Notice that two pieces of data are missing from the JSON output: the value "List of colors" stored at the root of the ^nodem global, and the value 10 stored at the ^nodem("color") node. This is because both sit at nodes that have child nodes, and in JSON the child nodes take the place of the parent's own value. The "color" property hides the data value "List of colors", and the array holding the "color" children hides the data value 10.
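
  To make the loss concrete, here is a small Python sketch of a naive conversion. The (subscripts, value) pair layout and the function names naive_encode and listify are assumptions made for illustration, not part of any existing tool, and the sketch assumes the nodes arrive in zwrite order, with parents before their children.

```python
import json

COLORS = ["black", "white", "red", "green", "orange",
          "purple", "pink", "yellow", "blue", "brown"]

# The zwrite listing modeled as (subscripts, value) pairs; () is the root.
nodes = [((), "List of colors"), (("color",), 10)]
nodes += [(("color", i), name) for i, name in enumerate(COLORS, start=1)]


def naive_encode(nodes):
    """Nest children under their parent's key, with nowhere to keep parent data."""
    tree = {}
    for subs, value in nodes:
        if not subs:
            continue                    # root data has no property to live in
        cur = tree
        for key in subs[:-1]:
            child = cur.get(key)
            if not isinstance(child, dict):
                child = cur[key] = {}   # overwrites any data stored at this node
            cur = child
        cur[subs[-1]] = value
    return listify(tree)


def listify(node):
    """Turn objects keyed 1..n into JSON arrays, and stringify other keys."""
    if not isinstance(node, dict):
        return node
    keys = list(node)
    if keys and keys == list(range(1, len(keys) + 1)):
        return [listify(node[k]) for k in keys]
    return {str(k): listify(v) for k, v in node.items()}


naive = naive_encode(nodes)
print(json.dumps(naive, indent=2))
```

  The two places where data is dropped are marked in the comments: the root value is skipped outright, and the value 10 is overwritten as soon as the first child of "color" arrives.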

  JSON-M was developed as a solution to this problem. It does not require any extra semantic information, nor does it require custom JSON parsers. It will parse using any standards-compliant JSON parser, as long as you follow two simple rules when you write your conversion tools between the two formats. The first rule is that any data stored at a node that has child nodes is encoded using the null key, meaning the empty string (""), as its property name. Note that this is the empty string, which M calls the null string, and not the JSON value null. The null key is stored inside the nested object, but is understood to be the data value at the point where that object resides. In other words, the value of a null property is the data value at its parent's node in M. This works because in standard M, subscripts are not allowed to be a null string (""), because some of the utility functions in the language use the null string to signal the end of the data at a particular branch, or subscript level, in the tree. As a side note, you can set up at least some M implementations to allow null subscripts, but that is never a good idea, and it is rarely, if ever, done in practice. The second rule follows from the first: don't use JSON arrays to represent M data whose subscripts are sequential integers. Arrays in JSON do not use a key or property name to store values, but rather the position of the value in the array, like most arrays do. And, of course, there is no null ("") position in an array.

  Here is an example of the previous data, encoded using JSON-M:

    {
      "": "List of colors",
      "color": {
        "": 10,
        "1": "black",
        "2": "white",
        "3": "red",
        "4": "green",
        "5": "orange",
        "6": "purple",
        "7": "pink",
        "8": "yellow",
        "9": "blue",
        "10": "brown"
      }
    }
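
  The two rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the (subscripts, value) pair layout and the function names encode, collapse, and decode are hypothetical, and a real tool would read the nodes from an M system rather than from a hard-coded list.

```python
import json

COLORS = ["black", "white", "red", "green", "orange",
          "purple", "pink", "yellow", "blue", "brown"]

# The zwrite listing modeled as (subscripts, value) pairs; () is the root.
nodes = [((), "List of colors"), (("color",), 10)]
nodes += [(("color", i), name) for i, name in enumerate(COLORS, start=1)]


def encode(nodes):
    """Build a JSON-M object (nested dicts, never lists) from M nodes."""
    root = {}
    for subs, value in nodes:
        cur = root
        for key in subs:
            cur = cur.setdefault(str(key), {})   # rule 2: always keyed objects
        cur[""] = value                          # rule 1: node data goes under ""
    return collapse(root, is_root=True)


def collapse(obj, is_root=False):
    """Replace objects that hold only "" (childless nodes) with the bare value."""
    if not is_root and set(obj) == {""}:
        return obj[""]
    return {k: v if k == "" else collapse(v) for k, v in obj.items()}


def decode(obj, prefix=()):
    """Yield (subscripts, value) pairs; subscripts come back as strings."""
    for key, value in obj.items():
        if key == "":
            yield prefix, value                  # data at the enclosing node
        elif isinstance(value, dict):
            yield from decode(value, prefix + (key,))
        else:
            yield prefix + (key,), value


json_m = encode(nodes)
text = json.dumps(json_m, indent=2)              # any standard JSON library works
round_trip = list(decode(json.loads(text)))
print(text)
```

  One design question this sketch leaves open is subscript typing: JSON property names are always strings, so decode returns "1" rather than 1, and a real converter would have to decide how to restore numeric subscripts.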

  As you can see, JSON-M is a simple way to encode M data, without losing any of its structure. Hopefully it will be adopted by anyone who writes conversion tools between M and any other systems.