Dan Newcome, blog

I'm bringing cyber back

Text transformation with JSON and regular expressions

leave a comment »

Ever since I wrote the Jath Javascript XML processing library I’ve been thinking about ways to declaratively transform various things to JSON.

Perhaps not-so-coincidentally, I’ve been talking about ways to update the tools we use to pass around data structures rather than text blobs lately.

Since I wrote that post, I’ve been toying with small stepping stones that would get us closer to realizing a more “functional” experience in the Linux shell without ripping everything out and starting over. A start is to have a way to get the output of common shell commands and parse it into something like JSON.

As I said at the top of this post, text transformations have been something I’ve been playing with in various forms for a while now, so I dug up a project that I started back when I was working on Jath that does essentially what Jath does but with regular expressions instead of XPath queries. The idea is to provide a template that transforms plain text to JSON using regexps as the selectors.

For those of you not familiar with Jath, it uses a template like this:

var template = [ "//status", { id: "@id", message: "message" } ];

To turn this:

<statuses userid="djn">
    <status id="1">
        <message>Hello</message>
    </status>
    <status id="3">
        <message>Goodbye</message>
    </status>
</statuses>

into JSON like this:

[ 
    { id: "1", message: "Hello" }, 
    { id: "3", message: "Goodbye" } 
]

Keeping that same idea in mind, let’s look at the output of a common Unix utility, ifconfig:

dan@X200:~/$ ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1f:16:15:2e:b1  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:20 Memory:f2600000-f2620000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:550973 errors:0 dropped:0 overruns:0 frame:0
          TX packets:550973 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:70904676 (70.9 MB)  TX bytes:70904676 (70.9 MB)

I’d like to turn this into something like the following:

[ 
    { "Link encap": "Ethernet", "RX packets": "0" }, 
    { "Link encap": "Local Loopback", "RX packets": "550973" } 
]

As a first approximation a template for the above transformation would look something like this:

[ "\n\n", {
        "RX packets": /RX packets:([^\s]*)/,
        "Link encap": /Link encap:([^\s]*)/ 
} ]

I have a little proof-of-concept code that works using the above template, but I’m thinking that the template format could be improved. It seems like a bit much to ask to produce such ugly regular expressions for the template selectors.

Also, maybe we’d want support for a hash object instead of an array. This is something that Jath can’t really do and I never thought about supporting it before. That is, the result collection might look like this:

{
    "eth0": { "Link encap": "Ethernet", "RX packets": "0" }, 
    "lo": { "Link encap": "Local Loopback", "RX packets": "550973" } 
}

Maybe a template format for a hash like this is some kind of glob-style expansion for the key name:

{ "/\b(.+?)\b[\S\s]*\n\n"/g: {
        "RX packets": /RX packets:([^\s]*)/,
        "Link encap": /Link encap:([^\s]*)/ 
} }

The regular expression for finding the keys is a little rough in this case. In order to make this work we are relying on a global regex that we can keep calling exec() on to find the next match until they are exhausted. We only want the capture group, so this would look something like the following:

while( var res = re.exec(input)[1] ) {
    push returnval( res );
}

This stuff is rough for the moment, but I think it could be an interesting way to process arbitrary textual command output in a declarative way by mostly re-using existing solutions like regular expressions.

Advertisements

Written by newcome

March 13, 2012 at 1:24 am

Posted in Uncategorized

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: