A Quick Drop Into Data Structures For A Minute

So here’s the story, from A to Z…

Well, I’m not going to all the way to Z, but let me lay some details on you.

At Datadog, we provide a nice interface for configuring the Datadog Agent – it’s usually pretty simple to drop some YAML configuration into a file at a specific location, restart the Agent main process, and voilà, you’ve got monitoring.

This gets more complicated when you want to generate a valid YAML file from another system, typically from something like Configuration Management, where you want to take the notion of “Things I know about this particular system” should then trigger “monitor this system with the things I know about it”.

In the popular open source config management system Chef, it is a common practice to create a template of the file you wish to place on a given system, and then extract particular variables to pass to a template ‘resource’, and use those as dynamic values that can make the template reusable across systems and projects, as the template itself can be populated by inputs not included in the initial template design.

Another concept in Chef is the ability to set node ‘attributes’ to control the behavior of recipes, templates and any amount of resources. This has pros and cons, neither of which I will attempt to cover here, but suffice it to say that the pattern is well-established that if you want to share your resources with others, having a mechanism of “tweaking the knobs” of your resources with attributes is a common way of doing it.

In the datadog cookbook for Chef, we provide an interface just like this. An end user can build up a list of structured data entries made up of hash objects (or maps or dicts, depending on your favorite language), and then pass that into a node object, and expect that these details will be rendered into a configuration file template (and restart the service, etc).

This allows the end user to take the code, not modify it at all, and provide inputs to it to receive the desired state.

Jumping further into Chef’s handling of node attributes now.

== Attribute
Attribute implements a nested key-value (Hash) and flat collection
(Array) data structure supporting multiple levels of precedence, such
that a given key may have multiple values internally, but will only
return the highest precedence value when reading.

Attributes are subclassed of the Mash object type – which has some cool features, like deep-merging lower data structures – and then attributes are compiled together to make collections of these node attribute objects, which are then “frozen” into another class type named Chef::Node::ImmutableArray or Chef::Node::ImmutableHash to prevent further mucking around with them.

All this is cool so far, and is really useful in most cases.

In my case, I want to allow the user to provide the data needed, and then have the data written our, or deserialized, into a configuration file, which can then be read by the Agent process.

The simple way you might think to do this is to tell the YAML module of Ruby’s standard library (which is actually an alias to the Psych module) to emit the structured YAML and be done with it.

In an Erubis (ERB) template, this would look like this:

<%= YAML.dump(array_of_mash_data) %>

However, I’d like to inject a header to the array before rendering it, so I’ll do that first:

<%= YAML.dump({ 'instances' => array_of_mash_data }) %>

What this does is render a file like so:

---
instances:
- !ruby/hash:Mash
  host: localhost
  port: 9999
  extra_key: extra_val
  conf:
  - !ruby/hash:Mash
    include: !ruby/hash:Mash
      domain: org.apache.cassandra.db
      attributes:
      - BloomFilterDiskSpaceUsed
      - Capacity
      foo: bar
    exclude:
    - !ruby/hash:Mash
      domain: evil_domain

As you can see, there’s these pesky lines that include a special YAML-oriented tag that start with exclamation points – !ruby/hash:Mash – these are there to describe the data structure to any YAML loader, saying “hey, the thing you’re about to load is an instance of XYZ, not an array, hash, string or integer”.

Unfortunately, when parsing this file from the Python side of things to load it in the Agent, we get some unhappiness:

$ sudo service datadog-agent configcheck
your.yaml contains errors:
    could not determine a constructor for the tag '!ruby/hash:Mash'
  in "<byte string>", line 7, column 5

So it’s pretty apparent that I can do one of two things:

  • teach Python how to interpret a Ruby + Mash constructor
  • figure out how to remove these from being rendered

The latter seemed most likely, since I didn’t really want to teach Python anything new, especially since this is really a Hash (or a dict, in pythonese).

So I experimented with taking items from the Mash, and running them through a built-in method to_hash – which seemed likely to work.

Not really.

<%= YAML.dump({ 'instances' => @instances.map { |item| item.to_hash }}) %>

That code only steps into the first layer of the data structure and converts the segment starting with host: localhost into a Hash, but the sub-keys remain Mash objects. Grr.

Digging around, I found other reported problems where people have extended Chef objects with some interesting methods trying to solve the same problem.

This means that I’d have to add library code to my project, then modify the template renderer to make the helper code available, then tell the template to render it using these subclassed methods, and then have to worry about it.

ARGH.

Instead, I tried another tactic, which seems to have worked out pretty well.

Instead of trying to walk any size of a data structure and attempt to catch every leaf of the tree, I turned instead to another mechanism to “strip” out the Ruby-specific data structure details, and keep the same structure, so I used the ol’ faithful – JSON.

By using built-ins to convert the Mash to a JSON string, then have the JSON library parse it back into a datastructure, and then serialize it to YAML, we remove all of the extras from the picture, leaving us with a slightly modified ERB method:

<%= JSON.parse(({ 'instances' => @instances }).to_json).to_yaml %>

I then took to benchmarking both methods to see if there would be any significant impact on performance for doing this. Details are over here. Short story: not much impact.

So I’m pretty happy with the way this turned out, and even if I’m moving objects back and forth between serialization formats, the end result is something the next program (Datadog Agent) can consume.

Hope you enjoyed!