Pour One Out for Bourbon 🥃

TL;DR: Thank You, thoughtbot

This post is a big ol’ Thank You to the team at thoughtbot (a software consultancy) for their generosity in creating and maintaining open source software projects over the lifetime of a project I work on a fair amount these days – PyPI.org, the largest free repository of Python packages.

Open Source Software (OSS) can be a force-multiplier for any software development project, as it reduces the need for the developer to write all of the code necessary to finish a task/project/website/etc. The motivation to share software as free and open for others to use can differ wildly from one person to the next, but the fact that anyone does is much appreciated, thanks again!

History Class

I probably first became aware of thoughtbot back in 2013, when Macklemore & Ryan Lewis’ Thrift Shop was all the rage and I was building up Datadog’s operations team and practices. We used Opscode Chef to manage the ever-growing complexity of our fleet of cloud instances. The earliest reference I can still dig up is my own blog post, The Importance of Dependency Testing (generally still good advice!), and the super helpful tools and blog posts they wrote to help folks understand not only the code they shared, but how they used it, and why.
I incorporated one of their tools into one of my own tools’ testing workflows, to help users of my tool have a better experience. It worked as advertised, and I used their tool again in early 2014 to help do the same for more “official” open source software.

While many of us have moved on from Chef software, I’ll never forget the #ChefFriends we made along the way. 🥰

Narrator: Mike wasn’t part of this next bit, the details might be a bit wrong, sorry!

In 2015, the Python Software Foundation (PSF) was engaged in rebuilding PyPI to meet new demands, launching in 2018. Beyond the new underlying functionality, Nicole Harris led the user interface redesign. You can still see her mock prototype and its inner workings – most of which remains active in PyPI today.

A couple of OSS libraries were copied into the codebase back then – bourbon and neat – two projects that made writing Sass (SCSS) style-related code a little easier – both of which are now deprecated. PyPI briefly included a third project, bitters, but removed it prior to production launch as it wasn’t needed.
These tools were especially important during the second browser war: they gave developers simpler ways to create layouts and to mix in functionality that improved browser support, leading to a better end-user experience despite the lack of feature parity across browsers.

PyPI’s new design needed a grid layout, which wasn’t a standard yet, and neat provided a clean way to make a grid layout in CSS work properly for browsers of that era (Chrome 10+, Firefox 29+, Internet Explorer 9+, Safari 5.1+, et al). If I’m reading the source correctly, neat had functions to create visual tables and place items correctly. bourbon provided behaviors (mixins and selectors) like @clearfix and @ellipsis, utilities that prevent the need to repeat a given set of statements. Both libraries let PyPI work well across browsers without having to write a lot of custom code.

Back in 2015 this amounted to ~7,000 lines of code, and thanks to thoughtbot’s sharing of libraries they developed internally to help their clients, PyPI got a good user interface, which launched to the public in 2018.

Getting into the game

I started contributing to PyPI’s codebase in 2021, with typo fixes, documentation updates, and development environment improvements – contributions I’ve found are usually welcome in OSS projects, as they are easily overlooked by the current maintainers, who are often busy with other things and miss when an instruction is wrong. I am still occasionally tickled by the fact that my path to working at the Python Software Foundation started with writing JavaScript and CSS.

Some of my early work was dropping support for Internet Explorer 11, and as I learned more about the codebase, I spotted the bourbon/neat inclusions. In early 2023 I mentioned that these were no longer maintained as of late 2019. PyPI was using bourbon 4.2.5 and neat 1.7.2 at that time, both released in early 2015.

Over time, neat 2.0 changed its external-facing API, and bourbon 5.0 did the same for some functions. The way bourbon and neat were included in the repository – by copying the source rather than via an installation package, commonly known as “vendoring” – made updates a little harder to reason about, especially since minor changes had been made to the vendored code over time. (No criticism of the vendoring method, it worked with the tools and approaches of that era!)

PyPI’s versions were effectively “stuck” in 2015, and what was there worked well, but didn’t take advantage of the evolving browser standards that changed the need for these libraries over time.

Do the thing

I recently decided to spend a weekend day to see how much I could remove/replace to continue to keep PyPI style code modern, working, and maintainable.

I learned that the only functionality PyPI still used from neat was @span-columns(), and that could be replaced with a relatively small custom mixin using native CSS Grid (specifically grid-template-columns, grid-column, and column-gap), all widely available since 2018. Replacing ~1,100 lines of neat with this SCSS was pretty straightforward:

$grid-columns: 12;

@mixin grid-container {
  display: grid;
  grid-template-columns: repeat($grid-columns, 1fr);
  column-gap: $half-spacing-unit;
}

An example of using the replacement mixin

The resulting compiled CSS produced effectively the same visual behavior, but now uses native CSS styles and is identified by browsers as using grid syntax, which opens up the ability to further transform to more responsive layouts in the future if needed.

On the bourbon side, thoughtbot had produced a Migrating from v4 to v5 guide in 2018, which included helpful suggestions on replacements or adaptations. In the intervening years, bourbon was updated to version 7.3.0, and then finally deprecated, and thoughtbot again shared a guide for replacing it with alternative approaches earlier this year.

Using this guide, and now more familiar with the SCSS code used, I was able to update bourbon to the final version, and then replace the remaining functionality still in use with adequate alternatives, removing another ~4,400 lines of code from the codebase.

Finishing the job

Testing website visual layouts is a whole discipline on its own, and PyPI doesn’t have a test suite that covers visual layouts, but we have a pretty robust local development environment. I used that to validate that each change I made had the desired effect as I was making it. Thankfully, most of the usages were hyper-specific to a given page, so there were only a handful of places to confirm manually.

Once I was done with all my changes, I captured the compiled distribution CSS file and ran a semantic difference checker against the versions from before and after the changes. The open source tool I used was difftastic, which was able to point out the differences between the two compiled, minified CSS files, and help me confirm that the only changes were the ones I expected. Thanks Wilfred!

Once done, I opened a pull request for review, and it made it out to production with no issues. 🥳

What have we learned, Mike?

Well, first and foremost, open source is continuing to grow, so big thanks to any and all who share their code with the world and take the time to maintain. Truly, we are stronger together.

Let’s give a round of applause to the team at thoughtbot. While I generally write less Ruby/JavaScript/CSS these days, and their code is no longer inside PyPI, I know that when I come across these folks’ tools or libraries, they’re likely to be a cut above others on the quality spectrum. Their approach is what I’d consider exemplary in how to help users navigate changes to open source from “cradle to grave” – providing essential, useful software for others, showing them how to use it, update it, and finally retire it from use. PyPI benefited from these projects for ~9 years, until they were no longer needed, so we salute them! 🫡

And finally, consider this a call to action to anyone out there to get involved in open source. For some ideas and approaches, check out the chat I had with my fellow AWS Hero Chris Williams last year on vBrownBag.

Thanks for reading, now go out and get involved!

Reduce AWS Lambda Latencies with Keep-Alive in Python

It starts, as many stories do, with a question. On September 10th, AWS Serverless Hero Luc van Donkersgoed shared his observations on the relationship of reduced latency with increased request rate when using AWS Lambda. This is always an interesting conversation, and sure enough other AWS Heroes like myself are curious about some of the outlier behaviors, and what exactly is going into each request. AWS Data Hero Alex DeBrie and AWS Container Hero Vlad Ionescu both ask excellent questions about the setup and the behaviors, leading Luc to share what he’s seeing with regards to DNS lookups that don’t make sense to him.

After asking a couple of more questions of my own, I rolled up my sleeves and dug into the what, how, and why.

Good to know, helps isolate variables. I'll poke at this a little later to see if I can derive any conclusions.
— Mike Fiedler, Code Gardener (@mikefiedler) September 11, 2022

getting ready to read things and hit them with sticks

I dive into all parts of the stack in use to try and understand why Luc’s code is seeing DNS lookups.
For example, if your function needs to call AWS S3 or a Twilio API, we usually provide the domain name, and have the code or library perform a request to a Domain Name System (DNS) server to return the current IP address, and then communicate using the IP address. This is a network call and can be expensive (in milliseconds) if it’s performed more frequently than the DNS response’s Time To Live (TTL) – kind of like an expiration date. The DNS lookup adds some more latency to your overall call, which is why many systems will cache DNS responses until the TTL is expired, and then make a new call. If you perform DNS lookups when not needed, that’s adding latency unnecessarily. Read the tweet thread for more!
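To make the caching idea concrete, here is a minimal sketch of a lookup cache that only calls the resolver again once an entry has expired. It is an illustration rather than how any particular library does it: the TTLDnsCache class is hypothetical, and it assumes a single fixed TTL instead of reading the TTL from each DNS response (which socket.gethostbyname does not expose).

```python
import socket
import time


class TTLDnsCache:
    """Cache hostname lookups until a fixed TTL expires (illustrative only)."""

    def __init__(self, ttl_seconds=60.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # assumption: one TTL for every entry
        self._resolver = resolver       # function: hostname -> IP string
        self._clock = clock             # injectable so expiry can be exercised
        self._entries = {}              # hostname -> (ip, expires_at)
        self.lookups = 0                # number of real (uncached) lookups

    def resolve(self, hostname):
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]            # still fresh: no network call
        ip = self._resolver(hostname)   # expired or unknown: pay the latency
        self.lookups += 1
        self._entries[hostname] = (ip, now + self.ttl_seconds)
        return ip
```

Repeated calls to resolve() inside the TTL window return the cached address, and only the first call after expiry pays for a new lookup.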

I arrive at two possible solutions:

If the Python code calls more than 10 AWS service endpoints, it will trigger a DNS lookup, as urllib3‘s PoolManager will only maintain 10 connections (set by botocore defaults) and will need to recycle if exceeded.
Since we’re unlikely to be hitting the limit of 10, something else is at play.
I found that the default behavior of boto3 is to not use Keep Alive, thus explaining why the occasional connection is reset, triggering a DNS lookup. (Read the tweet thread for the full discovery.)
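The first possibility, a bounded set of pools that recycles the oldest entry once the limit is exceeded, can be pictured with a small least-recently-used container. This is a sketch of the general idea only, not urllib3’s actual implementation; LruPools and the dictionary standing in for a connection pool are hypothetical.

```python
from collections import OrderedDict


class LruPools:
    """Keep at most `max_pools` per-host entries, evicting the least recently used."""

    def __init__(self, max_pools=10):
        self.max_pools = max_pools
        self._pools = OrderedDict()  # host -> placeholder "connection pool"
        self.evicted = []            # hosts whose pools were dropped

    def get(self, host):
        if host in self._pools:
            self._pools.move_to_end(host)  # mark as most recently used
            return self._pools[host]
        pool = {"host": host}              # stand-in for a real connection pool
        self._pools[host] = pool
        if len(self._pools) > self.max_pools:
            old_host, _ = self._pools.popitem(last=False)  # drop the oldest
            self.evicted.append(old_host)
        return pool
```

With a limit of 10, an eleventh distinct host pushes out the oldest one, and the next request to that evicted host has to set up a fresh connection, DNS lookup included.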

Using Keep-Alive is nothing new, and was covered quite well by AWS Serverless Hero Yan Cui back in 2019 for Node. It’s even in the official AWS Documentation, citing Yan’s article for the proposed update. Thanks Yan!

There’s precious little literature on using Keep Alive for Python Lambdas that I could find, leading to issues like Luc’s and reports like this one, so I decided to dig a little further. Knowing now that Keep Alive is off by default for users of boto3, the popular package for interacting with AWS services, I wanted to explore what that looks like in practice.

I decided to pattern an app after Yan’s example – a function that receives an event body, and persists it to DynamoDB. All in all, not too complex an operation – we perform a single DNS lookup for the DynamoDB service endpoint, and then use the response IP address to connect over HTTP to put an object into the DynamoDB table.

After re-writing the same function in Python, I was able to test the same kind of behavior that Yan did, running a call to the function once per second, isolating any concurrency concerns, replicating Luc’s test. This has the benefit of reusing the same Lambda context (no cold starts), and shows latencies ranging from 7 to 20 milliseconds for the same operation:

filtered log view showing only the latency for `put_item` calls to DynamoDB for 30 seconds

So far, so good – pretty much the same. The overall values are lower than Yan’s original experiment, which I attribute to the entire Lambda ecosystem improving, but we can see there’s variance and we often enter double-digit latencies, when we know that the DynamoDB operation is likely to only take 6-7 milliseconds.

left side shows spiky responses; right side shows most responses are fast, with some slower outliers

As Yan showed in his approach adapted from Matt Levine’s talk snippets, he was able to reconstruct the AWS Config by rebuilding the lowest-level HTTP agent that the library relies on to make the calls, and thereby set the behavior for Keep Alive. This has since been obsoleted by the AWS Node.JS SDK adding an environment variable to enable the keep alive behavior, which is awesome! But what about Python? 🐍

In the recent release of botocore 1.27.84 we can modify the AWS Config passed into the client constructor:

# before:
import boto3
client = boto3.client("dynamodb")

# after:
import boto3
from botocore.config import Config
client = boto3.client("dynamodb", config=Config(tcp_keepalive=True))

With the new configuration in place, if you try this on AWS python3.9 execution runtime, you’ll get this error:
[ERROR] TypeError: Got unexpected keyword argument 'tcp_keepalive'

While the AWS Python runtime includes versions of boto3 and botocore, they do not yet support the new tcp_keepalive parameter – the runtime currently ships:
– boto3 1.20.32
– botocore 1.23.32

So we have to solve this another way.

The documentation tells us that we can configure this via a config file in ~/.aws/config, added in version 1.9.17 back in October 2018 – presumably when all the Keep Alive conversations were fresh in folks’ minds.

However, since the Lambda runtime environment disallows writing to that path, we can’t write the config file easily. We might be able to create a custom Docker runtime and place a file in the path, but that’s a bit harder, and we lose some of the benefits of using the AWS prebuilt runtime like startup latency, which, in a latency-oriented article, seems like the wrong choice 😁.

Using the Serverless Framework CLI with the serverless-python-requirements plugin (what I’m currently using), or AWS SAM, you can add the updated versions of boto3 and botocore, and deploying the updated application allows us to leverage the new setting in a Lambda environment. You may already be using one of these approaches for a more evolved application.
Hopefully 🤞 the Lambda Runtime will be updated to include these versions in the near future, so we don’t have to package these dependencies to get this specific feature.

With the updated packages, we can pass the custom Config with tcp_keepalive enabled (as shown above), and observe more consistent performance for the same style of test:

left: much smoother!! right: narrower distribution of values, max 8.50 ms

There’s an open request for the config value to be available via environment variable – check it out and give it a 👍 to add your desire and subscribe via GitHub notifications.

Enjoy lower, more predictable latencies with Keep Alive!

Check out the example code here: https://github.com/miketheman/boto3-keep-alive

Postscript: If you’re interested in pinpointing calls for performance, I recommend checking out Datadog’s APM and associated ddtrace module to see the specifics of every call to AWS endpoints and associated latencies, as well as other parts of your application stack. There’s a slew of other vendors that can help surface these metrics.

Fixing unintended consequences of the past

In the age of technology, everyone races forward to get the win. Anything that can provide you the competitive edge is considered important.
This is especially true in the realm of web media, where optimizing for page load times, providing secure transport, adhering to standards can make a difference in how a site is handled by client browsers, ranked by search engines, and most importantly how it is seen by viewers.

To this end, there are many sites, services and companies that will provide methods to audit a site and point out what could be problematic – count broken links, produce reports of actionable corrections, and more.
Some are better than others, and occasionally, you’ll come across something you’ve never seen before.

Recently, I was pinged about pages on a site that is hosted on an Amazon Simple Storage Service (S3) website-enabled bucket.
Since S3 is an object store only, this means that the pages in this site are statically generated and there is no associated web server, backend database, or other components to serve the pages.

This model is becoming more common for sites that can be simplified to run with no dynamic loading of data from a database, withstand heavy bursts of requests, as well as run cheaply (there’s even a free tier, beyond which pricing still remains affordable).

The idea is that you create your content in one format, run a compiler process to generate all the rendered files containing the links and content, and then upload the compiled files to the S3 location to be requested by browsers. There are many guides on the web on how to do this – I’m not going to link to any now, search and ye shall find.

This particular site had been deployed since 2011 – and the mechanism to copy compiled files to S3 has been using the popular open source command line tool s3cmd – deployment basically looked like this (and still does!):

 s3cmd sync output/ s3://www.mysite.com

where output/ contains the compiled files, ready for deployment.

This has worked very well for over 4 years – until it came to my attention that when uploading to S3, the s3cmd tool was adding some metadata to each file as it uploaded it, as part of the design to support website hosting on S3.

For instance, when uploading a .css file to S3, s3cmd attempts to determine extra details about the file, and set the correct metadata for browsers to understand, such as Content-Type: text/css.
This is a critical function, as it would be difficult to take the time to determine each file’s content type, set that manually, across many files.
You can read more about content media types on Wikipedia.

Since this project was set up a long time ago, the version of s3cmd used was still in alpha stage – and it was used because it performed well enough, and nothing broke, so we were happy to continue running with the same version since early 2013.

The problem reported to me was that many files on the site were returning an invalid Content-Encoding value. This typically hasn’t been a problem in practice. The client’s browser sends an Accept-Encoding header when making a request, and the exchange goes something along the lines of:

Browser: Hi there! Can I have this resource, and I'll accept a response encoded in the following formats: a, b, or c
Server: No problem! Here's the resource you're looking for, with a content encoded in XYZZY

Now, the XYZZY in this example was being set by the s3cmd upload process, and it was determined to be a bug and fixed in late 2013, but since we never knew about the problem, and the site loads just fine, we never addressed it.
There have been even more stability fixes and releases of s3cmd since – as recently as February 2015.

The particular invalid encodings being set were UTF-8 and ANSI_X3.4-1968. While these are valid encodings for files, they are invalid values for the Content-Encoding field.
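To illustrate the distinction, here is a small sketch that flags Content-Encoding values that are not content codings. The set of accepted names is an assumption: a subset of commonly registered HTTP content codings, not a complete list, and the invalid_content_codings helper is hypothetical.

```python
# Assumed subset of registered HTTP content codings. Character sets such as
# UTF-8 describe how text is encoded and belong in Content-Type's charset
# parameter, not in Content-Encoding.
VALID_CONTENT_CODINGS = {
    "gzip", "x-gzip", "compress", "x-compress",
    "deflate", "br", "zstd", "identity",
}


def invalid_content_codings(header_value):
    """Return the codings in a Content-Encoding value that are not recognised."""
    codings = [part.strip().lower() for part in header_value.split(",")]
    return [c for c in codings if c and c not in VALID_CONTENT_CODINGS]
```

Both of the values seen on the site would be reported by this check, while a normal value such as gzip would pass.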

Here’s an example of how to show the headers of a particular remote file:

$ curl -sI http://www.mysite.com/static/css/style.css | grep Content
Content-Encoding: ANSI_X3.4-1968
Content-Type: text/css
Content-Length: 7073

Many modern browsers will send something along the lines of ‘Accept-Encoding: gzip, deflate, sdch‘ in their request header, in hopes that the server can respond with one that matches, and then save on overall bytes sent over the wire, to speed up pages.

It’s the responsibility of the client (browser) to handle the response. I looked into the source code of Chromium (the basis for Google Chrome), and can see from here that in my example above, a Content-Encoding type of XYZZY will pretty much be ignored, which in this case, is fine, since we’re sending an invalid type.

So if there’s no direct user impact, why should we care? Well, according to some popular ranking engines:

Using non-HTML content types for landing pages results in significantly reduced SEO ranking.

So all of this is fine, cool – update s3cmd tool to a newer version, and upload the output files again? Well, it’s not that simple.

During a sync operation, s3cmd determines which files have changed, and only uploads those. Since the files themselves haven’t changed, it doesn’t reset the object metadata, as doing so would basically mean creating a new object.

One solution might be to edit every file, add an extra space somewhere – maybe an extra blank line at the end – then compile, deploy the changed files – however this might take too long.

Instead, I decided to solve the problem by iterating over every object in a bucket, checking to see if it had the incorrect Content-Encoding set, and creating a new copy of the file without the header set.

This was pretty straightforward, once I understood the concept of object immutability – once written, you can’t change it, rather what feels like a change from a user interface actually creates a new version of the object with the new settings/metadata.

I also didn’t want to have to download each file locally and then upload it back to S3 – that is a slow operation, and could result in extra network traffic and disk space consumption.

Instead, I used the AWS SDK for Ruby gem, and came up with a short-and-sweet solution:

The code aims to be short and sweet, and sure enough, post-execution, we get the response without the offending header:

$ curl -sI http://www.mysite.com/static/css/style.css | grep Content
Content-Type: text/css
Content-Length: 7073
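For readers who want to try the same approach, here is a hedged sketch of the technique in Python. It is not the original Ruby script: the strip_invalid_content_encoding function is hypothetical, and `client` is assumed to behave like boto3.client("s3"). It copies each affected object onto itself server-side, replacing the metadata and leaving out Content-Encoding.

```python
def strip_invalid_content_encoding(client, bucket,
                                   bad_values=("UTF-8", "ANSI_X3.4-1968")):
    """Copy each affected object onto itself, omitting Content-Encoding.

    `client` is assumed to behave like boto3.client("s3").
    Returns the list of keys that were rewritten.
    """
    fixed = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            head = client.head_object(Bucket=bucket, Key=key)
            if head.get("ContentEncoding") not in bad_values:
                continue  # header absent or already valid: leave it alone
            # MetadataDirective="REPLACE" makes S3 write a new object
            # server-side with only the metadata given here, so nothing is
            # downloaded. Other headers (for example Cache-Control) are not
            # carried over in this sketch and would need to be passed too.
            client.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={"Bucket": bucket, "Key": key},
                MetadataDirective="REPLACE",
                ContentType=head.get("ContentType", "binary/octet-stream"),
                Metadata=head.get("Metadata", {}),
            )
            fixed.append(key)
    return fixed
```

With boto3 installed and credentials configured, this might be called as strip_invalid_content_encoding(boto3.client("s3"), "www.mysite.com"); trying it on a test bucket first would be prudent.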

This swift diagnosis and resolution would not have been possible had the tooling being used not been open source, as many times I was trying to figure out why something behaved the way it did, and while not being familiar with the code, I could reason enough about how things work in general to apply that reasoning on how I should implement my resolution.

Support open source where possible, and happy hunting!

Tracking application performance on Heroku with Datadog

I thought about using a clickbait title – “You’ll never believe how this guy captures metrics!” – but decided that 99% of these are not worth the time invested in coming up with the catchy title.

So instead, I’ll simply talk about what I wanted to, and you be the judge of my title.

Application Performance Monitoring, or APM, is a crazily complex landscape, with an enormous amount of tooling, terminology, and providers looking to get some piece of the action.
There are many vendors, and all have their advantages, as well as disadvantages.

The vendor that I am pretty happy with (and I now work there) is Datadog.

One solution that has caught on quite well for surgical application monitoring is the use of the statsd protocol to send metrics from inside your application to a listener which can then store these metrics for querying later on. This is achieved by placing strategic “emitter” callouts in your code so that they can report metrics during runtime.

Flickr, and then Etsy, started these projects, and they have been refined, ported to most languages, and are seeing adoption in companies where a focus on measuring is an important goal.
A blog post on Datadog’s implementation and extension of Statsd was written last year and goes into deeper detail.
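To give a feel for what an “emitter” callout does, here is a minimal sketch of the statsd wire format sent over UDP. The format_metric and emit helpers and the metric names are hypothetical, and the host and port assume a listener on the conventional local port 8125; a real application would more likely use an existing client library.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build one statsd datagram: "<name>:<value>|<type>", plus optional tags."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        # The "|#tag1,tag2" suffix is the DogStatsD extension for tags.
        payload += "|#" + ",".join(tags)
    return payload


def emit(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget the metric over UDP to a local statsd/dogstatsd listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))
```

A counter increment could then be sent with emit(format_metric("web.page_views", 1)). Because UDP is connectionless, the application does not wait on the listener to receive the metric.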

One common question has always been “How do I collect metrics from an application running on Heroku with Datadog?”.

And I think we finally have one answer.

The Heroku Dyno container is pretty simple – you wanna run a process? Describe it in a Procfile.
You wanna scale? You tell Heroku to launch more Dynos with the process name, as specified in the Procfile.

However, the actual Dyno is a fairly limited environment by design – the root filesystem is read-only, the only writable area is in the application’s root directory, and disappears when terminated. There’s no sysvinit, upstart or systemd for people to bicker about. Use a Procfile, which is also really simple.

So a challenge to overcome became: “how to install a Datadog Agent package that runs a dogstatsd listener as a second process, inside an environment that is pretty locked down?”

First, we have to install the package. Heroku has a concept of “buildpacks” (https://devcenter.heroku.com/articles/buildpacks) that can be used to run compilation steps before adding your application code and launching it. The use of multiple buildpacks is also available, to chain steps together to achieve the desired outcome.

I read the heroku-buildpack-apt and found a bunch of good ideas, and came up with a Datadog-Agent-specific installer buildpack that drops off the package, as well as the needed environment for the runtime.

Now how do I run the listener process alongside my application?

Enter foreman. Foreman, not to be confused with “theforeman”, has long been a great way for application developers writing Heroku-targeted applications to run them locally in a manner similar to how they will be run on the remote platform.

Foreman reads the Procfile, and runs the processes based on the directives contained inside.

This feature is the one that we leverage to run multiple processes on a single Dyno.

By using foreman inside the Dyno, we are able to tell foreman to run more than one process type at a time, with another Procfile that specifies the startup process for the actual application as well as the dogstatsd listener.

When deploying any code revision, Heroku will read the base Procfile, and run a foreman process inside the Dyno, which will in turn, start up the app & dogstatsd.

And while foreman is a Ruby gem, your project may be in Python (use honcho), Go (use forego or goreman) and I’m sure there are others out there. I haven’t found or tested all of them, tell me if they work out for you.

I did, however, take the time to write up a README with the procedure to follow to use this, as well as a commit-by-commit example application.

Here’s the buildpack code: http://miketheman.github.io/heroku-buildpack-datadog/

Here’s the example application: https://github.com/miketheman/buildpack-example-ruby

Here’s an image of the stats collected by the example application in Datadog, with increasing web load:

Here’s a random dog:

Hope this helps you find deeper insight into how you monitor your applications!

Update (2014-12-15)

A quick addition on this topic.

A couple of days after this was published, I had a short Twitter exchange with Bo Jeanes, after which he submitted a Pull Request to the buildpack, (as well as an update to the example app).
This simplifies the end-user’s deployment of the Agent package: the user no longer has to spend any time on Procfile-in-Procfile solutions, and there is no longer any need for foreman and the like inside the container. Instead, the dogstatsd process is started via the profile.d mechanism, which is run on Dyno startup.

This makes the solution even more elegant, so thanks a ton, Bo!

A Quick Drop Into Data Structures For A Minute

So here’s the story, from A to Z…

Well, I’m not going to go all the way to Z, but let me lay some details on you.

At Datadog, we provide a nice interface for configuring the Datadog Agent – it’s usually pretty simple to drop some YAML configuration into a file at a specific location, restart the Agent main process, and voilà, you’ve got monitoring.

This gets more complicated when you want to generate a valid YAML file from another system, typically something like Configuration Management, where the notion of “Things I know about this particular system” should then trigger “monitor this system with the things I know about it”.

In the popular open source config management system Chef, it is a common practice to create a template of the file you wish to place on a given system, and then extract particular variables to pass to a template ‘resource’, and use those as dynamic values that can make the template reusable across systems and projects, as the template itself can be populated by inputs not included in the initial template design.

Another concept in Chef is the ability to set node ‘attributes’ to control the behavior of recipes, templates and any amount of resources. This has pros and cons, neither of which I will attempt to cover here, but suffice it to say that the pattern is well-established that if you want to share your resources with others, having a mechanism of “tweaking the knobs” of your resources with attributes is a common way of doing it.

In the datadog cookbook for Chef, we provide an interface just like this. An end user can build up a list of structured data entries made up of hash objects (or maps or dicts, depending on your favorite language), and then pass that into a node object, and expect that these details will be rendered into a configuration file template (and restart the service, etc).

This allows the end user to take the code, not modify it at all, and provide inputs to it to receive the desired state.

Jumping further into Chef’s handling of node attributes now.

== Attribute
Attribute implements a nested key-value (Hash) and flat collection
(Array) data structure supporting multiple levels of precedence, such
that a given key may have multiple values internally, but will only
return the highest precedence value when reading.

Attributes are subclasses of the Mash object type – which has some cool features, like deep-merging lower data structures – and then attributes are compiled together to make collections of these node attribute objects, which are then “frozen” into another class type named Chef::Node::ImmutableArray or Chef::Node::ImmutableHash to prevent further mucking around with them.

All this is cool so far, and is really useful in most cases.

In my case, I want to allow the user to provide the data needed, and then have the data written out, or serialized, into a configuration file, which can then be read by the Agent process.

The simple way you might think to do this is to tell the YAML module of Ruby’s standard library (which is actually an alias to the Psych module) to emit the structured YAML and be done with it.

In an Erubis (ERB) template, this would look like this:

<%= YAML.dump(array_of_mash_data) %>

However, I’d like to inject a header to the array before rendering it, so I’ll do that first:

<%= YAML.dump({ 'instances' => array_of_mash_data }) %>

What this does is render a file like so:

---
instances:
- !ruby/hash:Mash
  host: localhost
  port: 9999
  extra_key: extra_val
  conf:
  - !ruby/hash:Mash
    include: !ruby/hash:Mash
      domain: org.apache.cassandra.db
      attributes:
      - BloomFilterDiskSpaceUsed
      - Capacity
      foo: bar
    exclude:
    - !ruby/hash:Mash
      domain: evil_domain

As you can see, there are these pesky lines that include a special YAML-oriented tag starting with an exclamation point – !ruby/hash:Mash – which are there to describe the data structure to any YAML loader, saying “hey, the thing you’re about to load is an instance of XYZ, not an array, hash, string or integer”.

Unfortunately, when parsing this file from the Python side of things to load it in the Agent, we get some unhappiness:

$ sudo service datadog-agent configcheck
your.yaml contains errors:
    could not determine a constructor for the tag '!ruby/hash:Mash'
  in "<byte string>", line 7, column 5

So it’s pretty apparent that I can do one of two things:

teach Python how to interpret a Ruby + Mash constructor
figure out how to remove these from being rendered

The latter seemed most likely, since I didn’t really want to teach Python anything new, especially since this is really a Hash (or a dict, in pythonese).

So I experimented with taking items from the Mash, and running them through a built-in method to_hash – which seemed likely to work.

Not really.

<%= YAML.dump({ 'instances' => @instances.map { |item| item.to_hash }}) %>

That code only steps into the first layer of the data structure and converts the segment starting with host: localhost into a Hash, but the sub-keys remain Mash objects. Grr.
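Here's a minimal sketch of that shallow conversion that runs without Chef. FakeMash is a hypothetical stand-in for Mash (just a Hash subclass), and its to_hash is an assumption written to mirror the one-level behavior described above, not Chef's actual implementation.

```ruby
require 'yaml'

# Hypothetical stand-in for Chef's Mash: a Hash subclass whose to_hash
# only converts the top level (an assumption, to mirror the behavior above).
class FakeMash < Hash
  def self.from(hash)
    new.merge!(hash)
  end

  def to_hash
    {}.merge(self)
  end
end

# A small nested structure, loosely shaped like the instances data above.
def sample_instances
  [FakeMash.from(
    'host' => 'localhost',
    'conf' => [FakeMash.from('include' => FakeMash.from('foo' => 'bar'))]
  )]
end

# Same idea as the template line: convert each top-level item, then dump.
def shallow_yaml(instances)
  YAML.dump({ 'instances' => instances.map { |item| item.to_hash } })
end

puts shallow_yaml(sample_instances)
```

The top-level item comes out as a plain mapping, while the nested objects still carry a !ruby/hash:FakeMash tag, since Psych tags any Hash subclass it dumps.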

Digging around, I found other reported problems where people have extended Chef objects with some interesting methods trying to solve the same problem.

This means that I’d have to add library code to my project, then modify the template renderer to make the helper code available, then tell the template to render it using these subclassed methods, and then have to worry about it.

ARGH.

Instead, I tried another tactic, which seems to have worked out pretty well.

Rather than trying to walk a data structure of any size and catch every leaf of the tree, I turned to another mechanism to “strip” out the Ruby-specific data structure details while keeping the same structure: the ol’ faithful, JSON.

By using built-ins to convert the Mash to a JSON string, then having the JSON library parse it back into a data structure, and then serializing that to YAML, we remove all of the extras from the picture, leaving us with a slightly modified ERB method:

<%= JSON.parse(({ 'instances' => @instances }).to_json).to_yaml %>
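The same round trip can be tried outside of a template. This is a sketch under assumptions: FakeMash is again a hypothetical Hash subclass standing in for Mash, and plain_yaml is a helper name made up for the example.

```ruby
require 'json'
require 'yaml'

# Hypothetical stand-in for Chef's Mash: any Hash subclass gets tagged by Psych.
class FakeMash < Hash; end

# Serialize to JSON, parse back into plain Hash/Array objects, then emit YAML.
def plain_yaml(data)
  JSON.parse(data.to_json).to_yaml
end

nested = FakeMash.new.merge!(
  'host' => 'localhost',
  'conf' => [FakeMash.new.merge!('foo' => 'bar')]
)

puts plain_yaml({ 'instances' => [nested] })
```

One trade-off worth knowing: JSON only has strings, numbers, booleans, null, arrays and objects, so anything else in the data (symbols, for example) comes back as one of those types.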

I then took to benchmarking both methods to see if there would be any significant impact on performance for doing this. Details are over here. Short story: not much impact.
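The linked benchmark isn't reproduced here. As a rough sketch of how such a comparison could be run, the following assumes the alternative is a recursive walk that rebuilds the structure from plain objects; deep_plain, via_walk, via_json and FakeMash are all hypothetical names, and timings will vary by machine and data size.

```ruby
require 'benchmark'
require 'json'
require 'yaml'

# Hypothetical stand-in for Chef's Mash.
class FakeMash < Hash; end

# Recursively rebuild the structure out of plain Hash and Array objects.
def deep_plain(obj)
  case obj
  when Hash  then obj.each_with_object({}) { |(k, v), out| out[k] = deep_plain(v) }
  when Array then obj.map { |v| deep_plain(v) }
  else obj
  end
end

def via_walk(data)
  deep_plain(data).to_yaml
end

def via_json(data)
  JSON.parse(data.to_json).to_yaml
end

SAMPLE = { 'instances' => [FakeMash.new.merge!('host' => 'localhost', 'port' => 9999)] }

# Print timings for both approaches over the same number of iterations.
Benchmark.bm(6) do |x|
  x.report('walk:') { 1_000.times { via_walk(SAMPLE) } }
  x.report('json:') { 1_000.times { via_json(SAMPLE) } }
end
```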

So I’m pretty happy with the way this turned out, and even if I’m moving objects back and forth between serialization formats, the end result is something the next program (Datadog Agent) can consume.

Hope you enjoyed!

The Importance of Dependency Testing

Recently I revisited an Open Source project I started over a year ago.

This tool is built to hook into a much larger framework (Chef), leverages a bunch of code many other people have written, and produces a specific result that I was looking for.

This post is less about the tool itself, and more about the process and procedure involved in testing dependencies.

This project is written in Ruby, and as many have identified in articles and tweets, some project maintainers don’t adhere to a versioning policy, making it hard to ensure working software across multiple versions of dependencies.
A lot rides on the maintainer’s adherence to a versioning standard – one very popular one is Semantic Versioning, or SemVer for short.

This introduces a few other questions, like how frequently should a writer release new versions of code, how frequently should users upgrade to leverage new fixes, features, etc.

In any case, my tool was restricted to running the framework’s version 10.x, considering that between major versions, functionality may change, and that there is no guarantee that my tool will continue working.

A new major version of Chef was released earlier this year and most of my existing projects are still on Chef 10.x, as this is still being updated with stability fixes and security patches, and the ‘jump’ to 11 is not on the schedule right now, so my tool continues functioning just fine.

Time passes, and I have a project running Chef 11 that I want to use my tool with.

Whoops. There’s a constraint built in to the tool’s syntax of dependencies that will report that “you have Chef 11, this wants Chef 10.x and not higher, have a nice day”.
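To see how a constraint like that behaves, RubyGems' own classes can be poked at directly. This is only an illustration: the constraint strings and version numbers below are made up for the example, not the ones from the tool, and allowed? is a hypothetical helper.

```ruby
require 'rubygems'

# Hypothetical helper: does a given version satisfy a dependency constraint?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

puts allowed?('~> 10.0', '10.26.0')            # true:  still within 10.x
puts allowed?('~> 10.0', '11.4.0')             # false: "10.x and not higher"
puts allowed?(['>= 10.0', '< 12'], '11.4.0')   # true:  a loosened constraint
```

The pessimistic operator (~>) is what encodes "this major version only", so loosening it is a one-line change in the dependency declaration, which is exactly why it's worth re-running the tests against every version the new range admits.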

So I change the constraint, install locally, and see that it works. Yay!

Now I want to commit the change that I made to the version constraint logic, but I want to continue testing the tool against the 10.x versions, as I should continue to support the active versions for as long as they are alive and in use, right?

A practice I was using for the tests that I had written was: given a static list of Chef versions, use the static entry as the Chef version for installation/test.

This required me to update the static list each time a new version of Chef was released, and potentially was testing against versions that didn’t need testing – rather I wanted to test against the latest of the mainline release.

I updated my constraint, ran the test suite that I’ve written, and whoops, it failed the tests.

Functionality-wise, it worked correctly on both versions, so the problem must be in my test suite, right?

I found a cool project called Appraisal, that’s been around for a while, and
used by a bunch of other projects, and you can read more about it here.
It allows one to specify multiple version constraints and test against each of them with the same test suite.

Sure enough, passes on version 10, not version 11. Same code, same tests. #wat

So now it’s time to do some digging. I read through some of the Chef ChangeLog, and decide there’s too much to wade through there, rather let’s take a look at the code my tool is using.

The failure was triggering here, and was showing a default value.
This meant that Chef was no longer loading the configuration file that I provide here correctly.

So I took a look at the current version of the configuration loader, and visually compared it with the 10.x version.
Sure enough, there’s one small change that’s affecting me: Old vs New

working_directory? What’s this? Oh, it’s over here, just a few lines prior.

Reading the full commit, and the original ticket, it seems like this is indeed a good idea, but why are my tests failing?

After further digging around in the aruba test suite extension I’m using, I realize that the environment variable PWD remains set to the actual working directory of my shell, not the test suite’s subprocesses.
Thus every time it runs, the chef_config_dir is looking in my current directory, not the directory the tests are running in.

After poking around aruba’s source code, and adding some debugging statements during test runs, I figured out that I need the test suite to change its PWD environment variable based on the test’s execution, which led to this commit.

Why is this different? Well, before, Ruby’s Dir.pwd statement would be invoked from inside the running test, loading the config from a location relative to Dir.pwd, where I was placing the test config file.
Now the test was trying to load the config from the process’ environment variable PWD instead, and failing to find the config.
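The mismatch is easy to reproduce in plain Ruby, with no Chef or aruba involved. A minimal sketch (pwd_views is a hypothetical helper name): changing directory from inside the process updates Dir.pwd, but the inherited PWD environment variable is left untouched.

```ruby
require 'tmpdir'

# Returns [Dir.pwd, ENV['PWD']] as seen from inside a temporary directory.
def pwd_views
  Dir.mktmpdir do |tmp|
    Dir.chdir(tmp) { [Dir.pwd, ENV['PWD']] }
  end
end

dir_pwd, env_pwd = pwd_views
puts "Dir.pwd:    #{dir_pwd}"
puts "ENV['PWD']: #{env_pwd.inspect}"
```

Any code that reads ENV['PWD'] will therefore keep seeing the directory the shell started in, unless the test setup assigns the variable itself after changing directory.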

Tests pass, and now I can have Travis CI continue to test my code with multiple dependencies when it changes and catch things before they go badly.

All in all, an odd behavior to expect in a normal situation, as my tool is meant to be run interactively by a user, not via a test suite that mocks up all sorts of other environments.

So I spent about 2-3 hours digging around to essentially change one line that makes things work better and cleaner than before.

Worth it? Completely, as these changes will allow me to continue to ensure that my tool remains working with upstream releases of the framework, and maintain compatibility with supported versions of the framework.

TL, DR: Don’t skimp on testing your project against multiple versions of external dependencies, especially when your target users are going to be using more than one possible version.

P.S. Shout out to my girlfriend who generously lets me spend time hacking on these kinds of things 😀