Extending ECS Auto-scaling for under $2/month with Lambda

The Problem

Amazon Web Services (AWS) is pretty cool. You ought to know that by now. If you don’t, take a few hours, check out some tutorials, and play around.

One of the many services AWS provides is the EC2 Container Service (ECS), where the scheduling and lifecycle management of running Docker containers is handled by the ECS control plane (probably magic cooked up in Seattle over coffee or in Dublin over a pint or seven).

You can read all about its launch here.

One missing feature from the ECS offering in comparison to other container schedulers was the concept of scheduling a service to be run on each host in a cluster, such as a logging or monitoring agent.
This feature allows clusters to grow or shrink and still have the correct services running on each node.

A published workaround was to have each node individually run an instance of the defined task on startup, which works pretty well.

The downside here is that if a task definition changes, ECS has no way of triggering an update to the running tasks – normal services will stop and then start the task with the new definition, and use your logic to maintain some degree of uptime.
To achieve the update, one must terminate/replace the entire ECS Container Instance (the EC2 host) and if you’re using AutoScalingGroups, get a fresh node with the updated task.

Other Solutions

  • Docker Swarm calls this a global service, and will run one instance of the service on every node.
  • Mesos’ Marathon doesn’t support this yet, and there is deep discussion on GitHub about how to implement it in their constraints syntax.
  • Kubernetes has a DaemonSet to run a pod on each node.
  • The recently-released ECS-focused Blox provides a daemon-scheduler to accomplish this, but brings along extra components to accomplish the scheduling.

Back to ECS

So imagine my excitement when the ECS team announced the release of their new Task Placement Strategies last week, offering a “One Task Per Host” strategy as part of the Service declaration.
This indeed is awesome and works as advertised, with no extra components, installs, schedulers, etc.

However! Currently each Service requires a “Desired Count” parameter of how many instances of this service you want to run in the cluster.

Given a cluster with 5 ECS Container Instance hosts, setting the Desired Count to 5 ensures that one runs on each host, provided there are resources available (CPU, RAM, an available port).

If the cluster grows to 6 (autoscaling, manually adding, etc), there’s nothing in the Service definition that will increase the desired count to 6, so this solution is actually worse off than our previous mode of using user-data to run the task at startup.

One approach is to arbitrarily raise the Desired count to a very high number, such as 100 for this cluster, with the consideration that we are unlikely to grow the cluster to this size without realizing it.
The scheduler will periodically examine the cluster for placement, and handle any hosts missing the service.

The problem with this is that it’s not deterministic: CloudWatch metrics will report the unplaced tasks as Pending, and I have alarms that notify me when tasks aren’t placed in a cluster, since that can point to a resource allocation mismatch.

Enter The Players

To accomplish an automated service desired count, we must use some elements to “glue” a few of the systems together with our custom logic.

Here’s a sequence diagram of the conceptual flow between the components.

UML Sequence Flow

Every time there is a change in an ECS Cluster, CloudWatch Events will receive a payload.
Based on a rule we craft to select events classified as “Container Instance State Change”, CW Events will emit an event to the target of your choice, in our case, Lambda.

We could feasibly use a cron-like schedule to fire this every N minutes to inspect, evaluate, and remediate a semi-static set of services/cluster, but having a system that is reactive to change feels preferable to poll/test/repair.

A simple rule that captures all Container Instance changes:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Container Instance State Change"
  ]
}

You can restrict this to specific clusters by adding the cluster’s ARN to the keys like so:

  "detail": {
    "clusterArn": [
      "arn:aws:ecs:us-east-1:123456789012:cluster/my-specific-cluster",
      "arn:aws:ecs:us-east-1:123456789012:cluster/another-cluster"
    ]
  }

If being throttled or cost is a concern here, you may wish to filter to a set of known clusters, but this reduces the reactiveness of the logic to new clusters being brought online.
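
If you prefer to create the rule programmatically rather than through the console, a minimal boto3 sketch might look like this – the rule name and function ARN are hypothetical placeholders, and the event pattern from above is reused:

import json

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

RULE_NAME = "ecs-container-instance-change"  # hypothetical rule name
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ecs-service-count"  # hypothetical

pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Container Instance State Change"],
}

# Create (or update) the rule with the event pattern
rule = events.put_rule(Name=RULE_NAME, EventPattern=json.dumps(pattern), State="ENABLED")

# Allow CloudWatch Events to invoke the function, then attach it as the rule's target
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "1", "Arn": FUNCTION_ARN}])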

The Actual Logic

The Lambda function receives the event, performs some basic validation checks to ensure it has enough details to proceed, and then makes a single API call to the ECS endpoint to find our specified service in the cluster that fired the change event.

If no such service is found, we terminate now, and move on.

If the cluster does indeed have this service defined, then we perform another API call to describe the count of registered container instances, and compare that with the value we already have from the service definition call.

If there’s a mismatch, we perform a final third API call to adjust the service definition’s desired task count.

All in all, a maximum total of 3 possible API calls, usually in under 300ms.
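
The actual implementation lives on GitHub (linked in the conclusions below), but as a rough illustration, a minimal Python/boto3 sketch of that three-call flow might look like this – the per-host service name is a hypothetical placeholder:

import boto3

ecs = boto3.client("ecs")
SERVICE_NAME = "logging-agent"  # hypothetical name of the one-task-per-host service

def handler(event, context):
    detail = event.get("detail", {})
    cluster_arn = detail.get("clusterArn")
    if not cluster_arn:
        return  # not enough detail to act on

    # Call 1: does this cluster run our service at all?
    services = ecs.describe_services(cluster=cluster_arn, services=[SERVICE_NAME])["services"]
    if not services or services[0]["status"] != "ACTIVE":
        return  # no such service here, move on

    desired = services[0]["desiredCount"]

    # Call 2: how many container instances are registered right now?
    cluster = ecs.describe_clusters(clusters=[cluster_arn])["clusters"][0]
    registered = cluster["registeredContainerInstancesCount"]

    # Call 3: adjust the desired count only if it differs
    if desired != registered:
        ecs.update_service(cluster=cluster_arn, service=SERVICE_NAME, desiredCount=registered)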

In my environment, I want this logic to apply to every cluster in my account, since the function inspects each cluster to see whether it has the service defined before acting on it.
In my ballpark figures, with a set of 10 active clusters, the cost of running this logic should be under $2/month – yes, two dollars a month to ensure your clusters have the correct number of tasks for a given service.
Do your own estimation with the Lambda Pricing Calculator.

Conclusions

The code can be found on GitHub, and was developed with a test-everything philosophy, where I spent a large amount of time learning how to actually write the code and tests elegantly.
Writing out all of the tests and sequences allowed me to find multiple points of refactoring and increased efficiency from my first implementation, leading to a much cleaner solution.
Taking on a project like this is a great way to increase one’s own technical prowess, leading to the ability to reason about other problems.

While I strongly believe that this feature should be part of the ECS platform and not require any client-side intervention, the ability to take the current offerings and extend them via mechanisms such as Events, Lambda and API calls further demonstrates the flexibility and extensibility of the AWS ecosystem.
The feature launched just over a week ago, and I’ve been able to put together an acceptable solution on my own, using the documentation, tooling, and infrastructure while minimizing costs and making my system more reactive to change.

I look forward to what else the ECS, Lambda and CloudWatch Events team cook up in the future!

Setting Up a Datadog-to-AWS Integration

When approaching a new service provider, it can sometimes be confusing to figure out how to get set up to best communicate with them – some processes involve multiple steps, multiple interfaces, confusing terminology, and more.

Amazon Web Services is an amazing cloud services provider, and in order to allow access to informational services inside a customer’s account, a couple of well-known mechanisms exist to delegate access:

  • Account Keys, where you generate a key and secret and share them. The other party stores these (usually in either clear text or using reversible encryption) and uses them as needed to make API calls
  • Role Delegation, where you create a Role and shared secret to provide to the external service provider, who is then allowed to use their own internal security credentials to request temporary access to your account’s resources via API calls

In the former model, the keys are exchanged once, and once out of your immediate domain, you have little idea what happens to them.
In the latter, a rule is put into place that requires ongoing authenticated access to request assumption of a known role with a shared secret.

Luckily, in both scenarios, a restrictive IAM Policy is in place that allows only the actions you’ve decided to allow ahead of time.
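
To make the role-delegation flow concrete, here is a minimal boto3 sketch of the call the external provider makes with its own credentials – the role ARN and External ID below are hypothetical placeholders:

import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/DatadogIntegrationRole"  # hypothetical
EXTERNAL_ID = "shared-external-id"  # hypothetical shared secret

# The provider authenticates with its own credentials and requests temporary access
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="datadog-integration",
    ExternalId=EXTERNAL_ID,
)["Credentials"]

# The temporary credentials only permit whatever the attached IAM Policy allows
ec2 = boto3.client(
    "ec2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(ec2.describe_instances())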

Setting up the desired access is made simpler by having good documentation on how to do this manually. In this modern era, we likely want to keep our infrastructure as code where possible, as well as have a mechanism to apply the rules and test later if they are still valid.

Here’s a quick example I cooked up using Terraform, a new, popular tool to compose cloud infrastructure as code and execute to create the desired state.
[gist https://gist.github.com/miketheman/72197ec28bd527137e196054b3ab6dec#file-datadog-role-delegation-tf /]

The output should look a lot like this:

[gist https://gist.github.com/miketheman/72197ec28bd527137e196054b3ab6dec#file-output-sh-session /]

The Account ID shown in the output is actually part of a full ARN, and you can copy your Account ID from there.
Terraform doesn’t have a mechanism to emit only the Account ID yet – so if you have some ideas, contribute!

Use the Account ID, Role Name and External ID and paste those into the Datadog Integrations dialog, after selecting Role Delegation. This will immediately validate that the permissions are correct, and return an error otherwise.

Don’t forget to click “Install Integration” when you’re done (it’s at the very bottom of the screen).

Now metrics and events will be collected by Datadog from any allowed AWS services, and you can keep this setup instruction in any revision system of your choice.

P.S. I tried to set this up via CloudFormation (Sparkleformation, too!). I ended up writing it “freehand”, and it took more than 3 times as long to get similar functionality.

You can see the CloudFormation Stack here, and decide which works for you.



Counts are good, States are better

Datadog is great at pulling in large amounts of metrics, and provides a web-based platform to explore, find, and monitor a variety of systems.

One such system integration is PostgreSQL (aka ‘Postgres’, ‘PG’) – a popular Open Source object-relational database system, ranking #4 in its class (at the time of this writing), with over 15 years of active development, and an impressive list of featured users.
It’s been on an upwards trend for the past couple of years, fueled in part by Heroku Postgres, and has spawned entire companies that support running Postgres, as well as Amazon Web Services providing PG as one of the engines in their RDS offering.

It’s awesome at a lot of things that I won’t get into here, but it is definitely my go-to choice for relational data.

One of the hardest parts of any system is determining whether the current state of the system is better or worse than before, and tracking down the whys, hows and wheres it got to a worse state.

That’s where Datadog comes in – the Datadog Agent has included PG support since 2011, and over the past 5 years, has progressively improved and updated the mechanisms by which metrics are collected. Read a summary here.

Let’s Focus

Postgres has a large number of metrics associated with it, and there’s much to learn from each.

The one metric that I’m focusing on today is the “connections” metric.

By establishing a periodic collection of the count of connections, we can examine the data points over time and draw lines to show the values.
This is built-in to the current Agent code, named postgresql.connections in Datadog, by selecting the value of the numbackends column from the pg_stat_database table.

01-default-connections

Another two metrics exist, introduced into the code around 2014, that assist with alerting on the reported counts.
These are postgresql.max_connections and postgresql.percent_usage_connections.

(Note: Changing PG’s max_connections value requires a server restart and in a replication cluster has other implications.)

The latter, percent_usage_connections, is a calculated value, returning ‘current / max’, which you could compute yourself in an alert definition if you wanted to account for other variables.
It is normally sufficient for these purposes.

02-pct_used-connections

A value of postgresql.percent_usage_connections:0.15 tells us that we’re using 15% of our maximum allowable connections. If this hits 1, then we will receive this kind of response from PG:

FATAL: too many connections for role...

And you likely have a Sad Day for a bit after that.

Setting an alert threshold at 0.85 – or a Change Alert to watch the percent change in the values over the previous time window – should prompt an operator to investigate the cause of the connections increase.
This can happen for a variety of reasons such as configuration errors, SQL queries with too-long timeouts, and a host of other possibilities, but at least we’ll know before that Sad Day hits.

Large Connection Counts

If you’ve launched your application, and nobody uses it, you’ll have very low connection counts, you’ll be fine. #dadjoke

If your application is scaling up, you are probably running more instances of said application, and if it uses the database (which is likely), the increase in connections to the database is typically linear with the count of running applications.

Some PG drivers offer connection pooling to the app layer, so as methods execute, instead of opening a fresh connection to the database (which is an expensive operation), the app maintains some amount of “persistent connections” to the database, and the methods can use one of the existing connections to communicate with PG.

This works for a while, especially if the driver can handle application concurrency, and if the overall count of application servers remains low.

The Postgres Wiki has an article on handling the number of database connections, in which the topic of a connection pooler comes up.
An excerpt:

If you look at any graph of PostgreSQL performance with number of connections on the x axis
and tps on the y access [sic] (with nothing else changing), you will see performance climb as
connections rise until you hit saturation, and then you have a “knee” after which performance
falls off.

The need for connection pooling is well established, and the decision to not have this part of core is spelled out in the article.

So we install a PG connection pooler, like PGBouncer (or pgpool, or something else), configure it to connect to PG, and point our apps at the pooler.

In doing so, we configure the pooler to establish some amount of connections to PG, so that when an application requests a connection, it can receive one speedily.

Interlude: Is Idle a Problem?

Over the past 4 years, I’ve heard the topic raised again and again:

If the max_connections is set in the thousands, and the majority of them are in idle state,
is that bad?

Let’s say that we have 10 poolers, and each establishes 100 connections to PG, for a max of 1000. These poolers serve some large number of application servers, but have the 1000 connections at-the-ready for any application request.

It is entirely possible that most of the time, a significant portion of these established connections are idle.

You can see a given connection’s state in the pg_stat_activity table, with a query like this:

SELECT datname, state, COUNT(state)
FROM pg_stat_activity
GROUP BY datname, state
HAVING COUNT(state) > 0;

A sample output from my local dev database that’s not doing much:

datname  | state  | count
---------+--------+-------
postgres | active |     1
postgres | idle   |     2
(2 rows)

We can see that there is a single active connection to the postgres database (that’s me!) and two idle connections from a recent application interaction.

If it’s idle, is it harming anyone?

A similar question was asked on the PG Mailing List in 2015, to which Tom Lane responded on the topic of idle connections (see the link for the full quote):

Those connections have to be examined when gathering snapshot information, since you don’t know that they’re idle until you look.
So the cost of taking a snapshot is proportional to the total number of connections, even when most are idle.
This sort of situation is known to aggravate contention for the ProcArrayLock, which is a performance bottleneck if you’ve got lots of CPUs.

So we now know why idling connections can impact performance, despite not doing anything, especially with modern DBs that we scale up to multi-CPU instances.

Back to the show!

Post-Pooling Idling

Now that we know that high connection counts are bad, and that we can cut the total count of connections with pooling strategies, we must ask ourselves: how many connections do we actually need to establish, while avoiding a high count of idling connections that impact performance?

We could log in, run the SELECT statement from before, and inspect the output, or we could add this to our Datadog monitoring, and trend it over time.

The Agent docs show how to write an Agent Check, and you could follow the current postgres.py to write another custom check, or you could use the nifty custom_metrics syntax in the default postgres.yaml to extend the existing check with more queries.

Here’s an example:

custom_metrics:
  - # Postgres Connection state
    descriptors:
      - [datname, database]
      - [state, state]
    metrics:
      COUNT(state): [postgresql.connection_state, GAUGE]
    query: >
      SELECT datname, state, %s FROM pg_stat_activity
      GROUP BY datname, state HAVING COUNT(state) > 0;
    relation: false

Wait, what was that?

Let me explain each key in this, in an order that made sense to me, instead of alphabetically.

  • relation: false informs the check to perform this once per collection, not against each of any specified tables (relations) that are part of this database entry in the configuration.
  • query: This is pretty similar to our manual SELECT, with one key differentiation – the %s informs the query to replace this with the contents of the metrics key.
  • metrics: For each entry in here, the query will be run, substituting the key into the query. The metric name and type are specified in the value.
  • descriptors: Each column returned has a name, and here’s how we convert the returned name to a tag on the metric.

Placing this config section in our postgres.yaml file and restarting the Agent gives us the ability to define a query like this in a graph:

sum:postgresql.connection_state{*} by {state}

03-conn_state-by-state

As can be seen in this graph, the majority of my connections are idling, so I might want to re-examine my application or pooler configuration.

Who done it?

Let’s take this one step further, and ask ourselves – now that we know the state of each connection, how might we determine which of our many applications connecting to PG is idling, and target our efforts?

As luck would have it, back in PG 8.5, a change was added to allow for clients to set an application_name value during the connection, and this value would be available in our pg_stat_activity table, as well as in logs.

This typically involves setting a configuration value at connection startup. In Django, this might be done with:

DATABASES = {
  'default': {
    'ENGINE': 'django.db.backends.postgresql',
    # ...
    'OPTIONS': {
      'application_name': 'myapp',
    },
    # ...
  },
}

No matter what client library you’re using, most have the facility to pass extra arguments along, some in the form of a database connection URI, so this might look like:

postgresql://user@host/otherdb?connect_timeout=10&application_name=myapp

Again, this all depends on your client library.
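
For example, with psycopg2 in Python, application_name can be passed as a regular connection parameter, since it is a standard libpq setting – the host, database, and names below are hypothetical:

import psycopg2

# application_name is passed straight through to the server,
# and will show up in pg_stat_activity and in the logs
conn = psycopg2.connect(
    host="db.example.com",
    dbname="otherdb",
    user="myuser",
    application_name="worker-medium",
)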

I can see clearly now

So now that we have the configuration in place, and have restarted all of our apps, a modification to our earlier Agent configuration code for postgres.yaml would look like:

custom_metrics:
  - # Postgres Connection state
    descriptors:
      - [datname, database]
      - [application_name, application_name]
      - [state, state]
    metrics:
      COUNT(state): [postgresql.connection_state, GAUGE]
    query: >
      SELECT datname, application_name, state, %s FROM pg_stat_activity
      GROUP BY datname, application_name, state HAVING COUNT(state) > 0;
    relation: false

With this extra dimension in place, we can craft queries like this:

sum:postgresql.connection_state{state:idle} by {application_name}

04-conn_state-idle-by-app_name

So now I can see that my worker-medium application has the most idling connections, so there’s some tuning to be done here – either I open too many connections for the application, or it’s not doing much.

I can confirm this by refining the query structure to narrow in on a single application_name:

sum:postgresql.connection_state{application_name:worker-medium} by {state}

05-conn_state-app_name-by-state

So now I’ve applied a methodology of surfacing connection states and increased visibility into what’s going on, all before making any changes to resolve the issue.

Go forth, measure, and learn how your systems evolve!

There’s a New Player in Town, named Habitat

You may have heard some buzz around the launch of Chef‘s new open source project Habitat (still in beta), designed to change a bit of how we think about building and delivering software applications in the modern age.

There’s a lot of press, video announcement, and even a Food Fight Show where we got to chat with some of the brains behind the framework, and get into some of the nitty-gritty details.

In the vibrant Slack channel where a lot of the fast-paced discussion happens with a bunch of the core habitat developers, a community member had brought up a pain point, as many do.
They were trying to build a Python application, and had to resort to fiddling pretty hard with either the PYTHONPATH variable or with sys.path post-dependency install.
One even used Virtualenv inside the isolated environment.

I had been working on making an LLVM compiler package, and since it is notoriously slow to compile on my laptop, I used the waiting time to get a Python web application working.

My setup is OSX 10.11.5, with Docker (native) 1.12.0-rc2 (almost out of beta!).

I decided to use the Flask web framework to carry out a Hello World, as it would prove a few pieces:

  • Using Python to install dependencies using pip
  • Adding “local” code into a package
  • Importing the Python package in the app code
  • Executing the custom binary that the Flask package installs

Key element: it needed to be as simple as possible, but no simpler.

On my main machine, I wrote my application.
It listens on port 5000, and responds with a simple phrase.
Yay, I wrote a website.
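
The app itself is nothing fancy – a minimal Flask “Hello World” along these lines (this is a sketch; the actual source lives in the repo linked below):

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello, World!\n"

if __name__ == "__main__":
    # Listen on all interfaces so the supervisor/container can expose port 5000
    app.run(host="0.0.0.0", port=5000)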

Then I set about packaging it into a deliverable where, in habitat’s nomenclature, it becomes a self-contained package, which can then be run via the habitat supervisor.

This all starts with getting the habitat executable, conveniently named hab.
A recent addition to the Homebrew Casks family, installing habitat was as simple as:

$ brew cask install hab

Habitat version 0.7.0 was in use during the authoring of this article.

I sat down and wrote a plan.sh file that describes how to put the pieces together.

There’s a bunch of phases in the build cycle that are fully customizable, or “stub-able” if you don’t want them to take the default action.
Some details were garnered from here, despite my package not being a binary.

Once I got my package built, it was a matter of figuring out how to run it, and one of the default modes is to export the entire thing as a Docker image, so I set about to run that, to get a feel for the iterative development cycle of making the application work as configured within the habitat universe.

(This step usually isn’t the best one for regular application development, but it is good for figuring out what needs to be configured and how.)

# In first OSX shell
$ hab studio enter
[1][default:/src:0]# build
...
   python-hello: Build time: 0m36s
[2][default:/src:0]# hab pkg export docker miketheman/python-hello
...
Successfully built 2d2740a182fb
[3][default:/src:0]#

# In another OSX shell:
$ docker run -it -p 5000:5000 -p 9631:9631 miketheman/python-hello
hab-sup(MN): Starting miketheman/python-hello
hab-sup(GS): Supervisor 172.17.0.3: cb719c1e-0cac-432a-8d86-afb676c3cf7f
hab-sup(GS): Census python-hello.default: 19b7533a-66ba-4c6f-b6b7-c011abd7dbe1
hab-sup(GS): Starting inbound gossip listener
hab-sup(GS): Starting outbound gossip distributor
hab-sup(GS): Starting gossip failure detector
hab-sup(CN): Starting census health adjuster
python-hello(SV): Starting
python-hello(O):  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

# In a third shell, or use a browser:
$ curl http://localhost:5000
Hello, World!

The code for this example can be found in this GitHub repo.
See the plan.sh and hooks/ for Habitat-related code.
The src/ directory is the actual Python app.

At this point, I declared success.

There’s a large amount of other pieces to the puzzle that I hadn’t explored yet, but getting this part running was the first one.
Items like interacting with the supervisor, director, healthchecks, topologies – these have some basic docs, but there’s not a bevy of examples or use cases yet to lean upon for inspiration.

During this process I uncovered a couple of bugs, submitted some feedback, and the team is very receptive so far.
There’s still a bunch of rough edges to be polished down, many around the documentation, use cases and how the pieces fit together, and what benefit it all drives.

There appears to be some hooks for using Chef Delivery as well – I haven’t seen those yet, as I don’t use Delivery.
I will likely try looking at making a larger strawman deployment to test these pieces another time.

I am looking forward to seeing how this space evolves, and what place habitat will take in the ever-growing, ever-evolving software development life-cycle, as well as how the community approaches these concepts and terminology.

Fixing unintended consequences of the past

In the age of technology, everyone races forward to get the win. Anything that can provide you a competitive edge is considered important.
This is especially true in the realm of web media, where optimizing for page load times, providing secure transport, adhering to standards can make a difference in how a site is handled by client browsers, ranked by search engines, and most importantly how it is seen by viewers.

To this end, there are many sites, services and companies that will provide methods to audit a site and point out what could be problematic – count broken links, produce reports of actionable corrections, and more.
Some are better than others, and occasionally, you’ll come across something you’ve never seen before.

Recently, I was pinged about pages on a site that is hosted on an Amazon Simple Storage Service (S3) website-enabled bucket.
Since S3 is an object store only, this means that the pages in this site are statically generated and there is no associated web server, backend database, or other components to serve the pages.

This model is becoming more common for sites that can be simplified to run without dynamically loading data from a database, can withstand heavy bursts of requests, and run cheaply (there’s even a free tier, beyond which pricing still remains affordable).

The idea is that you create your content in one format, run a compiler process to generate all the rendered files containing the links and content, and then upload the compiled files to the S3 location to be requested by browsers. There are many guides on the web on how to do this – I’m not going to link to any now; search and ye shall find.

This particular site had been deployed since 2011 – and the mechanism to copy compiled files to S3 has been using the popular open source command line tool s3cmd – deployment basically looked like this (and still does!):

 s3cmd sync output/ s3://www.mysite.com

where output/ contains the compiled files, ready for deployment.

This has worked very well for over 4 years – until it came to my attention that when uploading to S3, the s3cmd tool was adding some metadata to each file as it uploaded it, as part of the design to support website hosting on S3.

For instance, when uploading a .css file to S3, s3cmd attempts to determine extra details about the file, and set the correct metadata for browsers to understand, such as Content-Type: text/css.
This is a critical function, as it would be tedious to determine each file’s content type and set it manually across many files.
You can read more about content media types on Wikipedia.

Since this project was set up a long time ago, the version of s3cmd in use was still in alpha stage – it was used because it performed well enough, and nothing broke, so we were happy to keep running the same version since early 2013.

The problem reported to me was that many files on the site were returning an invalid Content-Encoding value – something that typically isn’t a problem, as the client’s browser will send an Accept-Encoding header when making a request, typically something along the lines of:

Browser: Hi there! Can I have this resource, and I'll accept a response encoded in the following formats: a, b, or c
Server: No problem! Here's the resource you're looking for, with a content encoded in XYZZY

Now, the XYZZY in this example was being set by the s3cmd upload process, and it was determined to be a bug and fixed in late 2013, but since we never knew about the problem, and the site loads just fine, we never addressed it.
There have been even more stability fixes and releases of s3cmd since – as recently as February 2015.

The particular invalid encodings being set were UTF-8 and ANSI_X3.4-1968. While these are valid encodings for files, they are invalid values for the Content-Encoding field.

Here’s an example of how to show the headers of a particular remote file:

$ curl -sI http://www.mysite.com/static/css/style.css | grep Content
Content-Encoding: ANSI_X3.4-1968
Content-Type: text/css
Content-Length: 7073

Many modern browsers will send something along the lines of ‘Accept-Encoding: gzip, deflate, sdch‘ in their request header, in hopes that the server can respond with one that matches, and then save on overall bytes sent over the wire, to speed up pages.

It’s the responsibility of the client (browser) to handle the response. I looked into the source code of Chromium (the basis for Google Chrome), and can see from here that in my example above, a Content-Encoding type of XYZZY will pretty much be ignored, which in this case is fine, since we’re sending an invalid type.

So if there’s no direct user impact, why should we care? Well, according to some popular ranking engines:

Using non-HTML content types for landing pages results in significantly reduced SEO ranking.

So all of this is fine, cool – update the s3cmd tool to a newer version, and upload the output files again? Well, it’s not that simple.

During a sync operation, s3cmd determines which files have changed and only uploads those – it doesn’t reset the metadata on unchanged objects, since doing so would effectively mean writing a new object, and the files themselves haven’t changed.

One solution might be to edit every file, add an extra space somewhere – maybe an extra blank line at the end – then compile, deploy the changed files – however this might take too long.

Instead, I decided to solve the problem by iterating over every object in the bucket, checking whether it had the incorrect Content-Encoding set, and creating a new copy of the file without the header set.

This was pretty straightforward, once I understood the concept of object immutability – once written, you can’t change it, rather what feels like a change from a user interface actually creates a new version of the object with the new settings/metadata.

I also didn’t want to have to download each file locally and then upload it back to S3 – that is a slow operation, and could result in extra network traffic and disk space consumption.

Instead, I used the AWS SDK for Ruby gem, and came up with a short-and-sweet solution:
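
The original script used the AWS SDK for Ruby; as an illustration of the approach, here is an equivalent sketch in Python with boto3 – the bucket name is hypothetical, and the idea is to copy each offending object over itself with replaced metadata:

import boto3

s3 = boto3.client("s3")
BUCKET = "www.mysite.com"  # hypothetical bucket name
BAD_ENCODINGS = ("UTF-8", "ANSI_X3.4-1968")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        head = s3.head_object(Bucket=BUCKET, Key=key)
        if head.get("ContentEncoding") not in BAD_ENCODINGS:
            continue
        # Copy the object onto itself, replacing the metadata and
        # dropping the bogus Content-Encoding header entirely
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            ContentType=head["ContentType"],
            Metadata=head.get("Metadata", {}),
            MetadataDirective="REPLACE",
        )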

The code aims to be short and sweet, and sure enough, post-execution, we get the response without the offending header:

$ curl -sI http://www.mysite.com/static/css/style.css | grep Content
Content-Type: text/css
Content-Length: 7073

This swift diagnosis and resolution would not have been possible had the tooling in use not been open source. Many times I was trying to figure out why something behaved the way it did, and while not familiar with the code, I could reason enough about how things work in general to apply that reasoning to my own resolution.

Support open source where possible, and happy hunting!

Read more in the standard, RFC 2616.

Tracking application performance on Heroku with Datadog

I thought about using a clickbait title – “You’ll never believe how this guy captures metrics!” – but decided that 99% of these are not worth the time invested in coming up with the catchy title.

So instead, I’ll simply talk about what I wanted to, and you be the judge of my title.

Application Performance Monitoring, or APM, is a crazily complex landscape, with an enormous amount of tooling, terminology, and providers looking to get some piece of the action.
There are many vendors, and all have their advantages, as well as disadvantages.

The vendor that I am pretty happy with (and I now work there) is Datadog.

One solution that has caught on quite well for surgical application monitoring is the use of the statsd protocol to send metrics from inside your application to a listener which can then store these metrics for querying later on. This is achieved by placing strategic “emitter” callouts in your code so that they can report metrics during runtime.
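
As an illustration, with the Python dogstatsd client from the datadog package, an emitter callout is a one-liner – the metric names here are made up, and a dogstatsd listener is assumed on the default localhost:8125:

from datadog import statsd

# Emit a counter and a gauge from inside application code
statsd.increment("myapp.page.views", tags=["page:home"])
statsd.gauge("myapp.queue.depth", 42)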

Flickr, then Etsy, started these projects, and they have been refined, ported to most languages, and are seeing adoption in companies where a focus on measuring is an important goal.
A blog post on Datadog’s implementation and extension of Statsd was written last year and goes into deeper detail.

One common question has always been “How do I collect metrics from an application running on Heroku with Datadog?”.

And I think we finally have one answer.

The Heroku Dyno container is pretty simple – you wanna run a process? Describe it in a Procfile.
You wanna scale? You tell Heroku to launch more Dynos with the process name, as specified in the Procfile.

However, the actual Dyno is a fairly limited environment by design – the root filesystem is read-only, the only writable area is in the application’s root directory, and disappears when terminated. There’s no sysvinit, upstart or systemd for people to bicker about. Use a Procfile, which is also really simple.

So a challenge to overcome became: “how to install a Datadog Agent package that runs a dogstatsd listener as a second process, inside an environment that is pretty locked down?”

First, we have to install the package. Heroku has a concept of “buildpacks” (https://devcenter.heroku.com/articles/buildpacks) that can be used to run compilation steps before adding your application code and launching it. The use of multiple buildpacks is also available, to chain steps together to achieve the desired outcome.

I read the heroku-buildpack-apt and found a bunch of good ideas, and came up with a Datadog-Agent-specific installer buildpack that drops off the package, as well as the needed environment for the runtime.

Now how do I run the listener process alongside my application?

Enter foreman. Foreman, not to be confused with “theforeman“, has long been a great way for application developers writing Heroku-targeted applications to run them locally in a similar manner that they will be run on the remote platform.

Foreman reads the Procfile, and runs the processes based on the directives contained inside.

This feature is the one that we leverage to run multiple processes on a single Dyno.

By using foreman inside the Dyno, we are able to tell foreman to run more than one process type at a time, with another Procfile that specifies the startup process for the actual application as well as the dogstatsd listener.

When deploying any code revision, Heroku will read the base Procfile, and run a foreman process inside the Dyno, which will in turn start up the app & dogstatsd.

And while foreman is a Ruby gem, your project may be in Python (use honcho), Go (use forego or goreman) and I’m sure there are others out there. I haven’t found or tested all of them, tell me if they work out for you.

I did, however, take the time to write up a README with the procedure to follow, as well as a commit-by-commit example application.

Here’s the buildpack code: http://miketheman.github.io/heroku-buildpack-datadog/

Here’s the example application: https://github.com/miketheman/buildpack-example-ruby

Here’s an image of the stats collected by the example application in Datadog, with increasing web load:
Heroku App Load

Here’s a random dog:

Hope this helps you find deeper insight into how you monitor your applications!

Update (2014-12-15)

A quick addition on this topic.

A couple of days after this was published, I had a short Twitter exchange with Bo Jeanes, after which he submitted a Pull Request to the buildpack, (as well as an update to the example app).
This simplifies the end-user’s deployment of the Agent package: the user no longer has to spend any time on Procfile-in-Procfile solutions, and it removes the need for foreman and the like inside the container – instead, the dogstatsd process is started via the .profile.d mechanism, which runs on Dyno startup.

This makes the solution even more elegant, so thanks a ton, Bo!

On the passage of time and learning

It’s been just over two years since I first wrote a little tool to help me visualize the relationships between objects in a particular system.

I had been working as a consultant for a couple of companies, and I found that all exhibited similar problems of using a powerful system, creating ad-hoc relationships where needed, and not fully following the inheritance and impact of these relationships when they change.

So, coming in and trying to first understand what was there, and then trying to untangle things to be clearer (and hopefully better), I sat down and tried to draw out in a physical space – probably a whiteboard – all of the objects, their relationships, and a “who talks to whom” diagram.

Sidebar: diagrams and visualizations are awesome. A picture is many times worth a thousand words, which is why using pictures and visual representations of hard-to-perceive patterns is key to helping others understand what you may already know.

I quickly realized a few things that were problematic with this manual approach:

  1. There were too many objects and relationships to express effectively and clearly on a whiteboard.
  2. Every time something changed in the objects or their relationships, I had to modify the diagram or start over.
  3. This is probably not the last time I’m going to have this problem, and I’m getting really good at drawing boxes with arrows.

With these things in mind, I sat down and tried to reverse-engineer my own thought process. I knew what kind of visual end result I wanted, so I started by using an open source library that helps place things in relationship to other things, and then renders that as an image.

Once I was able to manually generate the image based on the input I provided, then the focus was to use dynamic input, which was the big win, as then I could point this at any input, and get a picture rendered.

Next was packaging and testing, which became harder and harder – but I kept going and eventually was happy with the results.

There have been over 750 downloads of that first version, and I’ve tweaked a few things here and there over time, but haven’t really done much to change the actual code to incorporate any further features.

“It works, I’m done.”

Looking back at the code I wrote (all told – less than 100 lines!), I realize that if I wanted to change behavior today, it’s much harder to do, as the code itself doesn’t lend itself to be changed.

I hadn’t written any testing around the code itself, only functional tests around “if I press START, do I reach END correctly?” approaches – sometimes termed “Outside-In” testing, as the test will assert that from the outside, everything looks groovy.

These tests are slower, not as comprehensive, as trying to have a test system look at a rendered image and compare it with a known “good” one isn’t trivial either. Some libraries exist, but what if I change the assumptions of what a “good” image is? Update the comparison image? Too much work, says the lazy person in me.

So the code exists, and works, and continues to function, over time.

I took a look at it recently, and realized that it’s all one big function (also known as a ‘method’). And some measurement tools out there state that the method is simply too complex.

How can it be too complex? This method is less than 70 lines of actual code, it can’t be that complex, can it?

In the time since I’ve written this code, I’ve learned a lot, heard a lot, failed a lot, and written a lot more code, and thanks to untold amounts of other people, I’ve been getting a bit better at it.

Here’s where I’ll drop in a reference to Sandi Metz, author of POODR, and more, and talk she gave earlier this year, and I didn’t see in person, only on Youtube. It’s called “All the Little Things”, in which she takes you on a journey of looking at code, refactoring and testing, and basically how to change things to make further changes easy.

It’s a load of information, and a lot of it may not make sense if you’ve never encountered these problems and ideas. But having these ideas (and other design principles and patterns) in your toolbox enables forward progress in your own understanding of how you approach solving problems – helpful not only for solving the problem today, but for solving the problems you don’t even know about yet.

Now I look at that code, and say to myself, “Wow, I don’t really want to change anything in there until I have some better testing around parts of it”. This makes it harder to add anything new, since I don’t know what existing functionality I may break when adding new things.

So if you wrote some code, and let it sit for a while, and look at it a year or two later, you may find yourself shaking your head, with the “who even wrote this mess?” knowing full well that your past self did it.

Be kind to your future self, and try to make decisions today that will help your future self understand what choices you made and why you made them. It’s likely that your future self will have learned more by then and may make other decisions, but will appreciate the efforts of present self in the future.
It’s a weird kind of time-travel, and in the present, you’re trying to better your own future. (cue time paradox arguments)

Thanks for reading!

The Importance of Dependency Testing

Recently I revisited an Open Source project I started over a year ago.

This tool is built to hook into a much larger framework (Chef), and leverages a bunch of code many other people have written, and produce a specific result that I was looking for.

This post is less about the tool itself, and more about the process and procedure involved in testing dependencies.

This project is written in Ruby, and as many have identified in articles and tweets, some project maintainers don’t adhere to a versioning policy, making it hard to ensure working software across multiple versions of dependencies.
A lot rides on the maintainer’s adherence to a versioning standard – one very popular one is Semantic Versioning, or SemVer for short.

This introduces a few other questions, like how frequently should a writer release new versions of code, how frequently should users upgrade to leverage new fixes, features, etc.

In any case, my tool was restricted to running the framework’s version 10.x, considering that between major versions, functionality may change, and that there is no guarantee that my tool will continue working.

A new major version of Chef was released earlier this year and most of my existing projects are still on Chef 10.x, as this is still being updated with stability fixes and security patches, and the ‘jump’ to 11 is not on the schedule right now, so my tool continues functioning just fine.

Time passes, and I have a project running Chef 11 that I want to use my tool with.

Whoops. There’s a constraint built in to the tool’s syntax of dependencies that will report that “you have Chef 11, this wants Chef 10.x and not higher, have a nice day”.

So I change the constraint, install locally, and see that it works. Yay!

Now I want to commit the change that I made to the version constraint logic, but I want to continue testing the tool against the 10.x versions, as I should continue to support the active versions for as long as they are alive and in use, right?

A practice I was using for the tests that I had written was: given a static list of Chef versions, use the static entry as the Chef version for installation/test.

This required me to update the static list each time a new version of Chef was released, and potentially was testing against versions that didn’t need testing – rather I wanted to test against the latest of the mainline release.

I updated my constraint, ran the test suite that I’d written, and whoops, it failed the tests.

Functionality-wise, it worked correctly on both versions, so the problem must be in my test suite, right?

I found a cool project called Appraisal, which has been around for a while and is used by a bunch of other projects; you can read more about it here.
It allows one to specify multiple version constraints and test against each of them with the same test suite.

Sure enough, passes on version 10, not version 11. Same code, same tests. #wat

So now it’s time to do some digging. I read through some of the Chef ChangeLog, and decide there’s too much to wade through there, rather let’s take a look at the code my tool is using.

The failure was triggering here, and was showing a default value.
This meant that Chef was no longer loading the configuration file that I provide here correctly.

So I took a look at the current version of the configuration loader, and visually compared it with the 10.x version.
Sure enough, there’s one small change that’s affecting me: Old vs New

working_directory? What’s this? Oh, it’s over here, just a few lines prior.

Reading the full commit, and the original ticket, it seems like this is indeed a good idea, but why are my tests failing?

After further digging around in the aruba test suite extension I’m using, I realize that the environment variable PWD remains set to the actual working directory of my shell, not the test suite’s subprocesses.
Thus every time it runs, the chef_config_dir is looking in my current directory, not the directory the tests are running in.

After poking around aruba’s source code, and adding some debugging statements during test runs, I figured out that I need the test suite to change its PWD environment variable based on the test’s execution, which led to this commit.

Why is this different? Well, before, Ruby’s Dir.pwd statement would be invoked from inside the running test, loading the config from a location relative to Dir.pwd, where I was placing the test config file.
Now the test was trying to load the config from the process’ environment variable PWD instead, and failing to find the config.

Tests pass, and now I can have Travis CI continue to test my code against multiple dependencies when it changes, and catch things before they go badly.

All in all, an odd behavior to expect in a normal situation, as my tool is meant to be run interactively by a user, not via a test suite that mocks up all sorts of other environments.

So I spent about 2-3 hours digging around to essentially change one line that makes things work better and cleaner than before.

Worth it? Completely, as these changes will allow me to continue to ensure that my tool remains working with upstream releases of the framework, and maintain compatibility with supported versions of the framework.

TL;DR: Don’t skimp on testing your project against multiple versions of external dependencies, especially when your target users are going to be using more than one possible version.

P.S. Shout out to my girlfriend that generously lets me spend time hacking on these kind of things 😀

My foray into web development

I like browsing the web. I do it a lot throughout my day.

A lot of people work hard at making the web a cool-looking place. Some sites make simplicity look so easy, that when you look under the hood, it’s all chaos and destruction, folded and crunched together, all to present something really nice and smooth for the end-user.

I’m not a developer – much less a web dev. There’s a lot to know in any field of computing – and in web it’s pretty much the most visible part of computing as a whole, since pretty much anyone anywhere is going to use a web browser to view a site at some point.

I mean, sure, we all learned some HTML – hell, I wrote some sites back in the days of Geocities, and it was awesome to learn about tiling backgrounds of animated GIFs, and when CSS came around, minds blown!

And I left that field for the frontend developers, and went into infrastructure and operations.

And as time passes, you find yourself managing a variety of systems and knowledge, and at some point, you may say to yourself, “I wish I knew how to answer this question…”

And then you write some code to answer it. Voila! You’re a developer, of sorts.

I’m a huge fan of data visualization. Telling stories with pictures dates back millennia, and it’s very relatable to most people. Recently, I wrote a tool to help myself display the dependency complexity of Chef roles, and I found that, while being very useful, the output is very limited, as it’s a static generated image, whereas we live in a web-friendly world where everything is interactive and fun!

So when I came across another hard question I wanted to answer, I thought, “Why not make this a web application?”

This time, the question I wanted to answer was: As a GitHub Organization owner for my company, what human-to-team-to-software-repository relationships do we have, and are they secure?

If you’ve ever managed an Organization in GitHub, there are a few key elements.

  1. An Organization can have many Repositories
  2. An Organization can have many Teams
  3. A Repository can have many Teams
  4. A Team can have many members, but only one permission (read only, read/write, owner)

So sorting out who is on what Team, what access they have, across many repositories, can be a security nightmare. Especially when you have more than 4-5 repositories.
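
All of that data is available from the GitHub API. As a rough sketch of pulling the team-to-repository relationships in Python (the organization name is hypothetical, the token needs read:org scope, and pagination is ignored for brevity):

import os

import requests

ORG = "my-org"  # hypothetical organization name
headers = {"Authorization": "token " + os.environ["GITHUB_TOKEN"]}

teams = requests.get(
    "https://api.github.com/orgs/{}/teams".format(ORG), headers=headers
).json()
for team in teams:
    # Each team exposes its permission level and the repositories it can access
    repos = requests.get(team["repositories_url"], headers=headers).json()
    print(team["name"], team["permission"], [r["full_name"] for r in repos])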

During my first foray into solving this, I cobbled together a command-line tool, using Ruby with the Graphviz library. I’ve liked Graphviz for years – it’s straightforward, as structured text gets rendered into a graph and then can be output to a file.

Very straightforward, has some limitations, but basically allows you to store graphs as text, and re-render them when changes happen. Basically, it’s like storing source code and not the binary output.

But there were some limitations, and I wanted this new project to be more than a command-line tool – something I could share with the world at large, without requiring any client-side installation of tools or dependencies.

So I spent a lot of time hemming and hawing, looking at web frameworks and trying to figure out some of them, and “how does this work?” came up a lot.

Finally, yesterday I set out to sit down and accomplish this task. I sat in a Starbucks in New York City, and had a Venti. I started banging away at about 11:30. I took a break for a refill and a snack around 1:30, and when I sat down again, I kept hacking away until 9:30pm, when I deemed completion.

The code was written, tested by me locally, pushed to GitHub, deployed to Heroku, DNS name wired up and all. As soon as I completed, I left Starbucks, and heaved a huge sigh – it was one hell of a mental high, I was in “the zone” and had been there for a long time.

You are more than welcome to browse the source code here and the finished project here. I call it the GitHub Organization Viewer, hence “GOVweb”.

I have a bunch of other ideas on how to make this better, how to model the data, which visual style to use, but I think for now, I’m going to leave it for a bit, and see what I think about it in a couple of months.

But all in all, this reinforced my opinion to never be afraid to try tackling a new idea, a new project, a new field you’re unfamiliar with – as long as you can read, comprehend and learn, the world is your oyster.

A picture is worth a (few) thousand bytes

(Context alert: Know Chef. If you don’t, it’s seriously worth looking into for any level of infrastructure management.)

TL;DR: I wrote a Knife plugin to visualize Chef Role dependencies. It’s here.

Recently, I needed to sort out a large amount of roles and their dependencies, in order to simplify the lives of everyone using them.

It wasn’t easy to determine that changing one would affect many others, since it had become common practice to embed roles within other roles’ run_list, resulting in a tree of cross-dependency hell.
A node’s run_list would typically contain a single role-specific item, embedding the lower-level dependencies.

A sample may look like this:

node[web1] => run_list = role[webserver] => run_list = role[base], recipe[apache2], ...
node[db1] =>  run_list = role[database]  => run_list = role[base], recipe[mongodb], ...

Many of these roles had a fair amount of code duplication, and most were setting the same base role, as well as any role-specific recipes. Others were referencing the same recipes, so figuring out what to refactor and where, without breaking everything else, was more than challenging.

The approach I wanted to implement was to have a very generalized base role, apply that to every instance, and then apply any node-specific roles on top.

After refactoring, a node’s run_list would typically look like:

node[web1] => run_list = role[base], role[webserver]
node[db1] =>  run_list = role[base], role[database]

A bit simpler, right?

This removes the embedded dependency on role[base], since the assumption is that every node will have role[base] applied to it, unless I don’t want to for some reason (some development environment, for instance).

Trying to refactor this was pretty tricky, so I wrote a visualizer to collect all the roles from a Chef repository’s role_path, parse them out, and create an image.

I’ve used Graphviz for a number of years now, and it’s pretty general-purpose when it comes to creating graphs of things (nodes), connecting them (edges), and rendering an output. So this was my go-to for this project.
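
The plugin itself is written in Ruby, but to illustrate the idea, here is roughly what the collect-parse-render step looks like, sketched in Python with the graphviz package – it assumes JSON role files in a roles/ directory, which is a simplification of what the actual plugin does:

import glob
import json

from graphviz import Digraph

graph = Digraph("role_spaghetti", format="png")

# Draw an edge from each role to every role/recipe in its run_list
for path in glob.glob("roles/*.json"):
    with open(path) as f:
        role = json.load(f)
    name = "role[{}]".format(role["name"])
    graph.node(name)
    for entry in role.get("run_list", []):
        graph.edge(name, entry)

graph.render("role_spaghetti")  # writes the DOT source plus role_spaghetti.png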

Selling you on the power of visualizing data is beyond the scope of this post (and probably the author), but suffice to say there’s industries built around putting data into visual format for a variety of reasons, such as relative comparison, trending, etc.
In fact some buddies of mine have built an awesome product that does just that – visualizes data and events over time. Check them out at Datadog. (I’ve written other stuff for their platform before, it’s totally awesome.)

In my case, I wanted the story told by the image to:

  1. Demonstrate the complexity of the connections between roles/recipes (aka spaghetti)
  2. Point out if I have any cyclic dependencies (it’s possible!)
  3. Let me focus on what to do next: untangle

Items 1 & 2 were pretty cool – my plugin spat out an increasingly complex graph, showing relationships that made sense for things to work, but also contained some items with 5-6 levels of inheritance that are easily muddled. I didn’t have any cyclic dependencies, so I created a sample one to see what it would look like. It looked like a circle.

Item 3 was harder, as this meant that human intervention needed to take place. It was almost like deciding on which area of a StarCraft map you want to go after first. There’s plenty of mining to do, but which will pay off fastest? (geeky references, are you surprised?)

I decided on some of the smaller clusterings, and made some progress, changing where certain role statements lived and the node <=> role assignment to refactor a lot out.

My process of writing a plugin developed pretty much like this:

  1. Have an idea of how I want to do this
  2. Write some code that when executed manually, does what I want
  3. Transform that code into a knife plugin, so it lives inside the Chef Ecosystem
  4. Package said plugin as RubyGem, to make distribution easy for others
  5. Test, test, test (more on this in a moment)
  6. Document (readme only for now)
  7. Add some features, rethink of how certain things are done, refactor.
  8. Test some more

Writing code, packaging and documentation are pretty standard practices (more or less), so I won’t go into those.

The more interesting part was figuring out how to plug into the Chef/Knife plugins architecture, and testing.

Thanks to Opscode, writing a plugin isn’t too hard, there’s a good wiki, and other plugins you can look at to get some ideas.

A couple of noteworthy items:

  1. Figuring out how to provide command-line arguments to OptionParser was not easy, since there was no real intuitive way to do it. I spent about 2 hours researching why it wasn’t doing what I wanted, and finally figured out that "--flag" and "--flag " behave completely differently.

  2. During my initial cut of the code, I used many statements to print output back to the user (puts "some message"). In the knife plugin world, one should use the ui.info or ui.error and the like, as this makes it much cleaner and consistent with other knife commands.

Testing:

Since this is a command-line application plugin, it made sense to use a framework that can handle inputs and outputs, as that’s my primary concern.
With a background in systems administration and engineering, software testing has never been on the top of my to-learn list, so when the opportunity arose to write tests for another project I wrote, I turned to Cucumber, and the CLI extension Aruba.

Say what you will about unit tests vs integration tests vs functional tests – I got going relatively quickly writing tests in quasi-English.
I won’t say that it’s easy, but it definitely made me think about how the plugin will be used, how users may input commands differently, and what they can expect to happen when they run it.

Cucumber/Aruba also allowed me to split my tests in a way that I can grok, such as all the CLI-related commands, flags, options exist in one test ‘feature’ file, whereas another feature file contains all the tests of reading the roles and graphing them in different formats.

Writing tests early on allowed me to continue to capture how I thought the plugin will be used, write that down in English, and think about it for awhile.
Some things changed after I had written them down, and even then, after I figured out the tests, I decided that the behavior didn’t match what I thought would be most common.

Refactoring the code, running tests in between to ensure that the behavior that I wanted remained consistent was very valuable. This isn’t news for any software engineers out there, but it might be useful to more system people to learn more about testing.

Another test I use is a style-checker called tailor – it measures up my code, and reports on things that may be malformed. This is the first test I run, as if the code is invalid (i.e. missing a end somewhere), it won’t pass this test.

Putting these into a test framework like Travis-CI is so very easy, especially since it’s a RubyGem, and I have set up environment variables to test against specific versions of Chef.
This provides the fast-feedback loop that tests my code against a matrix of Ruby & Chef versions.

So there you have it. A long explanation of why I wrote something. I had looked around, and there’s a knife crawl that is meant to walk a given role’s dependency tree and provide that, but that only worked for a single role, and wasn’t focused on visualizing.

So I wrote my own. Hope you like it, and happy to take pull requests that make sense, and bug reports for things that don’t.

You can find the gem on RubyGems.org – via gem install knife-role-spaghetti or on my GitHub account.

I’m very curious to know what other people’s role spaghetti looks like, so drop me a line, tweet, comment or such with your pictures!

Quick edit: A couple of examples, showing what this does.

Sample Roles

(full resolution here)

Running through the neato renderer (with the -N switch) produces this image:

Sample Roles Neato

(full resolution here)