Extending ECS Auto-scaling for under $2/month with Lambda

The Problem

Amazon Web Services (AWS) is pretty cool. You ought to know that by now. If you don’t, take a few hours and check out some tutorials and play around.

One of the many services AWS provides is the EC2 Container Service (ECS), where the scheduling and lifecycle management of running Docker containers is handled by the ECS control plane (probably magic cooked up in Seattle over coffee or in Dublin over a pint or seven).

You can read all about its launch here.

One missing feature from the ECS offering in comparison to other container schedulers was the concept of scheduling a service to be run on each host in a cluster, such as a logging or monitoring agent.
This feature allows clusters to grow or shrink and still have the correct services running on each node.

A published workaround was to have each node individually run an instance of the defined task on startup, which works pretty well.

The downside here is that if a task definition changes, ECS has no way of triggering an update to the running tasks – a normal service will stop and then start the task with the new definition, using its configured deployment logic to maintain some degree of uptime.
To achieve the update, one must terminate/replace the entire ECS Container Instance (the EC2 host) and if you’re using AutoScalingGroups, get a fresh node with the updated task.

Other Solutions

  • Docker Swarm calls this a global service, and will run one instance of the service on every node.
  • Mesos’ Marathon doesn’t support this yet either, and there is deep discussion on GitHub about how to implement it in their constraints syntax.
  • Kubernetes has a DaemonSet to run a pod on each node.
  • The recently-released ECS-focused Blox provides a daemon-scheduler to accomplish this, but brings along extra components to accomplish the scheduling.

Back to ECS

So imagine my excitement when the ECS team announced the release of their new Task Placement Strategies last week, offering a “One Task Per Host” strategy as part of the Service declaration.
This indeed is awesome and works as advertised, with no extra components, installs, schedulers, etc.

However! Currently each Service requires a “Desired Count” parameter of how many instances of this service you want to run in the cluster.

Given a cluster with 5 ECS Container Instance hosts, setting the Desired Count to 5 ensures that one task runs on each host, provided there are resources available (CPU, RAM, an available port).

If the cluster grows to 6 (autoscaling, manually adding, etc.), there’s nothing in the Service definition that will increase the desired count to 6, so this solution is actually worse than our previous mode of using user-data to run the task at startup.

One approach is to arbitrarily raise the Desired count to a very high number, such as 100 for this cluster, with the consideration that we are unlikely to grow the cluster to this size without realizing it.
The scheduler will periodically examine the cluster for placement, and handle any hosts missing the service.

The problem with this is that it’s not deterministic: CloudWatch metrics will report the unplaced tasks as Pending, and I have alarms to notify me when tasks aren’t placed in a cluster, since that can point to a resource allocation mismatch.

Enter The Players

To accomplish an automated service desired count, we must use some elements to “glue” a few of the systems together with our custom logic.

Here’s a sequence diagram of the conceptual flow between the components.

UML Sequence Flow

Every time there is a change in an ECS Cluster, CloudWatch Events will receive a payload.
Based on a rule we craft to select events classified as “Container Instance State Change”, CW Events will emit an event to the target of your choice, in our case, Lambda.

We could feasibly use a cron-like schedule to fire this every N minutes to inspect, evaluate, and remediate a semi-static set of services/cluster, but having a system that is reactive to change feels preferable to poll/test/repair.

A simple rule that captures all Container Instance changes:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Container Instance State Change"
  ]
}

You can restrict this to specific clusters by adding the cluster ARNs under the detail key, like so:

  "detail": {
    "clusterArn": [
      "arn:aws:ecs:us-east-1:123456789012:cluster/my-specific-cluster",
      "arn:aws:ecs:us-east-1:123456789012:cluster/another-cluster"
    ]
  }

If being throttled or cost is a concern here, you may wish to filter to a set of known clusters, but this reduces the reactiveness of the logic to new clusters being brought online.

The Actual Logic

The Lambda function receives the event, performs some basic validation checks to ensure it has enough details to proceed, and then makes a single API call to the ECS endpoint to find our specified service in the cluster that fired the change event.

If no such service is found, we terminate now, and move on.

If the cluster does indeed have this service defined, then we perform another API call to describe the count of registered container instances, and compare that with the value we already have from the service definition call.

If there’s a mismatch, we perform a final third API call to adjust the service definition’s desired task count.

All in all, at most three API calls, usually completing in under 300ms.
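For illustration, here is a minimal sketch of that flow in Python with boto3 – not the actual code from the linked repository. The service name (read from a hypothetical SERVICE_NAME environment variable) and the return messages are assumptions for the example.

# A rough sketch of the handler logic described above (assumptions noted).
import os

import boto3

ecs = boto3.client("ecs")
SERVICE_NAME = os.environ.get("SERVICE_NAME", "my-daemon-service")  # assumed name


def handler(event, context):
    # Basic validation: the CloudWatch Event must tell us which cluster changed.
    cluster_arn = event.get("detail", {}).get("clusterArn")
    if not cluster_arn:
        return "No clusterArn in event; nothing to do."

    # API call 1: is our service defined in this cluster at all?
    services = ecs.describe_services(
        cluster=cluster_arn, services=[SERVICE_NAME]
    ).get("services", [])
    if not services or services[0]["status"] != "ACTIVE":
        return "Service not found in cluster; moving on."

    desired = services[0]["desiredCount"]

    # API call 2: how many container instances are registered right now?
    cluster = ecs.describe_clusters(clusters=[cluster_arn])["clusters"][0]
    registered = cluster["registeredContainerInstancesCount"]

    if desired == registered:
        return "Desired count already matches instance count; no change."

    # API call 3: bring the desired count in line with the instance count.
    ecs.update_service(
        cluster=cluster_arn, service=SERVICE_NAME, desiredCount=registered
    )
    return "Updated desired count from {} to {}.".format(desired, registered)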

In my environment, I want this logic to apply to every cluster in my account, since the function inspects each cluster to see whether it has the service defined before acting on it.
By my ballpark figures, with a set of 10 active clusters, the cost of running this logic should be under $2/month – yes, two dollars a month to ensure your clusters run the correct number of tasks for a given service.
Do your own estimation with the Lambda Pricing Calculator.
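As a rough illustration of how that estimate comes together, here is a back-of-the-envelope sketch. The per-request and per-GB-second rates and the invocation counts below are assumptions for the example, not authoritative figures (check the pricing page for current numbers), but even generous inputs land comfortably under the two-dollar ceiling.

# Back-of-the-envelope Lambda cost estimate; all inputs are assumptions.
price_per_request = 0.20 / 1e6        # assumed $ per invocation
price_per_gb_second = 0.00001667      # assumed $ per GB-second
invocations = 10 * 2000               # e.g. 10 clusters, ~2,000 state changes each per month
duration_s = 0.3                      # ~300ms per run, as measured above
memory_gb = 128 / 1024.0              # smallest memory setting

request_cost = invocations * price_per_request
compute_cost = invocations * duration_s * memory_gb * price_per_gb_second
print("~${:.2f}/month before the free tier".format(request_cost + compute_cost))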

Conclusions

The code can be found on GitHub, and was developed with a test-everything philosophy, where I spent a large amount of time learning how to write the code and tests elegantly.
Writing out all of the tests and sequences allowed me to find multiple points of refactoring and increased efficiency from my first implementation, leading to a much cleaner solution.
Taking on a project like this is a great way to increase one’s own technical prowess and build the ability to reason about other problems.

While I strongly believe that this feature should be part of the ECS platform and not require any client-side intervention, the ability to take the current offerings and extend them via mechanisms such as Events, Lambda and API calls further demonstrates the flexibility and extensibility of the AWS ecosystem.
The feature launched just over a week ago, and I’ve been able to put together an acceptable solution on my own, using the documentation, tooling, and infrastructure while minimizing costs and making my system more reactive to change.

I look forward to what else the ECS, Lambda and CloudWatch Events team cook up in the future!

There’s a New Player in Town, named Habitat

You may have heard some buzz around the launch of Chef‘s new open source project Habitat (still in beta), designed to change a bit of how we think about building and delivering software applications in the modern age.

There’s a lot of press, a video announcement, and even a Food Fight Show where we got to chat with some of the brains behind the framework and get into some of the nitty-gritty details.

In the vibrant Slack channel where a lot of the fast-paced discussion happens with a bunch of the core habitat developers, a community member brought up a pain point, as many do.
They were trying to build a Python application, and had to resort to fiddling pretty hard with either the PYTHONPATH variable or with sys.path after installing dependencies.
One even used Virtualenv inside the isolated environment.

I had been working on an LLVM compiler package, and since LLVM is notoriously slow to compile on my laptop, I used the waiting time to get a Python web application working.

My setup is OSX 10.11.5, with Docker (native) 1.12.0-rc2 (almost out of beta!).

I decided to use the Flask web framework for a Hello World, as it would prove a few pieces:

  • Using Python to install dependencies using pip
  • Adding “local” code into a package
  • Importing the Python package in the app code
  • Executing the custom binary that the Flask package installs

Key element: it needed to be as simple as possible, but no simpler.

On my main machine, I wrote my application.
It listens on port 5000, and responds with a simple phrase.
Yay, I wrote a website.
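A stand-in sketch of that app (the real code lives in the src/ directory of the GitHub repo linked below) might look like this:

from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    return "Hello, World!\n"


if __name__ == "__main__":
    # Bind to all interfaces so the exported Docker container can expose port 5000.
    app.run(host="0.0.0.0", port=5000)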

Then I set about packaging it into a deliverable where, in habitat’s nomenclature, it becomes a self-contained package, which can then be run via the habitat supervisor.

This all starts with getting the habitat executable, conveniently named hab.
With hab a recent addition to the Homebrew Casks family, installing habitat was as simple as:

$ brew cask install hab

Habitat version 0.7.0 was in use during the authoring of this article.

I sat down and wrote a plan.sh file, which describes how to put the pieces together.

There are a bunch of phases in the build cycle that are fully customizable, or “stub-able” if you don’t want them to take the default action.
Some details were garnered from here, despite my package not being a binary.

Once I got my package built, it was a matter of figuring out how to run it. One of the default modes is to export the entire thing as a Docker image, so I set about doing that to get a feel for the iterative development cycle of making the application work as configured within the habitat universe.

(This step usually isn’t the best one for regular application development, but it is good for figuring out what needs to be configured and how.)

# In first OSX shell
$ hab studio enter
[1][default:/src:0]# build
...
   python-hello: Build time: 0m36s
[2][default:/src:0]# hab pkg export docker miketheman/python-hello
...
Successfully built 2d2740a182fb
[3][default:/src:0]#

# In another OSX shell:
$ docker run -it -p 5000:5000 -p 9631:9631 miketheman/python-hello
hab-sup(MN): Starting miketheman/python-hello
hab-sup(GS): Supervisor 172.17.0.3: cb719c1e-0cac-432a-8d86-afb676c3cf7f
hab-sup(GS): Census python-hello.default: 19b7533a-66ba-4c6f-b6b7-c011abd7dbe1
hab-sup(GS): Starting inbound gossip listener
hab-sup(GS): Starting outbound gossip distributor
hab-sup(GS): Starting gossip failure detector
hab-sup(CN): Starting census health adjuster
python-hello(SV): Starting
python-hello(O):  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

# In a third shell, or use a browser:
$ curl http://localhost:5000
Hello, World!

The code for this example can be found in this GitHub repo.
See the plan.sh and hooks/ for Habitat-related code.
The src/ directory is the actual Python app.

At this point, I declared success.

There are a lot of other pieces to the puzzle that I haven’t explored yet, but getting this part running was the first step.
Items like interacting with the supervisor, director, healthchecks, topologies – these have some basic docs, but there isn’t a bevy of examples or use cases yet to lean on for inspiration.

During this process I uncovered a couple of bugs, submitted some feedback, and the team is very receptive so far.
There are still a bunch of rough edges to be polished down, many around the documentation, the use cases, how the pieces fit together, and what benefit it all drives.

There appear to be some hooks for using Chef Delivery as well – I haven’t seen those yet, as I don’t use Delivery.
I will likely try looking at making a larger strawman deployment to test these pieces another time.

I am looking forward to seeing how this space evolves, and what place habitat will take in the ever-growing, ever-evolving software development life-cycle, as well as how the community approaches these concepts and terminology.

A picture is worth a (few) thousand bytes

(Context alert: Know Chef. If you don’t, it’s seriously worth looking into for any level of infrastructure management.)

TL;DR: I wrote a Knife plugin to visualize Chef Role dependencies. It’s here.

Recently, I needed to sort out a large amount of roles and their dependencies, in order to simplify the lives of everyone using them.

It wasn’t easy to determine whether changing one would affect many others, since it had become common practice to embed roles within other roles’ run_list, resulting in a tree of cross-dependency hell.
A node’s run_list would typically contain a single role-specific item, embedding the lower-level dependencies.

A sample may look like this:

node[web1] => run_list = role[webserver] => run_list = role[base], recipe[apache2], ...
node[db1] =>  run_list = role[database]  => run_list = role[base], recipe[mongodb], ...

Many of these roles had a fair amount of code duplication, and most were setting the same base role, as well as any role-specific recipes. Others were referencing the same recipes, so figuring out what to refactor and where, without breaking everything else, was more than challenging.

The approach I wanted to implement was to have a very generalized base role, apply that to every instance, and then add whatever specific roles should also be applied to a given node.

After refactoring, a node’s run_list would typically look like:

node[web1] => run_list = role[base], role[webserver]
node[db1] =>  run_list = role[base], role[database]

A bit simpler, right?

This removes the embedded dependency on role[base], since the assumption is that every node will have role[base] applied to it, unless I don’t want that for some reason (a development environment, for instance).

Trying to refactor this was pretty tricky, so I wrote a visualizer to collect all the roles from a Chef repository’s role_path, parse them out, and create an image.

I’ve used Graphviz for a number of years now, and it’s pretty general-purpose when it comes to creating graphs of things (nodes), connecting them (edges), and rendering an output. So this was my go-to for this project.
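The plugin itself is a Ruby knife command, but the core idea is small enough to sketch. Here is a rough, hypothetical Python version of the concept, assuming JSON role files for simplicity, that emits Graphviz DOT text; pipe the output through dot (or neato) to render an image.

# Hypothetical sketch: read role files from a Chef repo and emit a DOT graph
# of run_list dependencies. Assumes JSON roles under roles/; not the plugin's code.
import glob
import json
import os


def role_edges(role_path="roles"):
    for path in glob.glob(os.path.join(role_path, "*.json")):
        with open(path) as f:
            role = json.load(f)
        name = role.get("name", os.path.splitext(os.path.basename(path))[0])
        for entry in role.get("run_list", []):
            yield name, entry  # e.g. ("webserver", "role[base]")


if __name__ == "__main__":
    print("digraph roles {")
    for source, target in role_edges():
        print('  "role[{}]" -> "{}";'.format(source, target))
    print("}")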

Selling you on the power of visualizing data is beyond the scope of this post (and probably the author), but suffice it to say there are industries built around putting data into visual formats for a variety of reasons, such as relative comparison, trending, etc.
In fact some buddies of mine have built an awesome product that does just that – visualizes data and events over time. Check them out at Datadog. (I’ve written other stuff for their platform before, it’s totally awesome.)

In my case, I wanted the story told by the image to:

  1. Demonstrate the complexity of the connections between roles/recipes (aka spaghetti)
  2. Point out if I have any cyclic dependencies (it’s possible!)
  3. Let me focus on what to do next: untangle

Items 1 & 2 were pretty cool – my plugin spat out an increasingly complex graph, showing relationships that made sense for things to work, but also contained some items with 5-6 levels of inheritance that are easily muddled. I didn’t have any cyclic dependencies, so I created a sample one to see what it would look like. It looked like a circle.

Item 3 was harder, as this meant that human intervention needed to take place. It was almost like deciding on which area of a StarCraft map you want to go after first. There’s plenty of mining to do, but which will pay off fastest? (geeky references, are you surprised?)

I decided on some of the smaller clusterings, and made some progress, changing where certain role statements lived and the node <=> role assignment to refactor a lot out.

My process of writing a plugin developed pretty much like this:

  1. Have an idea of how I want to do this
  2. Write some code that when executed manually, does what I want
  3. Transform that code into a knife plugin, so it lives inside the Chef Ecosystem
  4. Package said plugin as RubyGem, to make distribution easy for others
  5. Test, test, test (more on this in a moment)
  6. Document (readme only for now)
  7. Add some features, rethink how certain things are done, refactor.
  8. Test some more

Writing code, packaging and documentation are pretty standard practices (more or less), so I won’t go into those.

The more interesting part was figuring out how to plug into the Chef/Knife plugins architecture, and testing.

Thanks to Opscode, writing a plugin isn’t too hard: there’s a good wiki, and other plugins you can look at to get some ideas.

A couple of noteworthy items:

  1. Figuring out how to provide command-line arguments to OptionParser was not easy, since there was no real intuitive way to do it. I spent about 2 hours researching why it wasn’t doing what I wanted, and finally figured out that "--flag" and "--flag " behave completely differently.

  2. During my initial cut of the code, I used many statements to print output back to the user (puts "some message"). In the knife plugin world, one should use the ui.info or ui.error and the like, as this makes it much cleaner and consistent with other knife commands.

Testing:

Since this is a command-line application plugin, it made sense to use a framework that can handle inputs and outputs, as that’s my primary concern.
With a background in systems administration and engineering, software testing has never been on the top of my to-learn list, so when the opportunity arose to write tests for another project I wrote, I turned to Cucumber, and the CLI extension Aruba.

Say what you will about unit tests vs integration tests vs functional tests – I got going relatively quickly writing tests in quasi-English.
I won’t say that it’s easy, but it definitely made me think about how the plugin will be used, how users may input commands differently, and what they can expect to happen when they run it.

Cucumber/Aruba also allowed me to split my tests in a way that I can grok, such as all the CLI-related commands, flags, options exist in one test ‘feature’ file, whereas another feature file contains all the tests of reading the roles and graphing them in different formats.

Writing tests early on allowed me to capture how I thought the plugin would be used, write that down in English, and think about it for a while.
Some things changed after I had written them down, and even then, after I figured out the tests, I decided that the behavior didn’t match what I thought would be most common.

Refactoring the code and running tests in between to ensure the behavior I wanted remained consistent was very valuable. This isn’t news for any software engineers out there, but it might encourage more systems people to learn more about testing.

Another test I use is a style-checker called tailor – it measures up my code and reports on things that may be malformed. This is the first test I run, since if the code is invalid (i.e. missing an end somewhere), it won’t pass this test.

Plugging these into a CI service like Travis-CI is very easy, especially since it’s a RubyGem, and I have set up environment variables to test against specific versions of Chef.
This provides the fast-feedback loop that tests my code against a matrix of Ruby & Chef versions.

So there you have it. A long explanation of why I wrote something. I had looked around, and there’s a knife crawl that is meant to walk a given role’s dependency tree and provide that, but that only worked for a single role, and wasn’t focused on visualizing.

So I wrote my own. Hope you like it, and happy to take pull requests that make sense, and bug reports for things that don’t.

You can find the gem on RubyGems.org – via gem install knife-role-spaghetti or on my GitHub account.

I’m very curious to know what other people’s role spaghetti looks like, so drop me a line, tweet, comment or such with your pictures!

Quick edit: A couple of examples, showing what this does.

Sample Roles

(full resolution here)

Running through the neato renderer (with the -N switch) produces this image:

Sample Roles Neato

(full resolution here)

Ask your systems: “What’s going on?”

This is a sysadmin/devops-style post.
Disclaimers are that I work with these tools and people, and like what they do.

In some amount of our professional lives, we are tasked with bringing order to chaos, keeping systems running, and helping the businesses we work for continue functioning.

In our modern days of large-scale computing, web technology growth explosions, multiple datacenter deployments, cloud providers and other virtualization technologies, the manpower needed to handle the vast amount of technologies, services and systems seems to have a pretty high overhead cost associated with it. “You’ve got X amount of servers? Let’s hire Y amount of sysadmins!”

A lot of tech startups start out with some of the developers performing a lot of the systems tasks, and since this isn’t always their core expertise, decisions are made, scripts are written, and “it works”.  When the team/systems grow large enough to need their own handler, in walks a system admin-style person, and may keel over, due to the state of affairs.

Yes, there are many tech companies where this is not the case, and I commend them for keeping their systems lean, mean and clean.

A lot of companies have figured out that in order to make the X:Y ratio work well, automation is required.  Here’s an article that covers some numbers from earlier this year.  I find the stated ratio of 50 servers to 1 sysadmin pretty low compared to my view of how things can be, especially given the tools that we have available to us.

One of the popular systems configuration tools I’ve been using heavily is Chef, from Opscode. They provide a hosted solution, as well as an open-source version of their software, for anyone to use.  Getting up and running with some basics is really fast, and there’s a ton of information available, as well as a really responsive community (from mailing lists, bug tracker site and IRC channel).  Once you’re working with Chef, you may wonder how you ever got anything done before you had it.  It’s really treating a large part of your infrastructure as code – something readable, executable, and repeatable.

But this isn’t about getting started with Chef. It’s about “what’s next”.

In any decent starting-out tech company, the number of servers used will typically range from 2-3 all the way to 200 – or even more.  If you’ve gone all the way to 200 without something like Chef or Puppet, I commend your efforts, and feel somewhat sorry for you.  Once you’re automating your systems creation, deployment and change, then you typically want some feedback on what’s going on. Did what I asked this system to do succeed, or did it fail?

Enter Datadog.

Datadog attempts to bring many sources of information together, to help whomever it is that is supposed to be looking at the systems to make more sense of the situation, from collecting metrics from systems, events from services and other sources, to allowing a timeline and newsfeed that is very human-friendly.

Having all the data at your disposal makes it easier to find patterns and correlations between events, systems and behaviors – helping to minimize the “what just happened?” question.

The Chef model for managing systems is a centralized server (either the open source version in your environment or the hosted service from Opscode), which tells a server what it is meant to “be”.  Not what it is meant to “do now”, but the final state it should be in.  They call this model “idempotent” – meaning that no matter how many times you execute the same code on the same server, the behavior should end up the same every time.  But it doesn’t follow up very much on the results of the actions.

An analogy could be that every morning, before your kid leaves the house, your [wife|mother|husband|guardian|pet dragon] tells them “You should wear a coat today.” and then goes on their merry way, not checking whether they wore a coat or not. The next morning, they will get the same comment, and so on and so forth.

So how do we figure out what happened? Did the kid wear a coat or not? I suppose I could check by asking the kid and getting the answer, but what if there are 200 of us? Do I have time to ask every kid whether or not they ended up wearing a coat? I’m going to be spending a lot of time dealing with this simple problem, I can tell you now.

Chef has built-in functionality to report on what Chef did – after it has received its instructions from the centralized server. It’s called the “Exception and Report Handlers” – and this is how I tie these two technologies together.

I adapted some code started by Adam Jacob @Opscode, and extended it further into a complete RubyGem with modifications for content, functionality and some rigorous testing.

Once the gem was ready, I had to distribute it to my servers and have it execute every time Chef runs on each server. So, based on the chef_handler cookbook, I added a new recipe to the datadog cookbook – dd-handler.

What this does is add the necessary components to a Chef execution; when placed at the beginning of a “run”, it will capture all the events and report back on the important ones to the Datadog newsfeed.  It will also push some metrics, like how long the Chef execution took, how many resources were updated, etc.

The process for getting this done was really quite simple, once you boil down all the reading, the hows and the whys – especially if you use git to version control your chef-repo.  The `knife cookbook site install` command is a great method for keeping your git repo “safe” for future releases, preserving your changes to the cookbook and allowing new code to be merged in automatically. Read more here.

THE MOST IMPORTANT STUFF:

Here’s pretty much the process I used (under chef/knife version 0.10.x):

$ cd chef-repo
$ knife cookbook site install datadog
$ vi cookbooks/datadog/attributes/default.rb

At this point, I head over to Datadog, hit the “Setup” page, and grab my organization’s API Key, as well as create a new Application Key named “chef-handler” and copy the hash that is created.

I place these two values into the `attributes/default.rb` file, save and close.

$ knife cookbook upload datadog

This places the cookbook on my Chef server, and is now ready to be referenced by a node or role. I use roles, as it’s much more manageable across multiple nodes.

I update the `common-node` role we have to include “recipe[datadog::dd-handler]” as one of the first recipes to execute in the run list.

The common-node role applies to all of our systems, and since they all run chef, I want them all to report on their progress.

And then let it run.

END MOST IMPORTANT STUFF

Since our chef-client runs on a 30 minute interval, and not all execute at the same time, this makes for some interesting graphs at the more recent time slices – not all the data comes in at the same time.  That’s something to get used to.

Here’s an image of a system’s dashboard with only the Chef metrics:

Single Instance dashboard
It displays a 24-hour period, and shows that this particular instance had a low variance in its execution time, and that not much was updated during this time (a good thing, since it is consistent).

On a test machine I tossed together, I created a failure, and here’s how it gets reported back to the newsfeed:

 

Testing a failure
As you can see, the stacktrace attempts to provide me with the information I need to diagnose and repair the issue. Once I fixed it and apache could start, the event was logged in the “Low Priority” section of the feed (since successes are expected, and failures are aberrant behavior):

Test passes

All this is well and wonderful, but what about a bunch of systems? Well, I grabbed a couple snaps off the production environment for you!

These are aggregates I created with the graphing language (I had never really read it before today!)

Production aggregate metrics

Being able to see the execution patterns helped: I noticed a bump closer to the left side of the “Resource Updated” graph, investigated, and found that someone had deployed a new rsyslog package – so there was a temporary increase in deployed resources, and now there are slightly more resources to manage overall.

The purple bump seen in the “Execution Time” graph led me to investigate, and I found a timeout in that system’s call to an “apt-get update” request – probably the remote repo was unavailable for a minute. Having the data available to make that correlation made investigating this problem fast and simple – and more importantly, since it has been succeeding ever since, there’s no cause for alarm.

So now I have these two technologies – Chef to tell the kids (the servers) to wear coats, and Datadog to tell the parents (me) if the kids wore the coats or not, and why.

Really, just wear a coat. It’s cold out there.

———–

Tested on:

  • CentOS 5.7 (x64), Ruby 1.9.2 v180, Chef 0.10.4
  • Ubuntu 10.04 (x64), Ruby 1.8.7 v352, Chef 0.9.18

Thanks, but no thanks, Verizon!

I guess Verizon thinks they know what’s best for me.

I recently got a nice little Verizon USB 760 Modem from work, not a new concept for me, just something to keep in touch while on the go.

Unfortunately, the Verizon Access Manager software most decidedly does NOT install correctly on my Mac. Instead, it tells me I’m not an administrator. Feels a lot like trying to install software on Windows Vista.

See the failure, how pretty it is...
