Pour One Out for Bourbon 🥃

TL;DR: Thank You, thoughtbot

This post is a big ol’ Thank You to the team at thoughtbot (a software consultancy company) for their generosity in creating and managing open source software projects over the lifetime of a project I work on a fair amount these days – PyPI.org, the largest free repository of Python packages.

Open Source Software (OSS) can be a force-multiplier for any software development project, as it reduces the need for the developer to write all of the code necessary to finish a task/project/website/etc. The motivation to share software as free and open for others to use can differ wildly from one person to the next, but the fact that anyone does is much appreciated, thanks again!

History Class

Back in 2013, when Macklemore & Ryan Lewis’ Thrift Shop was all the rage, was probably when I became aware of thoughtbot, as I was building up Datadog’s operations team and practices. We used Opscode Chef to manage the ever-growing complexity of our fleet of cloud instances. The earliest reference I can still dig up is my own blog post, The Importance of Dependency Testing (generally still good advice!), and the super helpful tools and blog posts they wrote to help folks understand not only the code they shared, but how they used it, and why.
I incorporated one of their tools into one of my own tools’ testing workflows, to help users of my tool have a better experience. It worked as advertised, and I used their tool again in early 2014 to help do the same for more “official” open source software.

While many of us have moved on from Chef software, I’ll never forget the #ChefFriends we made along the way. 🥰


Narrator: Mike wasn’t part of this next bit, the details might be a bit wrong, sorry!

In 2015, the Python Software Foundation (PSF) was engaged in rebuilding PyPI to meet new demands, launching in 2018. Beyond the new underlying functionality, Nicole Harris led the user interface redesign. You can still see her mock prototype and its inner workings – most of which remains active in PyPI today.

A couple of OSS libraries were copied into the codebase back then – bourbon and neat – two projects that made writing Sass (SCSS) style-related code a little easier – both of which are now deprecated. PyPI briefly included the third project, bitters, but removed it prior to production launch as it wasn’t needed.
These tools were especially important during the second browser war, giving developers simpler methods to create layouts and mix in functionality that smoothed over the lack of feature parity across browsers, leading to a better end-user experience.

PyPI’s new design needed a grid layout, which wasn’t a standard yet, and neat provided a clean way to make a grid layout in CSS work properly for browsers of that era (Chrome 10+, Firefox 29+, Internet Explorer 9+, Safari 5.1+, et al). If I’m reading the source correctly, neat had functions to create visual tables and place items correctly. bourbon provided behaviors (mixins and selectors) like @clearfix and @ellipsis – utilities that prevent the need to repeat a given set of statements. Both libraries let PyPI work well across browsers without a lot of custom code.

Back in 2015 this amounted to ~7,000 lines of code, and thanks to thoughtbot sharing libraries they developed internally to help their clients, PyPI got a good user interface, launching to the public in 2018.

Getting into the game

I started contributing to PyPI’s codebase in 2021, with typo fixes, documentation updates, and development environment improvements – common contributions I’ve found are usually welcome in OSS projects, since the current maintainers are often busy with other things and can easily miss when an instruction is wrong. I am still occasionally tickled by the fact that my path to working at the Python Software Foundation started with writing JavaScript and CSS.

Some of my early work was dropping support for Internet Explorer 11, and as I learned more about the codebase, I spotted the bourbon/neat inclusions. In early 2023 I mentioned that these were no longer maintained as of late 2019. PyPI was using bourbon 4.2.5 and neat 1.7.2 at that time, both released in early 2015.

Over time, neat 2.0 changed its external-facing API, and bourbon 5.0 did the same for some functions. The way bourbon and neat were included in the repository – the source was copied in rather than installed via a package, commonly known as “vendoring” – made updates a little harder to reason about, especially since minor changes had been added to the vendored code over time. (No criticism of the vendoring method, it worked with the tools and approaches of that era!)

PyPI’s versions were effectively “stuck” in 2015, and what was there worked well, but didn’t take advantage of the evolving browser standards that changed the need for these libraries over time.

Do the thing

I recently decided to spend a weekend day seeing how much I could remove or replace to keep PyPI’s style code modern, working, and maintainable.

I learned that the only functionality PyPI still used from neat was @span-columns(), and that could be replaced with a relatively small custom mixin using native CSS Grid (specifically grid-template-columns, grid-column, and column-gap), all widely available since 2018. Replacing ~1,100 lines of neat with this SCSS was pretty straightforward:

$grid-columns: 12;

@mixin grid-container {
  display: grid;
  grid-template-columns: repeat($grid-columns, 1fr);
  column-gap: $half-spacing-unit;
}
The SCSS for the replacement mixin

The resulting compiled CSS produced effectively the same visual behavior, but now uses native CSS and is recognized by browsers as grid syntax, which opens up the ability to move to more responsive layouts in the future if needed.

On the bourbon side, thoughtbot had produced a Migrating from v4 to v5 guide in 2018, which included helpful suggestions on replacements or adaptations. In the intervening years, bourbon was updated to version 7.3.0, and then finally deprecated, and thoughtbot again shared a guide for replacing it with alternative approaches earlier this year.

Using this guide, and now more familiar with the SCSS code in use, I was able to update bourbon to its final version, and then fully replace the remaining functionality we used with adequate alternatives, removing another ~4,400 lines of code from the codebase.

Finishing the job

Testing website visual layouts is a whole discipline on its own, and PyPI doesn’t have a test suite that covers visual layouts, but we have a pretty robust local development environment. I used that to validate that each change I made had the desired effect as I was making them. Thankfully, most of the usages were hyper-specific to a given page, so there were only a handful of places to confirm manually.

Once I was done with all my changes, I captured the compiled distribution CSS file and ran a semantic difference checker against the file from before the changes and after. The open source tool I used was difftastic, which was able to point out the differences between the two compiled, minified CSS files, and helped me confirm that the only changes were the ones I expected. Thanks Wilfred!

Once done, I opened a pull request for review, and it made it out to production with no issues. 🥳

What have we learned, Mike?

Well, first and foremost, open source is continuing to grow, so big thanks to any and all who share their code with the world and take the time to maintain it. Truly, we are stronger together.

Let’s give a round of applause to the team at thoughtbot. While I generally write less Ruby/JavaScript/CSS these days, and their code is no longer inside PyPI, I know that when I come across these folks’ tools or libraries, it’s likely to be a cut above others on the quality spectrum. Their approach is what I’d consider exemplary in how to help users navigate changes to open source from “cradle to grave” – providing essential, useful software for others, showing them how to use it, update it, and finally retire it from use. PyPI benefited from these projects for ~9 years, until they were no longer needed, so we salute them! 🫡

And finally, consider this a call to action to anyone out there to get involved in open source. For some ideas and approaches, check out the chat I had with my fellow AWS Hero Chris Williams last year on vBrownBag:

Thanks for reading, now go out and get involved!

Reduce AWS Lambda Latencies with Keep-Alive in Python

It starts, as many stories do, with a question. On September 10th, AWS Serverless Hero Luc van Donkersgoed shared his observations on the relationship of reduced latency with increased request rate when using AWS Lambda. This is always an interesting conversation, and sure enough other AWS Heroes like myself are curious about some of the outlier behaviors, and what exactly goes into each request. AWS Data Hero Alex DeBrie and AWS Container Hero Vlad Ionescu both ask excellent questions about the setup and the behaviors, leading Luc to share what he’s seeing with regards to DNS lookups that don’t make sense to him.

After asking a couple of more questions of my own, I rolled up my sleeves and dug into the what, how, and why.

getting ready to read things and hit them with sticks

I dive into all parts of the stack in use to try and understand why Luc’s code is seeing DNS lookups.
For example, if your function needs to call AWS S3 or a Twilio API, we usually provide the domain name, and have the code or library perform a request to a Domain Name System (DNS) server to return the current IP address, and then communicate using the IP address. This is a network call and can be expensive (in milliseconds) if it’s performed more frequently than the DNS response’s Time To Live (TTL) – kind of like an expiration date. The DNS lookup adds some more latency to your overall call, which is why many systems will cache DNS responses until the TTL is expired, and then make a new call. If you perform DNS lookups when not needed, that’s adding latency unnecessarily. Read the tweet thread for more!
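To make that concrete, here’s a rough way to see the cost of a lookup from Python (a quick sketch – the hostname is just an illustrative AWS endpoint, and your OS resolver cache will affect the numbers):

import socket
import time

start = time.perf_counter()
# resolve a service hostname the way an HTTP client would before connecting
socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"DNS lookup took {elapsed_ms:.1f} ms")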

I arrive at two possible solutions:

  1. If the Python code calls more than 10 AWS service endpoints, it will trigger a DNS lookup, as urllib3’s PoolManager will only maintain 10 connections (set by botocore defaults) and will need to recycle if exceeded – see the sketch just after this list.
  2. Since we’re unlikely to be hitting the limit of 10, something else is at play.
    I found that the default behavior of boto3 is to not use Keep Alive, thus explaining why the occasional connection is reset, triggering a DNS lookup. (Read the tweet thread for the full discovery.)
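That connection ceiling in the first point comes from botocore’s default max_pool_connections of 10; if a function genuinely fans out to many endpoints, it can be raised through the same Config object (a sketch – the pool size of 50 is an arbitrary example):

import boto3
from botocore.config import Config

# botocore keeps a pool of 10 connections per client by default;
# a larger pool avoids recycling connections (and re-resolving DNS) under fan-out
client = boto3.client("s3", config=Config(max_pool_connections=50))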

Using Keep-Alive is nothing new, and was covered quite well by AWS Serverless Hero Yan Cui back in 2019 for Node. It’s even in the official AWS Documentation, citing Yan’s article for the proposed update. Thanks Yan!


There’s precious little literature on using Keep Alive for Python Lambdas that I could find, leading to issues like Luc’s and reports like this one, so I decided to dig a little further. Knowing now that Keep Alive is off by default for users of the popular boto3 package to interact with AWS services, I wanted to explore what that looks like in practice.

I decided to pattern an app after Yan’s example – a function that receives an event body and persists it to DynamoDB. All in all, not too complex an operation – we perform a single DNS lookup for the DynamoDB service endpoint, and then use the response IP address to connect over HTTP to put an object into the DynamoDB table.
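Roughly, the handler boils down to something like this (a minimal sketch – the table name and item shape are placeholders, not the exact code I deployed):

import os
import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = os.environ.get("TABLE_NAME", "keep-alive-demo")  # placeholder table name

def handler(event, context):
    # persist the incoming event body as a single item
    dynamodb.put_item(
        TableName=TABLE_NAME,
        Item={
            "id": {"S": context.aws_request_id},
            "body": {"S": str(event)},
        },
    )
    return {"statusCode": 200}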

After re-writing the same function in Python, I was able to test the same kind of behavior that Yan did, running a call to the function once per second, isolating any concurrency concerns, replicating Luc’s test. This should have the benefit of reusing the same Lambda context (no cold starts) and seeing that the latencies range from 7 to 20 milliseconds for the same operation:

filtered log view showing only the latency for put_item calls to DynamoDB for 30 seconds

So far, so good – pretty much the same. The overall values are lower than Yan’s original experiment, which I attribute to the entire Lambda ecosystem improving, but we can see there’s variance and we often enter double-digit latencies, when we know that the DynamoDB operation is likely to only take 6-7 milliseconds.

left side shows spiky responses; right side shows most responses are fast, with some slower outliers

As Yan showed in his approach adapted from Matt Levine’s talk snippets, he was able to reconstruct the AWS Config by rebuilding the lowest-level HTTP agent that the library relies on to make the calls, and thereby set the behavior for Keep Alive. This has since been obsoleted by the AWS Node.JS SDK adding an environment variable to enable the keep alive behavior, which is awesome! But what about Python? 🐍

In the recent release of botocore 1.27.84 we can modify the AWS Config passed into the client constructor:

# before:
import boto3
client = boto3.client("dynamodb")

# after:
import boto3
from botocore.config import Config
client = boto3.client("dynamodb", config=Config(tcp_keepalive=True))

With the new configuration in place, if you try this on the AWS python3.9 execution runtime, you’ll get this error:
[ERROR] TypeError: Got unexpected keyword argument 'tcp_keepalive'

While the AWS Python runtime includes versions of boto3 and botocore, they do not yet support the new tcp_keepalive parameter – the runtime currently ships:
– boto3 1.20.32
– botocore 1.23.32
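If you want to confirm what your own runtime ships, a throwaway function that reports the bundled versions does the trick (a quick sketch):

import boto3
import botocore

def handler(event, context):
    # report the SDK versions bundled with the Lambda runtime
    return {"boto3": boto3.__version__, "botocore": botocore.__version__}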

So we have to solve another way.

The documentation tells us that we can configure this via a config file in ~/.aws/config, added in version 1.9.17 back in October 2018 – presumably when all the Keep Alive conversations were fresh in folks’ minds.

However, since the Lambda runtime environment disallows writing to that path, we can’t write the config file easily. We might be able to create a custom Docker runtime and place a file in the path, but that’s a bit harder, and we lose some of the benefits of using the AWS prebuilt runtime, like startup latency – which, for a latency-oriented article, seems like the wrong trade-off 😁.

Using the Serverless Framework CLI with the serverless-python-requirements plugin (what I’m currently using), or AWS SAM, you can package updated versions of boto3 and botocore, and deploying the updated application allows us to leverage the new setting in a Lambda environment. You may already be using one of these approaches for a more evolved application.
Hopefully 🤞 the Lambda Runtime will be updated to include these versions in the near future, so we don’t have to package these dependencies to get this specific feature.

With the updated packages, we can pass the custom Config with tcp_keepalive enabled (as shown above), and observe more constant performance for the same style of test:

left: much smoother!! right: narrower distribution of values, max 8.50 ms

There’s an open request for the config value to be available via environment variable – check it out and give it a 👍 to add your desire and subscribe via GitHub notifications.

Enjoy lower, more predictable latencies with Keep Alive!

Check out the example code here: https://github.com/miketheman/boto3-keep-alive


Postscript: If you’re interested in pinpointing calls for performance, I recommend checking out Datadog’s APM and associated ddtrace module to see the specifics of every call to AWS endpoints and associated latencies, as well as other parts of your application stack. There’s a slew of other vendors that can help surface these metrics.

Container-to-Container Communication

Question ❓

In a containerized world, is there a material difference between communicating over local network TCP vs local Unix domain sockets?

Given an application with more than a single container, where the containers need to talk to each other, is there an observable difference in latency/throughput when using one inter-component communication method over another, from an end user’s perspective?


Background 🌆

There’s this excellent write-up on the comparison back in 2005, and many things have changed since then, especially around the optimizations in the kernel and networking stack, along with the container runtime that is usually abstracted away from the end user’s concerns. Redis benchmarks from a few years ago also point out significant improvements using Unix sockets when the server and benchmark are co-located.

There are other studies out there that have their own performance comparisons and produce images like these – and every example is going to have its own set of controls and caveats.

I wanted to use a common-ish scenario: a web service running on cloud infrastructure I don’t own.

Components 🧩

For the experiment, I chose this set of components:

  • nginx (web server) – terminate SSL, proxy requests to upstream web server
  • gunicorn (http server) – speaks HTTP and WSGI protocol, runs application
  • starlette (python application framework) – handle request/response
components

I considered using FastAPI for the application layer – but since I didn’t need any of those features, I didn’t add it, but it’s a great framework – check it out!

As the gunicorn server runs the starlette framework and the custom application code, I will be referring to them later as a single component, "app" – the tests compare the behavior between nginx and the "app" layer, using overall user-facing latency and throughput as the main result.

nginx 🌐

nginx is awesome. Really powerful, and has many built-in features, highly configurable. Been using it for years, and it’s my go-to choice for a reliable web server.

For our purposes, we need an external port to listen for inbound requests, and a stanza to proxy the requests to the upstream application server.

You might ask: Why use nginx at all, if Gunicorn can terminate connections directly? Well, there’s often a class of problems that nginx is better suited to handle than a fully-fledged Python runtime – examples include static file serving (robots.txt, favicon.ico, et al.) as well as caching, header or path rewriting, and more.

nginx is commonly used in front of all manner of applications.

Python Application 🐍

To support the testing of a real-world scenario, I’m creating a JSON response, as that’s how most web applications communicate today. This often incurs some serialization overhead in the application.

I took the example from starlette and added a couple of tweaks to emit the current timestamp and a random number. This prevents any potential caching occurring in any of the layers and polluting the experiment.

Here’s what the main request/response now looks like:

import datetime
import random

from starlette.responses import JSONResponse


async def homepage(request):
    # emit a timestamp and a random number so no layer can cache the response
    return JSONResponse(
        {
            "hello": "world",
            "utcnow": datetime.datetime.utcnow().isoformat(),
            "random": random.random(),
        }
    )

A response looks like this:

{
  "hello": "world",
  "utcnow": "2021-12-27T00:31:42.383861",
  "random": 0.5352573557347882
}

And while there are ways to improve JSON serialization speed, or tweak the Python runtime, I wanted to keep the experiment with defaults, since the point isn’t about maximizing total throughput, rather seeing the difference between the architectures.

Cloud Environment ☁️

For this experiment, I chose Amazon Elastic Container Service (ECS) with AWS Fargate compute. These choices provide a way to construct all the pieces needed in a repeatable fashion in the shortest amount of time, and abstract a lot of the infra concerns. To set everything up, I used AWS Copilot CLI, an open-source tool that does even more of the heavy lifting for me.

The Copilot Application type of Load Balanced Web Service will create an Application Load Balancer (ALB), which is the main external component outside my application stack, but an important one for actual scaling, SSL termination at the edge, and more. For the sake of this experiment, we assume (possibly incorrectly!) that ALBs will perform consistently for each test.

Architectures 🏛

Using containers, I wanted to test multiple architecture combinations to see which one proved the "best" when it came to user-facing performance.

Example 1: "tcp"

The communication between nginx container and the app container takes places over the dedicated network created by the Docker runtime (or Container Network Interface in Fargate). This means there’s TCP overhead between nginx and the app – but is it significant? Let’s find out!

Example 2: "sharedvolume"

Here we create a shared volume between the nginx container and the app container. Then we use a Unix domain socket to communicate between the containers using the shared volume.

This architecture maintains a separation of concerns between the two components, which is generally a good practice, so as to have a single essential process per container.

Example 3: "combined"

In this example, we combine both nginx and app in a single container, and use local Unix sockets within the container to communicate.

The main difference here is that we add a process supervisor to run both nginx and app runtimes – which some may consider an anti-pattern. I’m including it for the purpose of the experiment, mainly to uncover if there’s performance variation between a local volume and a shared volume.

This approach simulates what we’d expect in a single "server" scenario – where a traditional instance (hardware or virtual) runs multiple processes and all have some access to a local shared volume for inter-process communication (IPC).

To make this a fair comparison, I’ve also doubled the CPU and memory allocation.

Copilot ✈️

Time to get off the ground.

Copilot CLI assumes you already have an app prepared in a Dockerfile. The Quickstart has you clone a repo with a sample app – so instead I’ve created a Dockerfile for each of the architectures, along with a docker-compose.yml file for local orchestration of the components.

Then I’ll be able to launch and test each one in AWS with its own isolated set of resources – VPC, networking stack, and more.

I’m not going into all the details of how to install Copilot and launch the services, for that, read the Copilot CLI documentation (linked above), and read the experiment code.

This test is using AWS Copilot CLI v1.13.0.

Test Protocol 🔬

There’s an ever-growing list of tools and approaches to benchmark web request/response performance.

For the sake of time, I’ll use a single one here, to focus on the comparison of the server-side architecture performance.

All client-side requests will be performed from an AWS CloudShell instance running in the same AWS Region as the running services (us-east-1) to isolate a lot of potential network chatter. It’s not a perfect isolation of potential variables, but it’ll have to do.

To baseline, I ran each test locally (see later).

Apache Bench

Apache Bench, or ab, is a common tool for testing web endpoints, and is not specific to Apache httpd servers. I’m using: Version 2.3 <$Revision: 1879490 $>

I chose single concurrency, and ran 1,000 requests. I also ignore variable length, as the app can respond with a variable-length random number choice, and ab considers different length responses a failure unless specified.

ab -n 1000 -c 1 -l http://service-target....

Each test should take less than 5 seconds.

The important stats I’m comparing are:

  • Requests per second (mean) – higher is better
  • Time per request (mean) – lower is better
  • Duration at the 99th percentile – the time (in milliseconds) within which 99% of all requests completed – lower is better

To reduce variance, I also "warmed up" the container by running the test with a larger number of requests first.

Local Test

To establish a baseline, I ran the same benchmark test against the local services, using Docker Desktop 4.3.2 (72729) on macOS. These numbers aren’t demonstrative of a real user experience, but they provide a sense of performance before launching the architectures in the cloud.

arch                 | reqs per sec | ms per req | 99th pctile (ms)
tcp (local)          | 679.77       | 1.471      | 2
sharedvolume (local) | 715.62       | 1.397      | 2
combined (local)     | 705.55       | 1.871      | 2

In the local benchmark, the clear loser is the tcp architecture, and the sharedvolume has a slight edge on combined – but not a huge win. No real difference in the 99th percentiles – requests are being served in under 2ms.

This shows that the shared resources for the combined architecture are near the performance of the sharedvolume – possibly due to Docker Desktop’s bridging and network abstraction. A better comparison might be tested on a native Linux machine.

Remote Test

Once I ran through the setup steps using Copilot CLI to create the environment and services, I performed the same ab test, and collected the results in this table:

arch               | reqs per sec | ms per req | 99th pctile (ms)
tcp (aws)          | 447.57       | 2.234      | 5
sharedvolume (aws) | 394.55       | 2.535      | 6
combined (aws)     | 428.60       | 2.333      | 4

With the remote tests, it was a minor surprise that the combined service performed better than the sharedvolume service, as in the local test it performed worse.

The bigger surprise was to find that the tcp architecture wins slightly over the socket-based architectures.

This could be due to the way ECS Fargate uses the Firecracker microvm, and has tuned the network stack to perform faster than using a shared socket on a volume when communicating between two containers on the same host machine. The best part is – as a consumer of a utility, I don’t care, as long as it’s performing well!

ARM/Graviton Remote Test

The Copilot manifest defaults to the Intel x86 platform, so let’s also test the performance on the linux/arm64 platform (Graviton2, probably).

For this to work, I had to rebuild the nginx sidecars manually, as Copilot doesn’t yet build&push sidecar images. I also had to update the manifest.yml to set the desired platform, and deploy the service with copilot svc deploy .... (The combined version needed some Dockerfile surgery too…)

Results:

arch                   | reqs per sec | ms per req | 99th pctile (ms)
tcp (aws/arm)          | 475.03       | 2.105      | 3
sharedvolume (aws/arm) | 451.71       | 2.214      | 4
combined (aws/arm)     | 433.94       | 2.304      | 4

We can see that all the stats are better on the Graviton architecture, lending some more credibility to studies done by other benchmark posts and papers.

Aside: The linux/arm64-based container images were tens of megabytes smaller, so if space and network pull time is a concern, these will be a few microseconds faster.

Other Testing Tools

If you’re interested in performing longer tests, or emulating different user types, check out some of these other benchmark tools I considered and didn’t use for this experiment:

  • Python – https://locust.io/ https://molotov.readthedocs.io/
  • JavaScript – https://k6.io/
  • Golang – https://github.com/rakyll/hey
  • C – https://github.com/wg/wrk

There are also plenty of vendors that build out extensive load testing platforms – I’m not covering any of them here. If you run a test with these, I’d definitely like to see your results!

Conclusions 💡

Using the Copilot CLI wasn’t without some missteps – the team is hard at work improving the documentation, and are pretty responsive in both their GitHub Issues and Discussions, as well as their Gitter chat room – always helpful when learning a new framework. Once I got the basics, being able to establish a reproducible stack is valuable to the experimentation process, as I was able to provision and tear down the stack easily, as well as update with changes relatively easily.

Remember: these are micro-benchmarks, on not highly-tuned environments or real-world workloads. This test was designed to test a very specific type of workload, which may change as more concurrency is introduced, CPU or memory saturation is achieved, auto-scaling of application instances comes into play, and more.

Your mileage may vary.

When I started this experiment, I assumed the winner would be a socket-based communication architecture (sharedvolume or combined), based on existing literature, and it also made sense to me. The overhead of creating TCP packets between the processes would be eliminated, and thus performance would be better.

However, in these benchmarks, I found that using the TCP communication architecture performs best, possibly due to optimizations beyond our view in the underlying stack. This is precisely what I want from an infrastructure vendor – for them to figure out how to optimize performance without having to re-architect an application to perform better in a given deployment scenario.

The main conclusion I’ve drawn is: Using TCP to communicate between containers is best, as it affords the most flexibility, follows established patterns, and performs slightly better than the alternatives in a real(ish) world scenario. And if you can, use Graviton2 (ARM) CPU architecture.

Go forth, test your own scenarios, and let me know what you come up with. (Don’t forget to delete your resources when done!! 💸 )

AWS DeepComposer 🎹➡️☁️🎶

This year’s Amazon Web Services re:Invent conference in Las Vegas, Nevada, was a veritable smorgasbord of announcements, product launches, previews, and a ton of information to try and digest at once.

One very exciting announcement was AWS DeepComposer – which continues to expand on AWS’ mission of “Putting machine learning in the hands of every developer”.
Here’s a slick intro video from the product announcement – come back after!

The service is still in Preview mode, and has an application/review process – so while I wait for the application to clear, I figured I’d poke around a bit and see what I got.

📦 Box Contents

The box. Not super impressive.
The box, open. More impressive.

Opening the box, I’m immediately reminded of a 1980s Casio keyboard – we had one, and I enjoyed it a lot. This one is larger, and has no batteries or speakers.

The keyboard itself.

It’s a 32-key keyboard; while the key sizing isn’t 100% the same as that baby grand piano you have tucked somewhere in your vast mansion, it’ll probably be good enough.

The interface is USB Type B. I recently recycled roughly 20 of these cables in an e-waste purge, thinking “I don’t have anything that uses this connection!” Well, now I do. It’s 2019 – I thought at least Micro USB, if not USB-C, would have been the right choice?

Lucky for me, the box also contains a USB-A to USB-B cable, so at least that’s that.
Wait a minute… my 12-inch MacBook from 2016 that I’m using only has a single USB-C port.
Ruh-roh.
Apparently, I packed my USB-A to USB-C plug that I got with my Google Pixel 4 – let’s see if that will work! Even if it does, that means that I can’t use the DeepComposer and charge my laptop at the same time without an external port hub.
Considering that’s the only port (other than a 3.5mm audio jack) on my mac, I’m not too worried about it, especially since the battery is still pretty good.

There’s other packing materials, and a little card with a nice tagline of “Press play on ML” and a URL to visit: https://aws.amazon.com/startcomposing (redirects to the product page link – maybe a future device-specific landing page? Hmmm…)

⚡️ Power it up

I know I don’t have the provisioned account access yet, so I won’t be able to run all the things the presenter did in the video, so I figured I might poke around the connectivity interface and see what I might be able to glean in the absence of a proper setup.

Before I plug in the device, let’s also look at the current state of the Input/Output (I/O) devices, filtered specifically to the Apple USB Host Controller:

$ ioreg -w0 -rc AppleUSBHostController
+-o XHC1@14000000  <class AppleUSBXHCISPTLP, id 0x1000001dd, registered, matched, active, busy 0 (5263 ms), retain 55>
  | {
  |   "IOClass" = "AppleUSBXHCISPTLP"
  |   "kUSBSleepPortCurrentLimit" = 1500
  |   "IOPowerManagement" = {"ChildrenPowerState"=1,"DevicePowerState"=0,"CurrentPowerState"=1,"CapabilityFlags"=4,"MaxPowerState"=3,"DriverPowerState"=0}
  |   "IOProviderClass" = "IOPCIDevice"
  |   "IOProbeScore" = 1000
  |   "UsbRTD3Supported" = Yes
  |   "locationID" = 335544320
  |   "name" = <"XHC1">
  |   "64bit" = Yes
  |   "kUSBWakePortCurrentLimit" = 1500
  |   "IOPCIPauseCompatible" = Yes
  |   "device-properties" = {"acpi-device"="IOACPIPlatformDevice is not serializable","acpi-path"="IOACPIPlane:/_SB/PCI0@0/XHC1@140000"}
  |   "IOPCIPrimaryMatch" = "0x9d2f8086"
  |   "IOMatchCategory" = "IODefaultMatchCategory"
  |   "CFBundleIdentifier" = "com.apple.driver.usb.AppleUSBXHCIPCI"
  |   "Revision" = <0003>
  |   "IOGeneralInterest" = "IOCommand is not serializable"
  |   "IOPCITunnelCompatible" = Yes
  |   "controller-statistics" = {"kControllerStatIOCount"=78,"kControllerStatPowerStateTime"={"kPowerStateOff"="142ms (0%)","kPowerStateSleep"="40191894ms (99%)","kPowerStateOn"="75024ms (0%)","kPowerStateSuspended"="1332ms (0%)"},"kControllerStatSpuriousInterruptCount"=0}
  |   "kUSBSleepSupported" = Yes
  | }
  |
  +-o HS01@14100000  <class AppleUSB20XHCIPort, id 0x100000245, registered, matched, active, busy 0 (4773 ms), retain 13>
  +-o HS03@14200000  <class AppleUSB20XHCIPort, id 0x100000246, registered, matched, active, busy 0 (0 ms), retain 10>
  +-o HS04@14300000  <class AppleUSB20XHCIPort, id 0x100000249, registered, matched, active, busy 0 (0 ms), retain 10>
  +-o HS09@14400000  <class AppleUSB20XHCIPort, id 0x10000024c, registered, matched, active, busy 0 (0 ms), retain 9>
  +-o SSP1@14500000  <class AppleUSB30XHCIPort, id 0x10000024d, registered, matched, active, busy 0 (0 ms), retain 14>
  +-o SSP3@14600000  <class AppleUSB30XHCIPort, id 0x10000024e, registered, matched, active, busy 0 (0 ms), retain 12>
  +-o SSP4@14700000  <class AppleUSB30XHCIPort, id 0x10000024f, registered, matched, active, busy 0 (0 ms), retain 12>

A shorter version of this can be seen in the built-in System Information app, under the USB section.

Now I’m ready – let’s see what happens!

Plugging in, the first positive indication is that I see a series of red and blue LEDs briefly light up behind the top row of buttons, a quick cycle. So we know that at the very least, the little adapter is providing some power to the USB device.

Let’s look at the output of the I/O device state now:

$ ioreg -w0 -rc AppleUSBHostController
+-o XHC1@14000000  <class AppleUSBXHCISPTLP, id 0x1000001dd, registered, matched, active, busy 0 (7030 ms), retain 60>
  | {
  |   "IOClass" = "AppleUSBXHCISPTLP"
  |   "kUSBSleepPortCurrentLimit" = 1500
  |   "IOPowerManagement" = {"ChildrenPowerState"=3,"DevicePowerState"=2,"CurrentPowerState"=3,"CapabilityFlags"=32768,"MaxPowerState"=3,"DriverPowerState"=0}
  |   "IOProviderClass" = "IOPCIDevice"
  |   "IOProbeScore" = 1000
  |   "UsbRTD3Supported" = Yes
  |   "locationID" = 335544320
  |   "name" = <"XHC1">
  |   "64bit" = Yes
  |   "kUSBWakePortCurrentLimit" = 1500
  |   "IOPCIPauseCompatible" = Yes
  |   "device-properties" = {"acpi-device"="IOACPIPlatformDevice is not serializable","acpi-path"="IOACPIPlane:/_SB/PCI0@0/XHC1@140000"}
  |   "IOPCIPrimaryMatch" = "0x9d2f8086"
  |   "IOMatchCategory" = "IODefaultMatchCategory"
  |   "CFBundleIdentifier" = "com.apple.driver.usb.AppleUSBXHCIPCI"
  |   "Revision" = <0003>
  |   "IOGeneralInterest" = "IOCommand is not serializable"
  |   "IOPCITunnelCompatible" = Yes
  |   "controller-statistics" = {"kControllerStatIOCount"=104,"kControllerStatPowerStateTime"={"kPowerStateOff"="142ms (0%)","kPowerStateSleep"="40554314ms (99%)","kPowerStateOn"="245721ms (0%)","kPowerStateSuspended"="1333ms (0%)"},"kControllerStatSpuriousInterruptCount"=0}
  |   "kUSBSleepSupported" = Yes
  | }
  |
  +-o HS01@14100000  <class AppleUSB20XHCIPort, id 0x100000245, registered, matched, active, busy 0 (6540 ms), retain 18>
  | +-o AKM322@14100000  <class IOUSBHostDevice, id 0x100004670, registered, matched, active, busy 0 (1766 ms), retain 23>
  |   +-o AppleUSBHostLegacyClient  <class AppleUSBHostLegacyClient, id 0x100004673, !registered, !matched, active, busy 0, retain 9>
  |   +-o AppleUSBHostCompositeDevice  <class AppleUSBHostCompositeDevice, id 0x10000467b, !registered, !matched, active, busy 0, retain 4>
  |   +-o IOUSBHostInterface@0  <class IOUSBHostInterface, id 0x10000467d, registered, matched, active, busy 0 (3 ms), retain 6>
  |   +-o IOUSBHostInterface@1  <class IOUSBHostInterface, id 0x10000467e, registered, matched, active, busy 0 (3 ms), retain 6>
  +-o HS03@14200000  <class AppleUSB20XHCIPort, id 0x100000246, registered, matched, active, busy 0 (0 ms), retain 10>
  +-o HS04@14300000  <class AppleUSB20XHCIPort, id 0x100000249, registered, matched, active, busy 0 (0 ms), retain 10>
  +-o HS09@14400000  <class AppleUSB20XHCIPort, id 0x10000024c, registered, matched, active, busy 0 (0 ms), retain 9>
  +-o SSP1@14500000  <class AppleUSB30XHCIPort, id 0x10000024d, registered, matched, active, busy 0 (0 ms), retain 14>
  +-o SSP3@14600000  <class AppleUSB30XHCIPort, id 0x10000024e, registered, matched, active, busy 0 (0 ms), retain 12>
  +-o SSP4@14700000  <class AppleUSB30XHCIPort, id 0x10000024f, registered, matched, active, busy 0 (0 ms), retain 12>

Again, this is pretty verbose, but if you look closely, you’ll see that the device at address HS01@14100000 now has a sub-device associated with it – AKM322@14100000.

Yay! We can see that the device is powered, and the system registers it.

What is this thing??

A quick search for the device prefix string “AKM322” brought me to a device similar in nature:
https://www.amazon.com/midiplus-32-Key-Keyboard-Controller-AKM322/dp/B016O5F2GQ
Here’s the listing for the DeepComposer device: https://www.amazon.com/AWS-DeepComposer-learning-enabled-keyboard-developers/dp/B07YGZ4V5B/

If you’re asking – “why the price difference?”, well the DeepComposer device comes with some cloud features too!

We want you to know:
To train your models and create new musical compositions, AWS DeepComposer is priced at $99, this includes the keyboard, plus a 3-month free trial of AWS DeepComposer services to train your models and create original musical compositions. Each month of the free trial includes enough to cover up to 4 training jobs and 40 inference jobs per month, during the free trial period.

So for the dollar value, you’re getting not only the device, but also some AWS Cloud Goodness!

Visiting what appears to be the manufacturer’s page, we can see more details about the hardware, so that’s cool. It’s a MIDI device, translating analog signals (like pressing keys with different pressures and durations) into digital signals.
Cool stuff! There might be some secret AWS goodness in the DeepComposer model – we’ll have to wait and see.

Make some noise!!

Again, I don’t yet have access to the DeepComposer interface, so I found a macOS MIDI testing guide that I followed: https://support.apple.com/en-gb/HT201840

The test was successful, but I only got a single note “ding” response, confirming that the device works and can communicate back to my computer. But I want to hear something!

Apple produces Logic Pro – but at a $199 price tag, I don’t really want to spend that just to mess around until I can really try out the DeepComposer service.
Apple also produces GarageBand – for free! Fire it up, and wait for the 2GB download to complete over hotel wifi. This is also where I unplug the keyboard, and plug in the power – since we’re going to be here for a while…

I’ll check back once I’ve got some more details to report. Hope you enjoyed this set of musings, and hopefully I’ll have more to show you soon!

Other Reading

There’s not too much out there just yet – as this is a preview service, just announced.
I posted a link to a video of the original announcement, and you can also read some of the announcement blog post details here:

https://aws.amazon.com/blogs/aws/aws-deepcomposer-compose-music-with-generative-machine-learning-models/

Extending ECS Auto-scaling for under $2/month with Lambda

The Problem

Amazon Web Services (AWS) is pretty cool. You ought to know that by now. If you don’t, take a few hours and check out some tutorials and play around.

One of the many services AWS provides is the EC2 Container Service (ECS), where the scheduling and lifecycle management of running Docker containers is handled by the ECS control plane (probably magic cooked up in Seattle over coffee or in Dublin over a pint or seven).

You can read all about its launch here.

One missing feature from the ECS offering in comparison to other container schedulers was the concept of scheduling a service to be run on each host in a cluster, such as a logging or monitoring agent.
This feature allows clusters to grow or shrink and still have the correct services running on each node.

A published workaround was to have each node individually run an instance of the defined task on startup, which works pretty well.

The downside here is that if a task definition changes, ECS has no way of triggering an update to the running tasks – normal services will stop, then start the task with a new definition, and use your logic to maintain some degree of uptime.
To achieve the update, one must terminate/replace the entire ECS Container Instance (the EC2 host) and if you’re using AutoScalingGroups, get a fresh node with the updated task.

Other Solutions

  • Docker Swarm calls this a global service, and will run one instance of the service on every node.
  • Mesos’ Marathon doesn’t support this yet either, and is in deep discussion on GitHub on how to implement this in their constraints syntax.
  • Kubernetes has a DaemonSet to run a pod on each node.
  • The recently-released ECS-focused Blox provides a daemon-scheduler to accomplish this, but brings along extra components to accomplish the scheduling.

Back to ECS

So imagine my excitement when the ECS team announced the release of their new Task Placement Strategies last week, offering a “One Task Per Host” strategy as part of the Service declaration.
This indeed is awesome and works as advertised, with no extra components, installs, schedulers, etc.

However! Currently each Service requires a “Desired Count” parameter of how many instances of this service you want to run in the cluster.

Given a cluster with 5 ECS Container Instance hosts, setting the Desired Count to 5 ensures that one runs on each host, provided there are resources available (cpu, ram, available port).

If the cluster grows to 6 (autoscaling, manually adding, etc), there’s nothing in the Service definition that will increase the desired count to 6, so this solution is actually worse off than our previous mode of using user-data to run the task at startup.

One approach is to arbitrarily raise the Desired count to a very high number, such as 100 for this cluster, with the consideration that we are unlikely to grow the cluster to this size without realizing it.
The scheduler will periodically examine the cluster for placement, and handle any hosts missing the service.

The problem with this is that it’s not deterministic, and CloudWatch metrics will report these unplaced tasks as Pending, and I have alarms to notify me if tasks aren’t placed in clusters, as this can point to a resource allocation mismatch.

Enter The Players

To accomplish an automated service desired count, we must use some elements to “glue” a few of the systems together with our custom logic.

Here’s a sequence diagram of the conceptual flow between the components.

UML Sequence Flow

Every time there is a change in an ECS Cluster, CloudWatch Events will receive a payload.
Based on a rule we craft to select events classified as “Container Instance State Change”, CW Events will emit an event to the target of your choice, in our case, Lambda.

We could feasibly use a cron-like schedule to fire this every N minutes to inspect, evaluate, and remediate a semi-static set of services/cluster, but having a system that is reactive to change feels preferable to poll/test/repair.

A simple rule that captures all Container Instance changes:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Container Instance State Change"
  ]
}

You can restrict this to specific clusters by adding the cluster’s ARN to the keys like so:

  "detail": {
    "clusterArn": [
      "arn:aws:ecs:us-east-1:123456789012:cluster/my-specific-cluster",
      "arn:aws:ecs:us-east-1:123456789012:cluster/another-cluster"
    ]
  }

If being throttled or cost is a concern here, you may wish to filter to a set of known clusters, but this reduces the reactiveness of the logic to new clusters being brought online.

The Actual Logic

The Lambda function receives the event, performs some basic validation checks to ensure it has enough details to proceed, and then makes a single API call to the ECS endpoint to find our specified service in the cluster that fired the change event.

If no such service is found, we terminate now, and move on.

If the cluster does indeed have this service defined, then we perform another API call to describe the count of registered container instances, and compare that with the value we already have from the service definition call.

If there’s a mismatch, we perform a final third API call to adjust the service definition’s desired task count.

All in all, a maximum total of 3 possible API calls, usually in under 300ms.
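Conceptually, the flow looks something like this sketch (the real implementation lives in the GitHub repo linked below; the service name here is a placeholder, and error handling is omitted):

import boto3

ecs = boto3.client("ecs")
SERVICE_NAME = "per-host-agent"  # placeholder name for the service we want on every host

def handler(event, context):
    cluster_arn = event["detail"]["clusterArn"]

    # 1: does this cluster define our service at all?
    services = ecs.describe_services(cluster=cluster_arn, services=[SERVICE_NAME])["services"]
    if not services:
        return  # nothing to do for this cluster

    desired = services[0]["desiredCount"]

    # 2: how many container instances are registered right now?
    cluster = ecs.describe_clusters(clusters=[cluster_arn])["clusters"][0]
    registered = cluster["registeredContainerInstancesCount"]

    # 3: reconcile the desired count only when they differ
    if desired != registered:
        ecs.update_service(cluster=cluster_arn, service=SERVICE_NAME, desiredCount=registered)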

In my environment, I want this logic to apply to every cluster in my account, since the function inspects each cluster to see if it has the service definition applied to it before acting.
In my ballpark figures with a set of 10 active clusters, the cost for running this logic should be under $2/month – yes, two dollars a month to ensure your cluster has the correct number of tasks for a given service.
Do your own estimation with the Lambda Pricing Calculator.
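For a rough sense of the math (the invocation count, memory size, and current Lambda list prices below are my assumptions, not measured figures):

# back-of-envelope: assume ~100,000 state-change events/month across 10 clusters,
# 128 MB of memory, and ~300 ms per invocation at standard Lambda list prices
invocations = 100_000
gb_seconds = invocations * 0.3 * (128 / 1024)        # ≈ 3,750 GB-seconds
compute_cost = gb_seconds * 0.0000166667             # ≈ $0.06
request_cost = (invocations / 1_000_000) * 0.20      # ≈ $0.02
print(f"~${compute_cost + request_cost:.2f}/month")  # well under $2, before any free tier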

Conclusions

The code can be found on GitHub, and was developed with a test-everything philosophy, where I spent a large amount of time learning how to actually write the code and tests elegantly.
Writing out all of the tests and sequences allowed me to find multiple points of refactoring and increased efficiency from my first implementation, leading to a much cleaner solution.
Taking on a project like this is a great way to increase one’s own technical prowess, leading to the ability to reason about other problems.

While I strongly believe that this feature should be part of the ECS platform and not require any client-side intervention, the ability to take the current offerings and extend them via mechanisms such as Events, Lambda and API calls further demonstrates the flexibility and extensibility of the AWS ecosystem.
The feature launched just over a week ago, and I’ve been able to put together an acceptable solution on my own, using the documentation, tooling, and infrastructure while minimizing costs and making my system more reactive to change.

I look forward to what else the ECS, Lambda and CloudWatch Events team cook up in the future!

Setting Up a Datadog-to-AWS Integration

When approaching a new service provider, sometimes it can be confusing to figure out how to get set up to best communicate with them – some processes involve multiple steps, multiple interfaces, confusing terminology, and more.

Amazon Web Services is an amazing cloud services provider, and in order to allow access to informational services inside a customer’s account, a couple of known mechanisms exist to delegate access:

  • Account Keys, where you generate a key and secret and share them. The other party stores these (usually in either clear text or using reversible encryption) and uses them as needed to make API calls
  • Role Delegation, where you create a Role and shared secret to provide to the external service provider, who is then allowed to use their own internal security credentials to request temporary access to your account’s resources via API calls

In the former model, the keys are exchanged once, and once out of your immediate domain, you have little idea what happens to them.
In the latter, a rule is put into place that requires ongoing authenticated access to request assumption of a known role with a shared secret.

Luckily, in both scenarios, a restrictive IAM Policy is in place that allows only the actions you’ve decided to allow ahead of time.

Setting up the desired access is made simpler by having good documentation on how to do this manually. In this modern era, we likely want to keep our infrastructure as code where possible, as well as have a mechanism to apply the rules and test later if they are still valid.

Here’s a quick example I cooked up using Terraform, a new, popular tool to compose cloud infrastructure as code and execute to create the desired state.

# Read more about variables and how to override them here:
# https://www.terraform.io/docs/configuration/variables.html
variable "aws_region" {
  type    = "string"
  default = "us-east-1"
}

variable "shared_secret" {
  type    = "string"
  default = "SOOPERSEKRET"
}

provider "aws" {
  region = "${var.aws_region}"
}

resource "aws_iam_policy" "dd_integration_policy" {
  name        = "DatadogAWSIntegrationPolicy"
  path        = "/"
  description = "DatadogAWSIntegrationPolicy"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:Describe*",
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "ec2:Describe*",
        "ec2:Get*",
        "ecs:Describe*",
        "ecs:List*",
        "elasticache:Describe*",
        "elasticache:List*",
        "elasticloadbalancing:Describe*",
        "elasticmapreduce:List*",
        "iam:Get*",
        "iam:List*",
        "kinesis:Get*",
        "kinesis:List*",
        "kinesis:Describe*",
        "logs:Get*",
        "logs:Describe*",
        "logs:TestMetricFilter",
        "rds:Describe*",
        "rds:List*",
        "route53:List*",
        "s3:GetBucketTagging",
        "ses:Get*",
        "ses:List*",
        "sns:List*",
        "sns:Publish",
        "sqs:GetQueueAttributes",
        "sqs:ListQueues",
        "sqs:ReceiveMessage"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_role" "dd_integration_role" {
  name = "DatadogAWSIntegrationRole"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::464622532012:root" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "${var.shared_secret}" } }
  }
}
EOF
}

resource "aws_iam_policy_attachment" "allow_dd_role" {
  name       = "Allow Datadog PolicyAccess via Role"
  roles      = ["${aws_iam_role.dd_integration_role.name}"]
  policy_arn = "${aws_iam_policy.dd_integration_policy.arn}"
}

output "AWS Account ID" {
  value = "${aws_iam_role.dd_integration_role.arn}"
}

output "AWS Role Name" {
  value = "${aws_iam_role.dd_integration_role.name}"
}

output "AWS External ID" {
  value = "${var.shared_secret}"
}

The output should look a lot like this:

The Account ID is actually a full ARN, and you can copy your Account ID from there.
Terraform doesn’t have a mechanism to emit only the Account ID yet – so if you have some ideas, contribute!

Use the Account ID, Role Name and External ID and paste those into the Datadog Integrations dialog, after selecting Role Delegation. This will immediately validate that the permissions are correct, and return an error otherwise.

Don’t forget to click “Install Integration” when you’re done (it’s at the very bottom of the screen).

Now metrics and events will be collected by Datadog from any allowed AWS services, and you can keep this setup instruction in any revision system of your choice.

P.S. I tried to set this up via CloudFormation (SparkleFormation, too!). I ended up writing it “freehand” and it took more than 3 times as long to get similar functionality.

You can see the CloudFormation Stack here, and decide which works for you.



Counts are good, States are better

Datadog is great at pulling in large amounts of metrics, and provides a web-based platform to explore, find, and monitor a variety of systems.

One such system integration is PostgreSQL (aka ‘Postgres’, ‘PG’) – a popular Open Source object-relational database system, ranking #4 in its class (at the time of this writing), with over 15 years of active development, and an impressive list of featured users.
It’s been on an upwards trend for the past couple of years, fueled in part by Heroku Postgres, and has spun up entire companies supporting running Postgres, as well as Amazon Web Services providing PG as one of their engines in their RDS offering.

It’s awesome at a lot of things that I won’t get into here, but it’s definitely my go-to choice for relational data.

One of the hardest parts of any system is determining whether the current state of the system is better or worse than before, and tracking down the whys, hows and wheres it got to a worse state.

That’s where Datadog comes in – the Datadog Agent has included PG support since 2011, and over the past 5 years, has progressively improved and updated the mechanisms by which metrics are collected. Read a summary here.

Let’s Focus

Postgres has a large number of metrics associated with it, and there’s much to learn from each.

The one metric that I’m focusing on today is the “connections” metric.

By establishing a periodic collection of the count of connections, we can examine the data points over time and draw lines to show the values.
This is built in to the current Agent code, named postgresql.connections in Datadog, by selecting the value of the numbackends column from the pg_stat_database view.

01-default-connections

Another two metrics exist, introduced into the code around 2014, that assist with alerting on the reported counts.
These are postgresql.max_connections and postgresql.percent_usage_connections.

(Note: Changing PG’s max_connections value requires a server restart and in a replication cluster has other implications.)

The latter, percent_usage_connections, is a calculated value, returning ‘current / max’, which you could compute yourself in an alert definition if you wanted to account for other variables.
It is normally sufficient for these purposes.

02-pct_used-connections

A value of postgresql.percent_usage_connections:0.15 tells us that we’re using 15% of our maximum allowable connections. If this hits 1, then we will receive this kind of response from PG:

FATAL: too many connections for role...

And you likely have a Sad Day for a bit after that.

Setting an alert threshold at 0.85 – or a Change Alert to watch the percent change in the values over the previous time window – should prompt an operator to investigate the cause of the connections increase.
This can happen for a variety of reasons such as configuration errors, SQL queries with too-long timeouts, and a host of other possibilities, but at least we’ll know before that Sad Day hits.

Large Connection Counts

If you’ve launched your application, and nobody uses it, you’ll have very low connection counts, you’ll be fine. #dadjoke

If your application is scaling up, you are probably running more instances of said application, and if it uses the database (which is likely), the increase in connections to the database is typically linear with the count of running applications.

Some PG drivers offer connection pooling to the app layer, so as methods execute, instead of opening a fresh connection to the database (which is an expensive operation), the app maintains some amount of “persistent connections” to the database, and the methods can use one of the existing connections to communicate with PG.

This works for a while, especially if the driver can handle application concurrency, and if the overall count of application servers remains low.
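In Python land, for example, psycopg2 ships a simple pool that illustrates the idea (a sketch – the connection details are placeholders):

from psycopg2 import pool

# keep between 1 and 10 connections open, handing them out to request handlers
pg_pool = pool.SimpleConnectionPool(
    1, 10, dsn="dbname=app user=app host=db.internal"  # placeholder DSN
)

conn = pg_pool.getconn()       # borrow an established connection instead of opening a new one
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    pg_pool.putconn(conn)      # return it to the pool for the next caller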

The Postgres Wiki has an article on handling the number of database connections, in which the topic of a connection pooler comes up.
An excerpt:

If you look at any graph of PostgreSQL performance with number of connections on the x axis
and tps on the y access [sic] (with nothing else changing), you will see performance climb as
connections rise until you hit saturation, and then you have a “knee” after which performance
falls off.

The need for connection pooling is well established, and the decision to not have this part of core is spelled out in the article.

So we install a PG connection pooler, like PGBouncer (or pgpool, or something else), configure it to connect to PG, and point our apps at the pooler.

In doing so, we configure the pooler to establish some amount of connections to PG, so that when an application requests a connection, it can receive one speedily.

Interlude: Is Idle a Problem?

Over the past 4 years, I’ve heard the topic raised again and again:

If the max_connections is set in the thousands, and the majority of them are in idle state,
is that bad?

Let’s say that we have 10 poolers, and each establishes 100 connections to PG, for a max of 1000. These poolers serve some large number of application servers, but have the 1000 connections at-the-ready for any application request.

It is entirely possible that most of the time, a significant portion of these established connections are idle.

You can see a given connection’s state in the pg_stat_activity view, with a query like this:

SELECT datname, state, COUNT(state)
FROM pg_stat_activity
GROUP BY datname, state
HAVING COUNT(state) > 0;

A sample output from my local dev database that’s not doing much:

datname  | state  | count
---------+--------+-------
postgres | active |     1
postgres | idle   |     2
(2 rows)

We can see that there is a single active connection to the postgres database (that’s me!) and two idle connections from a recent application interaction.

If it’s idle, is it harming anyone?

A similar question was asked on the PG Mailing List in 2015, to which Tom Lane responded on the topic of idle connections (see the link for the full quote):

Those connections have to be examined when gathering snapshot information, since you don’t know that they’re idle until you look.
So the cost of taking a snapshot is proportional to the total number of connections, even when most are idle.
This sort of situation is known to aggravate contention for the ProcArrayLock, which is a performance bottleneck if you’ve got lots of CPUs.

So now we know why idling connections can impact performance despite not doing anything, especially on the multi-CPU instances we scale modern databases up to.

Back to the show!

Post-Pooling Idling

Now that we know that high connection counts are bad, and that pooling strategies let us cut the total count of connections, we must ask ourselves – how many connections do we actually need to keep established, without leaving so many idle that they impact performance?

We could log in, run the SELECT statement from before, and inspect the output, or we could add this to our Datadog monitoring, and trend it over time.

The Agent docs show how to write an Agent Check, and you could follow the current postgres.py to write another custom check, or you could use the nifty custom_metrics syntax in the default postgres.yaml to extend the existing check with additional queries.

Here’s an example:

custom_metrics:
  - # Postgres Connection state
    descriptors:
      - [datname, database]
      - [state, state]
    metrics:
      COUNT(state): [postgresql.connection_state, GAUGE]
    query: >
      SELECT datname, state, %s FROM pg_stat_activity
      GROUP BY datname, state HAVING COUNT(state) > 0;
    relation: false

Wait, what was that?

Let me explain each key, in an order that makes sense to me rather than alphabetically.

  • relation: false informs the check to perform this once per collection, not against each of any specified tables (relations) that are part of this database entry in the configuration.
  • query: This is pretty similar to our manual SELECT, with one key difference – the %s is replaced with the key from the metrics entry, so the Agent effectively runs the same query we ran by hand earlier.
  • metrics: The query is run once for each entry here, substituting the entry’s key into the query; the value specifies the metric name and type.
  • descriptors: Each column returned has a name, and here’s how we convert the returned name to a tag on the metric.

Placing this config section in our postgres.yaml file and restarting the Agent gives us the ability to define a query like this in a graph:

sum:postgresql.connection_state{*} by {state}

[Figure 03: connection_state by state graph]

As can be seen in this graph, the majority of my connections are idling, so I might want to re-examine my application or pooler configuration.

Who done it?

Let’s take this one step further, and ask ourselves – now that we know the state of each connection, how might we determine which of our many applications connecting to PG is idling, and target our efforts?

As luck would have it, back in PostgreSQL 9.0 (numbered 8.5 during development), a change was added to allow clients to set an application_name value during the connection, and this value is available in our pg_stat_activity view, as well as in logs.

This typically involves setting a configuration value at connection startup. In Django, this might be done with:

DATABASES = {
  'default': {
    'ENGINE': 'django.db.backends.postgresql',
    # ... other connection settings ...
    'OPTIONS': {
      'application_name': 'myapp',
    },
    # ...
  },
}

No matter which client library you’re using, most have a facility to pass extra arguments along, sometimes in the form of a database connection URI, which might look like:

postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp

Again, this all depends on your client library.
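
For example, connecting with psycopg2 directly, application_name is just another connection keyword (the values here are illustrative):

import psycopg2

conn = psycopg2.connect(
    dbname="otherdb",
    user="other",
    host="localhost",
    application_name="myapp",  # shows up in pg_stat_activity and the logs
)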

I can see clearly now

So now that we have the configuration in place, and have restarted all of our apps, a modification to our earlier Agent configuration code for postgres.yaml would look like:

custom_metrics:
  - # Postgres Connection state
    descriptors:
      - [datname, database]
      - [application_name, application_name]
      - [state, state]
    metrics:
      COUNT(state): [postgresql.connection_state, GAUGE]
    query: >
      SELECT datname, application_name, state, %s FROM pg_stat_activity
      GROUP BY datname, application_name, state HAVING COUNT(state) > 0;
    relation: false

With this extra dimension in place, we can craft queries like this:

sum:postgresql.connection_state{state:idle} by {application_name}

[Figure 04: idle connection_state by application_name graph]

So now I can see that my worker-medium application has the most idling connections, so there’s some tuning to be done here – either I’m opening too many connections for the application, or it’s not doing much work.

I can confirm this by refining the query to zero in on a single application_name:

sum:postgresql.connection_state{application_name:worker-medium} by {state}

[Figure 05: worker-medium connection_state by state graph]

Now that I’ve applied this methodology of surfacing connection states, I have much better visibility into what’s going on before making any changes to resolve it.

Go forth, measure, and learn how your systems evolve!

There’s a New Player in Town, named Habitat

You may have heard some buzz around the launch of Chef’s new open source project Habitat (still in beta), designed to change a bit of how we think about building and delivering software applications in the modern age.

There’s a lot of press, a video announcement, and even a Food Fight Show episode where we got to chat with some of the brains behind the framework and get into some of the nitty-gritty details.

In the vibrant Slack channel where a lot of the fast-paced discussion happens with a bunch of the core habitat developers, a community member had brought up a pain point, as many do.
They were trying to build a Python application, and had to resort to fiddling pretty hard with either the PYTHONPATH variable or with sys.path after installing dependencies.
One even used Virtualenv inside the isolated environment.

I had been working on an LLVM compiler package, and since LLVM is notoriously slow to compile on my laptop, I used the waiting time to get a Python web application working.

My setup is OS X 10.11.5, with Docker (native) 1.12.0-rc2 (almost out of beta!).

I decided to use the Flask web framework to carry out a Hello World, as it would prove out a few pieces:

  • Using Python to install dependencies using pip
  • Adding “local” code into a package
  • Importing the Python package in the app code
  • Executing the custom binary that the Flask package installs

Key element: it needed to be as simple as possible, but no simpler.

On my main machine, I wrote my application.
It listens on port 5000, and responds with a simple phrase.
Yay, I wrote a website.
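
The app itself is about as minimal as Flask gets – something along these lines (a sketch from memory; the real code lives in the repo linked below):

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello, World!\n"

if __name__ == "__main__":
    # Bind to all interfaces so the container port mapping works later on
    app.run(host="0.0.0.0", port=5000)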

Then I set about packaging it into a deliverable that, in habitat’s nomenclature, becomes a self-contained package, which can then be run via the habitat supervisor.

This all starts with getting the habitat executable, conveniently named hab.
hab is a recent addition to the Homebrew Casks family, so installing it was as simple as:

$ brew cask install hab

habitat version 0.7.0 was in use during the authoring of this article.

I sat down and wrote a plan.sh file that describes how to put the pieces together.

There are a bunch of phases in the build cycle that are fully customizable, or “stub-able” if you don’t want them to take the default action.
Some details were garnered from here, despite my package not being a binary.

Once I got my package built, it was a matter of figuring out how to run it. One of the default modes is to export the entire thing as a Docker image, so I set about running that, to get a feel for the iterative development cycle of making the application work as configured within the habitat universe.

(This step usually isn’t the best one for regular application development, but it is good for figuring out what needs to be configured and how.)

# In first OSX shell
$ hab studio enter
[1][default:/src:0]# build
...
   python-hello: Build time: 0m36s
[2][default:/src:0]# hab pkg export docker miketheman/python-hello
...
Successfully built 2d2740a182fb
[3][default:/src:0]#

# In another OSX shell:
$ docker run -it -p 5000:5000 -p 9631:9631 miketheman/python-hello
hab-sup(MN): Starting miketheman/python-hello
hab-sup(GS): Supervisor 172.17.0.3: cb719c1e-0cac-432a-8d86-afb676c3cf7f
hab-sup(GS): Census python-hello.default: 19b7533a-66ba-4c6f-b6b7-c011abd7dbe1
hab-sup(GS): Starting inbound gossip listener
hab-sup(GS): Starting outbound gossip distributor
hab-sup(GS): Starting gossip failure detector
hab-sup(CN): Starting census health adjuster
python-hello(SV): Starting
python-hello(O):  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

# In a third shell, or use a browser:
$ curl http://localhost:5000
Hello, World!

The code for this example can be found in this GitHub repo.
See the plan.sh and hooks/ for Habitat-related code.
The src/ directory is the actual Python app.

At this point, I declared success.

There are a number of other pieces to the puzzle that I haven’t explored yet, but getting this part running was the first step.
Items like interacting with the supervisor, director, healthchecks, and topologies have some basic docs, but there isn’t yet a bevy of examples or use cases to lean upon for inspiration.

During this process I uncovered a couple of bugs and submitted some feedback, and the team has been very receptive so far.
There are still a bunch of rough edges to be polished down, many around the documentation, use cases, how the pieces fit together, and what benefit it all drives.

There appear to be some hooks for using Chef Delivery as well – I haven’t seen those yet, as I don’t use Delivery.
I will likely try making a larger strawman deployment to test these pieces another time.

I am looking forward to seeing how this space evolves, and what place habitat will take in the ever-growing, ever-evolving software development life-cycle, as well as how the community approaches these concepts and terminology.

How Do You Let The Cat Out of the Bag?

On what has now come to be known as Star Wars Day, I thought it prudent to write about A New Hope, Mike-style.

A few weeks ago, I took a step in life that is a bit different from everything I’ve ever done before.
I know I’m likely to get questions about it, so I figured I would attempt to preemptively answer here.

I’ve left Datadog, a company I hold close and dear to my heart.

I started working as a consultant for Datadog in 2011 with some co-workers from an earlier position, and joined full-time in 2013. For the past 3 years, I’ve pretty much eaten, dreamt, and lived Datadog. It’s been an amazing ride.

Having the fortune to work with some of the smartest minds in the business, I was able to help build what I believe to be the best product in the marketplace of application and systems monitoring.

I still believe in the mission of building the best damn monitoring platform in the world, and have complete faith that the Datadog crew are up to the task.

Q: Were you let go?
A: No, I left of my own free will and accord.

Q: Why would you leave such a great place to work?
A: Well, 3 years (5 if you count the preliminary work) is a reasonable amount of time in today’s fast-paced market.
Over the course of my tenure, I learned a great many things, positively affected the lives of many, and grew in a direction that doesn’t exactly map to the current company’s vision for me.
There is likely a heavy dose of burnout in the mix as well.
Instead of letting it grow and fester until some sour outcome, I found it best to part ways as friends, knowing that we will definitely meet again, in some other capacity.
Taking a break to do some travel and focus on some non-work life goals for a short time felt like the right thing.

Q: Did some other company lure you away?
A: While I am lucky to receive a large amount of unsolicited recruiter email, I have not been hired by anyone else, rather choosing to take some time off to reflect on the past 20 years of my career, and figure out what it is that I want to try next.
I’m also trying a 30-day fitness challenge, something that has been consistently de-prioritized, in an attempt to get a handle on my fitness before jumping headfirst into the next life challenge, so recruiters – you will totally gain brownie points by not contacting me before June 4th.

Q: Are you considering leaving New York City?
A: A most emphatic No. I’ve lived in a couple of places in California, Austin TX, many locations in Israel, and now NYC. I really like the feel of this city.

Q: What about any Open Source you worked on?
A: Before I started at Datadog, and during my employment, I was lucky enough to have times when I was able to work on Open Source software, and will continue to do so as it interests me. It has never paid the bills, rather providing an interesting set of puzzles and challenges to solve.
If there’s a project that interests you and you’d like to help contribute to, please let me know!

Q: What’s your favorite flavor of ice cream?
A: That’s a hard one. I really like chocolate, and am pretty partial to Ben & Jerry’s Phish Food – it’s pretty awesome.

Q: What about a question that you didn’t answer here?
A: I’m pretty much available over all social channels – Facebook, Twitter, LinkedIn – and good ol’ email and phone.
I respond selectively, so please understand if you don’t hear back for a bit, or at all.
If it’s a really good question that might fit here, I might update the post.

TL; DR: I’m excited about taking a break from work for a bit, and enjoying some lazy summer days. Let’s drink a glass of wine sometime, and May the Fourth Be With You!

Reduce logging volume

Quick self-reminder on reducing logging volume when monitoring an http endpoint with the Datadog Agent HTTP Check.

For nginx, add something like:

    location / {
        if ($http_user_agent ~* "Datadog Agent/.*") {
            access_log off;
        }
        # ... the rest of your location configuration ...
    }

to your site’s location statement.

This should cut down on your logging volume, at the expense of not having a log statement for every time the check runs (once every 20 seconds).