Setting Up a Datadog-to-AWS Integration

When approaching a new service provider, it can be confusing to figure out how best to get set up to communicate with them – some processes involve multiple steps, multiple interfaces, and confusing terminology.

Amazon Web Services is an amazing cloud services provider, and in order to allow access to informational services inside a customer’s account, a couple of well-known mechanisms exist for delegating access:

  • Account Keys, where you generate a key and secret and share them. The other party stores these (usually in either clear text or using reversible encryption) and uses them as needed to make API calls.
  • Role Delegation, where you create a Role and a shared secret to provide to the external service provider, who is then allowed to use their own internal security credentials to request temporary access to your account’s resources via API calls.

In the former model, the keys are exchanged once, and once they are out of your immediate domain, you have little idea what happens to them.
In the latter, a trust rule is put into place that requires the provider to make ongoing, authenticated requests to assume a known role, presenting the shared secret each time.

Luckily, in both scenarios, a restrictive IAM Policy is in place that allows only the actions you’ve decided to allow ahead of time.

Setting up the desired access is made simpler by having good documentation on how to do this manually. In this modern era, though, we likely want to keep our infrastructure as code where possible, and to have a mechanism to apply the rules and later test whether they are still valid.

Here’s a quick example I cooked up using Terraform, a popular new tool for composing cloud infrastructure as code and applying it to create the desired state.

# Read more about variables and how to override them here:
# https://www.terraform.io/docs/configuration/variables.html
variable "aws_region" {
  type    = "string"
  default = "us-east-1"
}

variable "shared_secret" {
  type    = "string"
  default = "SOOPERSEKRET"
}

provider "aws" {
  region = "${var.aws_region}"
}

resource "aws_iam_policy" "dd_integration_policy" {
  name        = "DatadogAWSIntegrationPolicy"
  path        = "/"
  description = "DatadogAWSIntegrationPolicy"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:Describe*",
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "ec2:Describe*",
        "ec2:Get*",
        "ecs:Describe*",
        "ecs:List*",
        "elasticache:Describe*",
        "elasticache:List*",
        "elasticloadbalancing:Describe*",
        "elasticmapreduce:List*",
        "iam:Get*",
        "iam:List*",
        "kinesis:Get*",
        "kinesis:List*",
        "kinesis:Describe*",
        "logs:Get*",
        "logs:Describe*",
        "logs:TestMetricFilter",
        "rds:Describe*",
        "rds:List*",
        "route53:List*",
        "s3:GetBucketTagging",
        "ses:Get*",
        "ses:List*",
        "sns:List*",
        "sns:Publish",
        "sqs:GetQueueAttributes",
        "sqs:ListQueues",
        "sqs:ReceiveMessage"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_role" "dd_integration_role" {
  name               = "DatadogAWSIntegrationRole"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::464622532012:root" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "${var.shared_secret}" } }
  }
}
EOF
}

resource "aws_iam_policy_attachment" "allow_dd_role" {
  name       = "Allow Datadog PolicyAccess via Role"
  roles      = ["${aws_iam_role.dd_integration_role.name}"]
  policy_arn = "${aws_iam_policy.dd_integration_policy.arn}"
}

output "AWS Account ID" {
  value = "${aws_iam_role.dd_integration_role.arn}"
}

output "AWS Role Name" {
  value = "${aws_iam_role.dd_integration_role.name}"
}

output "AWS External ID" {
  value = "${var.shared_secret}"
}

The output should look a lot like this:
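(Roughly, anyway – the exact formatting varies by Terraform version, and the account number and secret below are placeholders, not real values.)

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Outputs:

  AWS Account ID  = arn:aws:iam::123456789012:role/DatadogAWSIntegrationRole
  AWS External ID = SOOPERSEKRET
  AWS Role Name   = DatadogAWSIntegrationRole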

The Account ID is actually a full ARN, and you can copy your Account ID from there.
Terraform doesn’t have a mechanism to emit only the Account ID yet – so if you have some ideas, contribute!

Paste the Account ID, Role Name and External ID into the Datadog Integrations dialog, after selecting Role Delegation. This immediately validates that the permissions are correct, and returns an error otherwise.

Don’t forget to click “Install Integration” when you’re done (it’s at the very bottom of the screen).

Now metrics and events will be collected by Datadog from any allowed AWS services, and you can keep this setup code in any revision control system of your choice.

P.S. I tried to set this up via CloudFormation (SparkleFormation, too!). I ended up writing it “freehand”, and it took more than 3 times as long to get similar functionality.

You can see the CloudFormation Stack here, and decide which works for you.



Counts are good, States are better

Datadog is great at pulling in large amounts of metrics, and provides a web-based platform to explore, find, and monitor a variety of systems.

One such integration is PostgreSQL (aka ‘Postgres’, ‘PG’) – a popular open source object-relational database system, ranking #4 in its class (at the time of this writing), with over 15 years of active development and an impressive list of featured users.
It’s been on an upward trend for the past couple of years, fueled in part by Heroku Postgres; it has spawned entire companies dedicated to supporting Postgres, and Amazon Web Services offers PG as one of the engines in its RDS product.

It’s awesome at a lot of things that I won’t get into here, but it is definitely my go-to choice for relational data.

One of the hardest parts of operating any system is determining whether its current state is better or worse than before, and tracking down the why, how, and where it got worse.

That’s where Datadog comes in – the Datadog Agent has included PG support since 2011, and over the past 5 years, has progressively improved and updated the mechanisms by which metrics are collected. Read a summary here.

Let’s Focus

Postgres has a large number of metrics associated with it, and there’s much to learn from each.

The one metric that I’m focusing on today is the “connections” metric.

By establishing a periodic collection of the connection count, we can examine the data points over time and draw lines to show the values.
This is built in to the current Agent code as postgresql.connections in Datadog, collected by selecting the value of the numbackends column from the pg_stat_database view.
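Under the hood, that boils down to a query roughly like the following (a simplified illustration – the Agent’s actual SQL gathers more columns):

SELECT datname, numbackends
FROM pg_stat_database;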

01-default-connections

Two more metrics exist, introduced into the Agent code around 2014, that assist with alerting on the reported counts.
These are postgresql.max_connections and postgresql.percent_usage_connections.

(Note: Changing PG’s max_connections value requires a server restart and in a replication cluster has other implications.)

The latter, percent_usage_connections, is a calculated value returning ‘current / max’, which you could also compute yourself in an alert definition if you wanted to account for other variables.
The precomputed value is normally sufficient for these purposes.

02-pct_used-connections

A value of postgresql.percent_usage_connections:0.15 tells us that we’re using 15% of our maximum allowable connections. If this hits 1, then we will receive this kind of response from PG:

FATAL: sorry, too many clients already

And you likely have a Sad Day for a bit after that.

Setting an alert threshold at 0.85 – or a Change Alert to watch the percent change in the values over the previous time window – should prompt an operator to investigate the cause of the increase in connections.
This can happen for a variety of reasons, such as configuration errors, SQL queries with too-long timeouts, and a host of other possibilities, but at least we’ll know before that Sad Day hits.
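For instance, a Datadog metric monitor query along these lines would fire before we hit the wall (the threshold and evaluation window here are illustrative):

avg(last_5m):avg:postgresql.percent_usage_connections{*} > 0.85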

Large Connection Counts

If you’ve launched your application and nobody uses it, you’ll have very low connection counts, and you’ll be fine. #dadjoke

If your application is scaling up, you are probably running more instances of said application, and if it uses the database (which is likely), the number of connections to the database typically grows linearly with the number of running application instances.

Some PG drivers offer connection pooling at the app layer: as methods execute, instead of opening a fresh connection to the database (which is an expensive operation), the app maintains some number of “persistent connections” to the database, and the methods can use one of the existing connections to communicate with PG.

This works for a while, especially if the driver can handle application concurrency, and if the overall count of application servers remains low.
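As an illustration (my own sketch, not from the original setup), an application-side pool in SQLAlchemy might be configured like this:

from sqlalchemy import create_engine, text

# Keep up to 5 persistent connections open, allowing 10 extra under burst load.
engine = create_engine(
    "postgresql://app_user:app_pass@localhost/app_db",
    pool_size=5,
    max_overflow=10,
)

# Each call borrows a connection from the pool instead of opening a fresh one.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))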

The Postgres Wiki has an article on handling the number of database connections, in which the topic of a connection pooler comes up.
An excerpt:

If you look at any graph of PostgreSQL performance with number of connections on the x axis
and tps on the y access [sic] (with nothing else changing), you will see performance climb as
connections rise until you hit saturation, and then you have a “knee” after which performance
falls off.

The need for connection pooling is well established, and the decision to not have this part of core is spelled out in the article.

So we install a PG connection pooler, like PgBouncer (or pgpool, or something else), configure it to connect to PG, and point our apps at the pooler.

In doing so, we configure the pooler to establish some number of connections to PG, so that when an application requests a connection, it can receive one speedily.
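For reference, a minimal PgBouncer configuration might look something like this sketch – the values are illustrative and worth checking against the PgBouncer docs rather than copying blindly:

; pgbouncer.ini
[databases]
app_db = host=127.0.0.1 port=5432 dbname=app_db

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
; connections PgBouncer keeps open to Postgres per database/user pair
default_pool_size = 20
; client connections PgBouncer itself will accept from the applications
max_client_conn = 500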

Interlude: Is Idle a Problem?

Over the past 4 years, I’ve heard this topic raised again and again:

If the max_connections is set in the thousands, and the majority of them are in idle state,
is that bad?

Let’s say that we have 10 poolers, and each establishes 100 connections to PG, for a max of 1000. These poolers serve some large number of application servers, but have the 1000 connections at-the-ready for any application request.

It is entirely possible that most of the time, a significant portion of these established connections are idle.

You can see a given connection’s state in the pg_stat_activity view, with a query like this:

SELECT datname, state, COUNT(state)
FROM pg_stat_activity
GROUP BY datname, state
HAVING COUNT(state) > 0;

A sample output from my local dev database that’s not doing much:

datname  | state  | count
---------+--------+-------
postgres | active |     1
postgres | idle   |     2
(2 rows)

We can see that there is a single active connection to the postgres database (that’s me!) and two idle connections from a recent application interaction.

If it’s idle, is it harming anyone?

A similar question was asked on the PG mailing list in 2015, to which Tom Lane responded on the topic of idle connections (see the link for the full quote):

Those connections have to be examined when gathering snapshot information, since you don’t know that they’re idle until you look.
So the cost of taking a snapshot is proportional to the total number of connections, even when most are idle.
This sort of situation is known to aggravate contention for the ProcArrayLock, which is a performance bottleneck if you’ve got lots of CPUs.

So now we know why idling connections can impact performance despite not doing anything, especially on the modern multi-CPU instances we scale our databases up to.

Back to the show!

Post-Pooling Idling

Now that we know that high connection counts are bad, and that we can cut the total number of connections with pooling strategies, we must ask ourselves: how many connections do we actually need to keep established, without carrying a high count of idling connections that impacts performance?

We could log in, run the SELECT statement from before, and inspect the output, or we could add this to our Datadog monitoring, and trend it over time.

The Agent docs show how to write an Agent Check, and you could follow the current postgres.py to write another custom check, or you could use the nifty custom_metrics syntax in the default postgres.yaml to extend the existing check with additional queries.

Here’s an example:

custom_metrics:
  - # Postgres Connection state
    descriptors:
      - [datname, database]
      - [state, state]
    metrics:
      COUNT(state): [postgresql.connection_state, GAUGE]
    query: >
      SELECT datname, state, %s FROM pg_stat_activity
      GROUP BY datname, state HAVING COUNT(state) > 0;
    relation: false

Wait, what was that?

Let me explain each key, in an order that makes sense to me rather than alphabetically.

  • relation: false informs the check to perform this once per collection, not against each of any specified tables (relations) that are part of this database entry in the configuration.
  • query: This is pretty similar to our manual SELECT, with one key difference – the %s tells the check to substitute the contents of the metrics key into the query.
  • metrics: For each entry here, the query is run with the entry’s key substituted in; the metric name and type are specified in the value.
  • descriptors: Each returned column has a name, and this is how we map the returned column name to a tag on the metric.

Placing this config section in our postgres.yaml file and restarting the Agent gives us the ability to define a query like this in a graph:

sum:postgresql.connection_state{*} by {state}

03-conn_state-by-state

As can be seen in this graph, the majority of my connections are idling, so I might want to re-examine my application or pooler configuration.

Who done it?

Let’s take this one step further, and ask ourselves – now that we know the state of each connection, how might we determine which of our many applications connecting to PG is idling, and target our efforts?

As luck would have it, back in the PostgreSQL 9.0 development cycle (then numbered 8.5), a change was added to allow clients to set an application_name value during the connection, and this value is available in the pg_stat_activity view, as well as in logs.

This typically involves setting a configuration value at connection startup. In Django, this might be done with:

DATABASES = {
  'default': {
    'ENGINE': 'django.db.backends.postgresql',
    # ... NAME, USER, PASSWORD, HOST, and the rest of your settings ...
    'OPTIONS': {
      'application_name': 'myapp',
    },
    # ...
  },
}

No matter which client library you’re using, most have a facility to pass extra arguments along, some in the form of a database connection URI, so this might look like:

postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp

Again, this all depends on your client library.
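For example, with psycopg2 the name can be passed as a connection keyword, since extra keyword arguments are forwarded as libpq connection parameters (the connection details below are placeholders):

import psycopg2

# application_name will show up in pg_stat_activity and in the server logs.
conn = psycopg2.connect(
    dbname="app_db",
    user="app_user",
    host="localhost",
    application_name="myapp",
)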

I can see clearly now

So now that we have the configuration in place and have restarted all of our apps, a modified version of our earlier Agent configuration for postgres.yaml looks like this:

custom_metrics:
  - # Postgres Connection state
    descriptors:
      - [datname, database]
      - [application_name, application_name]
      - [state, state]
    metrics:
      COUNT(state): [postgresql.connection_state, GAUGE]
    query: >
      SELECT datname, application_name, state, %s FROM pg_stat_activity
      GROUP BY datname, application_name, state HAVING COUNT(state) > 0;
    relation: false

With this extra dimension in place, we can craft queries like this:

sum:postgresql.connection_state{state:idle} by {application_name}

04-conn_state-idle-by-app_name

So now I can see that my worker-medium application has the most idling connections, so there’s some tuning to be done here – either I open too many connections for the application, or it’s not doing much.

I can confirm this by refining the query to narrow in on a single application_name:

sum:postgresql.connection_state{application_name:worker-medium} by {state}

05-conn_state-app_name-by-state

So now I’ve applied a methodology for surfacing connection states and increased visibility into what’s going on – all before making any changes to resolve the issue.

Go forth, measure, and learn how your systems evolve!

There’s a New Player in Town, named Habitat

You may have heard some buzz around the launch of Chef’s new open source project Habitat (still in beta), designed to change a bit of how we think about building and delivering software applications in the modern age.

There’s been a lot of press, a video announcement, and even a Food Fight Show episode where we got to chat with some of the brains behind the framework and get into some of the nitty-gritty details.

In the vibrant Slack channel where a lot of the fast-paced discussion happens with a bunch of the core Habitat developers, a community member brought up a pain point, as many do.
They were trying to build a Python application, and had to resort to fiddling pretty heavily with either the PYTHONPATH variable or with sys.path after installing dependencies.
One even used Virtualenv inside the isolated environment.

I had been working on an LLVM compiler package, and since it is notoriously slow to compile on my laptop, I used the waiting time to get a Python web application working.

My setup is OSX 10.11.5, with Docker (native) 1.12.0-rc2 (almost out of beta!).

I decided to use the Flask web framework to carry out a Hello World, as it would prove out a few pieces:

  • Using Python to install dependencies using pip
  • Adding “local” code into a package
  • Importing the Python package in the app code
  • Executing the custom binary that the Flask package installs

Key element: it needed to be as simple as possible, but no simpler.

On my main machine, I wrote my application.
It listens on port 5000, and responds with a simple phrase.
Yay, I wrote a website.
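The app itself is about as small as Flask allows – roughly the following (a from-memory sketch; the real source is in the repo linked below):

from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    return "Hello, World!\n"


if __name__ == "__main__":
    # Listen on all interfaces so the app is reachable from outside a container.
    app.run(host="0.0.0.0", port=5000)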

Then I set about packaging it into a deliverable which, in Habitat’s nomenclature, becomes a self-contained package that can then be run via the Habitat supervisor.

This all starts with getting the Habitat executable, conveniently named hab.
Habitat is a recent addition to the Homebrew Casks family, so installing it was as simple as:

$ brew cask install hab

Habitat version 0.7.0 was in use during the writing of this article.

I sat down and wrote a plan.sh file that describes how to put the pieces together.

There are a bunch of phases in the build cycle that are fully customizable, or “stub-able” if you don’t want them to take the default action.
Some details were garnered from here, despite my package not being a binary.
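For orientation, a stripped-down plan.sh for a Python web app looks roughly like the sketch below. Treat it as an approximation: the dependency name and install steps are my assumptions, and the real plan lives in the GitHub repo linked further down.

pkg_origin=miketheman
pkg_name=python-hello
pkg_version=0.1.0
# Dependency name is an assumption – use whichever core Python package fits.
pkg_deps=(core/python2)

do_build() {
  # Nothing to compile for a pure-Python app.
  return 0
}

do_install() {
  # Illustrative only: copy the app into the package and install its pip deps.
  cp -R "$PLAN_CONTEXT/src" "$pkg_prefix/"
  pip install -r "$pkg_prefix/src/requirements.txt" --prefix "$pkg_prefix"
}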

Once I got my package built, it was a matter of figuring out how to run it. One of the default modes is to export the entire thing as a Docker image, so I set about doing that, to get a feel for the iterative development cycle of making the application work as configured within the Habitat universe.

(This step usually isn’t the best one for regular application development, but it is good for figuring out what needs to be configured and how.)

# In first OSX shell
$ hab studio enter
[1][default:/src:0]# build
...
   python-hello: Build time: 0m36s
[2][default:/src:0]# hab pkg export docker miketheman/python-hello
...
Successfully built 2d2740a182fb
[3][default:/src:0]#

# In another OSX shell:
$ docker run -it -p 5000:5000 -p 9631:9631 miketheman/python-hello
hab-sup(MN): Starting miketheman/python-hello
hab-sup(GS): Supervisor 172.17.0.3: cb719c1e-0cac-432a-8d86-afb676c3cf7f
hab-sup(GS): Census python-hello.default: 19b7533a-66ba-4c6f-b6b7-c011abd7dbe1
hab-sup(GS): Starting inbound gossip listener
hab-sup(GS): Starting outbound gossip distributor
hab-sup(GS): Starting gossip failure detector
hab-sup(CN): Starting census health adjuster
python-hello(SV): Starting
python-hello(O):  * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

# In a third shell, or use a browser:
$ curl http://localhost:5000
Hello, World!

The code for this example can be found in this GitHub repo.
See the plan.sh and hooks/ for Habitat-related code.
The src/ directory is the actual Python app.

At this point, I declared success.

There are a lot of other pieces to the puzzle that I haven’t explored yet, but getting this part running was the first step.
Items like interacting with the supervisor, director, health checks, topologies – these have some basic docs, but there isn’t a bevy of examples or use cases yet to lean upon for inspiration.

During this process I uncovered a couple of bugs and submitted some feedback, and the team has been very receptive so far.
There are still a bunch of rough edges to be polished down, many around the documentation, the use cases, how the pieces fit together, and what benefit it all drives.

There appear to be some hooks for using Chef Delivery as well – I haven’t seen those yet, as I don’t use Delivery.
I will likely try looking at making a larger strawman deployment to test these pieces another time.

I am looking forward to seeing how this space evolves, and what place habitat will take in the ever-growing, ever-evolving software development life-cycle, as well as how the community approaches these concepts and terminology.

How Do You Let The Cat Out of the Bag?

On what has now come to be known as Star Wars Day, I thought it prudent to write about A New Hope, Mike-style.

A few weeks ago, I took a step in life that is a bit different from everything I’ve ever done before.
I know I’m likely to get questions about it, so I figured I would attempt to preemptively answer here.

I’ve left Datadog, a company I hold close and dear to my heart.

I started working as a consultant for Datadog in 2011 with some co-workers from an earlier position, and joined full-time in 2013. For the past 3 years, I’ve pretty much eaten, dreamt, and lived Datadog. It’s been an amazing ride.

Having had the fortune to work with some of the smartest minds in the business, I was able to help build what I believe to be the best product in the application and systems monitoring marketplace.

I still believe in the mission of building the best damn monitoring platform in the world, and have complete faith that the Datadog crew are up to the task.

Q: Were you let go?
A: No, I left of my own free will and accord.

Q: Why would you leave such a great place to work?
A: Well, 3 years (5 if you count the preliminary work) is a reasonable amount of time in today’s fast-paced market.
Over the course of my tenure, I learned a great many things, positively affected the lives of many, and grew in a direction that doesn’t exactly map to the company’s current vision for me.
There is likely a heavy dose of burnout in the mix as well.
Instead of letting it grow and fester until some sour outcome, I found it best to part ways as friends, knowing that we will definitely meet again, in some other capacity.
Taking a break to do some travel and focus on some non-work life goals for a short time felt like the right thing.

Q: Did some other company lure you away?
A: While I am lucky to receive a large amount of unsolicited recruiter email, I have not been hired by anyone else; rather, I’m choosing to take some time off to reflect on the past 20 years of my career and figure out what it is that I want to try next.
I’m also trying a 30-day fitness challenge, something that has been consistently de-prioritized, in an attempt to get a handle on my fitness before jumping headfirst into the next life challenge – so recruiters, you will totally gain brownie points by not contacting me before June 4th.

Q: Are you considering leaving New York City?
A: A most emphatic No. I’ve lived in a couple of places in California, Austin TX, many locations in Israel, and now NYC. I really like the feel of this city.

Q: What about any Open Source you worked on?
A: Before I started at Datadog, and during my employment, I was lucky enough to have times when I could work on open source software, and I will continue to do so as it interests me. It has never paid the bills; rather, it provides an interesting set of puzzles and challenges to solve.
If there’s a project that interests you and you’d like to help contribute to, please let me know!

Q: What’s your favorite flavor of ice cream?
A: That’s a hard one. I really like chocolate, and am pretty partial to Ben & Jerry’s Phish Food – it’s pretty awesome.

Q: What about a question that you didn’t answer here?
A: I’m pretty much available over all social channels – Facebook, Twitter, LinkedIn – and good ol’ email and phone.
I respond selectively, so please understand if you don’t hear back for a bit, or at all.
If it’s a really good question that might fit here, I might update the post.

TL; DR: I’m excited about taking a break from work for a bit, and enjoying some lazy summer days. Let’s drink a glass of wine sometime, and May the Fourth Be With You!