Container-to-Container Communication

Question ❓

In a containerized world, is there a material difference between communicating over local network TCP vs local Unix domain sockets?

Given an application made up of multiple containers that need to talk to each other, is there an observable difference in latency or throughput between one inter-component communication method and another, from an end user's perspective?


Background 🌆

There’s an excellent write-up comparing the two from back in 2005, and many things have changed since then – especially optimizations in the kernel and networking stack, along with container runtimes that are usually abstracted away from the end user’s concerns. Redis benchmarks from a few years ago also point to significant improvements with Unix sockets when the server and the benchmark client are co-located.

There are other studies out there with their own performance comparisons and charts to go with them – and every example is going to have its own set of controls and caveats.

I wanted to use a common-ish scenario: a web service running on cloud infrastructure I don’t own.

Components 🧩

For the experiment, I chose this set of components:

  • nginx (web server) – terminates SSL, proxies requests to the upstream application server
  • gunicorn (HTTP server) – speaks HTTP and the WSGI protocol, runs the application
  • starlette (Python application framework) – handles the request/response

I considered using FastAPI for the application layer, but since I didn’t need its additional features, I left it out – it’s a great framework though; check it out!

Since gunicorn runs the starlette framework and the custom application code, I’ll refer to them together as a single component, "app", from here on. The tests compare the behavior between nginx and the "app" layer, using overall user-facing latency and throughput as the main results.
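
Since starlette is an ASGI framework, gunicorn typically runs it through uvicorn’s worker class. As a rough sketch of the kind of command involved (the module path app:app, bind address, and worker count are assumptions, not necessarily what this experiment used):

# Sketch: run the starlette (ASGI) app under gunicorn via uvicorn's worker class.
# "app:app", the bind address, and the worker count are illustrative assumptions;
# the --bind target changes per architecture (TCP port vs. Unix socket), as shown later.
gunicorn app:app \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --workers 2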

nginx 🌐

nginx is awesome: really powerful, highly configurable, and full of built-in features. I’ve been using it for years, and it’s my go-to choice for a reliable web server.

For our purposes, we need an external port to listen for inbound requests, and a stanza to proxy the requests to the upstream application server.
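
As a sketch, the relevant nginx stanza looks something like this – the listen port and upstream address are assumptions, and the upstream target is exactly what changes between the architectures below:

# Illustrative nginx config: listen on an external port, proxy to the app upstream.
upstream app_server {
    server app:8000;    # TCP or Unix-socket address of gunicorn – varies per architecture
}

server {
    listen 80;

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://app_server;
    }
}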

You might ask: why use nginx at all, if gunicorn can terminate connections directly? Well, there’s a class of problems that nginx is better suited to handle than a fully-fledged Python runtime – examples include serving static files (robots.txt, favicon.ico, et al.), caching, header or path rewriting, and more.

nginx is commonly used in front of all manner of applications.

Python Application 🐍

To support the testing of a real-world scenario, I’m creating a JSON response, as that’s how most web applications communicate today. This often incurs some serialization overhead in the application.

I took the example from starlette and added a couple of tweaks to emit the current timestamp and a random number. This prevents any of the layers from caching responses and polluting the experiment.

Here’s what the main request/response now looks like:

import datetime
import random

from starlette.responses import JSONResponse


async def homepage(request):
    # The timestamp and random value keep every response unique, defeating any caching layer.
    return JSONResponse(
        {
            "hello": "world",
            "utcnow": datetime.datetime.utcnow().isoformat(),
            "random": random.random(),
        }
    )

A response looks like this:

{
  "hello": "world",
  "utcnow": "2021-12-27T00:31:42.383861",
  "random": 0.5352573557347882
}

And while there are ways to improve JSON serialization speed or tweak the Python runtime, I wanted to keep the experiment at the defaults, since the point isn’t to maximize total throughput but to see the difference between the architectures.

Cloud Environment ☁️

For this experiment, I chose Amazon Elastic Container Service (ECS) with AWS Fargate compute. These choices provide a way to construct all the pieces needed in a repeatable fashion in the shortest amount of time, and abstract a lot of the infra concerns. To set everything up, I used AWS Copilot CLI, an open-source tool that does even more of the heavy lifting for me.

The Copilot "Load Balanced Web Service" service type creates an Application Load Balancer (ALB) – the main external component outside my application stack, but an important one for actual scaling, SSL termination at the edge, and more. For the sake of this experiment, we assume (possibly incorrectly!) that the ALB performs consistently across tests.
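
For context, a Load Balanced Web Service is described by a Copilot manifest.yml roughly like the following trimmed sketch – the service name, port, and resource sizes are assumptions rather than the exact manifest used here:

# Sketch of a Copilot manifest for a Load Balanced Web Service.
name: tcp
type: Load Balanced Web Service

http:
  path: '/'        # route traffic from the ALB to this service

image:
  build: Dockerfile
  port: 80

cpu: 256
memory: 512
count: 1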

Architectures 🏛

Using containers, I wanted to test multiple architecture combinations to see which one proved the "best" when it came to user-facing performance.

Example 1: "tcp"

The communication between the nginx container and the app container takes place over the dedicated network created by the Docker runtime (or the Container Network Interface in Fargate). This means there’s TCP overhead between nginx and the app – but is it significant? Let’s find out!
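
Concretely, the two sides of the tcp variant differ from the socket variants only in their addresses – a sketch, with the container name "app" and port 8000 as assumptions:

# app container: gunicorn listens on a TCP port on the container network.
gunicorn app:app --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

# nginx container: the upstream points at the app container by its network name.
upstream app_server {
    server app:8000;
}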

Example 2: "sharedvolume"

Here we create a shared volume between the nginx container and the app container. Then we use a Unix domain socket to communicate between the containers using the shared volume.

This architecture maintains a separation of concerns between the two components, which is generally a good practice, so as to have a single essential process per container.
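
A sketch of the wiring, using docker-compose syntax for the local setup – the service names, volume name, and socket path are all assumptions:

# Share a named volume between the containers and communicate over a Unix socket.
services:
  nginx:
    build: ./nginx
    ports:
      - "8080:80"
    volumes:
      - appsock:/var/run/app
    depends_on:
      - app
  app:
    build: ./app
    command: >
      gunicorn app:app
      --worker-class uvicorn.workers.UvicornWorker
      --bind unix:/var/run/app/gunicorn.sock
    volumes:
      - appsock:/var/run/app
volumes:
  appsock:

On the nginx side, the upstream then points at the socket path instead of a host and port, e.g. server unix:/var/run/app/gunicorn.sock; inside the upstream block.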

Example 3: "combined"

In this example, we combine both nginx and app in a single container, and use local Unix sockets within the container to communicate.

The main difference here is that we add a process supervisor to run both the nginx and app runtimes – which some may consider an anti-pattern. I’m including it for the purpose of the experiment, mainly to uncover whether there’s a performance difference between a container-local volume and a shared volume.
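
One common choice for the supervisor is supervisord (whether this experiment used exactly that is an assumption); a minimal sketch of such a config:

; supervisord.conf sketch – run nginx and the app in the same container.
; The socket path and module path are illustrative assumptions.
[supervisord]
nodaemon=true

[program:app]
command=gunicorn app:app --worker-class uvicorn.workers.UvicornWorker --bind unix:/var/run/app/gunicorn.sock

[program:nginx]
command=nginx -g "daemon off;"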

This approach simulates what we’d expect in a single "server" scenario – where a traditional instance (hardware or virtual) runs multiple processes and all have some access to a local shared volume for inter-process communication (IPC).

To make this a fair comparison, I’ve also doubled the CPU and memory allocation.

Copilot ✈️

Time to get off the ground.

Copilot CLI assumes you already have an app prepared in a Dockerfile. The Quickstart has you clone a repo with a sample app – so instead I’ve created a Dockerfile for each of the architectures, along with a docker-compose.yml file for local orchestration of the components.

Then I’ll be able to launch and test each one in AWS with its own isolated set of resources – VPC, networking stack, and more.

I’m not going into all the details of how to install Copilot and launch the services; for that, read the Copilot CLI documentation (linked above) and the experiment code.

This test is using AWS Copilot CLI v1.13.0.
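
The overall flow is only a handful of commands; roughly (the environment and service names here are placeholders):

# Rough Copilot workflow – application, environment, then deploy each architecture.
copilot init                              # create the application and first service interactively
copilot env init --name test              # provision the VPC, subnets, and ECS cluster
copilot svc deploy --name tcp --env test  # build, push, and deploy one service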

Test Protocol 🔬

There’s an ever-growing list of tools and approaches to benchmark web request/response performance.

For the sake of time, I’ll use a single one here, to focus on the comparison of the server-side architecture performance.

All client-side requests will be performed from an AWS CloudShell instance running in the same AWS Region as the running services (us-east-1), to cut out as much network variability as possible. It’s not a perfect isolation of every variable, but it’ll have to do.

For a baseline, I also ran each test locally (see below).

Apache Bench

Apache Bench, or ab, is a common tool for testing web endpoints, and is not specific to Apache httpd servers. I’m using: Version 2.3 <$Revision: 1879490 $>

I chose single concurrency and ran 1,000 requests. I also pass the -l flag to accept variable-length responses, since the random number makes the body length vary and ab would otherwise count responses of differing lengths as failures.

ab -n 1000 -c 1 -l http://service-target....

Each test should take less than 5 seconds.

The important stats I’m comparing are:

  • Requests per second (mean) – higher is better
  • Time per request (mean) – lower is better
  • Duration at the 99th percentile – the time (in milliseconds) within which 99% of all requests completed – lower is better

To reduce variance, I also "warmed up" each container by first running the test with a larger number of requests.
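
For example, something along these lines (the warm-up request count is arbitrary):

ab -n 5000 -c 1 -l http://service-target....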

Local Test

To establish a baseline, I ran the same benchmark test against the local services, using Docker Desktop 4.3.2 (72729) on macOS. These numbers aren’t representative of a real user experience, but they provide a sense of performance before launching the architectures in the cloud.

arch                   reqs per sec (mean)   ms per req (mean)   99th pctile (ms)
tcp (local)            679.77                1.471               2
sharedvolume (local)   715.62                1.397               2
combined (local)       705.55                1.871               2

In the local benchmark, the clear loser is the tcp architecture, and sharedvolume has a slight edge over combined – but not a huge win. There’s no real difference in the 99th percentiles – requests are being served in under 2ms.

This shows that the combined architecture, even with its shared container resources, comes close to the performance of sharedvolume – possibly due to Docker Desktop’s bridging and network abstraction. A better comparison might be run on a native Linux machine.

Remote Test

Once I ran through the setup steps using Copilot CLI to create the environment and services, I performed the same ab test, and collected the results in this table:

arch                   reqs per sec (mean)   ms per req (mean)   99th pctile (ms)
tcp (aws)              447.57                2.234               5
sharedvolume (aws)     394.55                2.535               6
combined (aws)         428.60                2.333               4

With the remote tests, it was a minor surprise that the combined service performed better than the sharedvolume service, since it performed worse in the local test.

The bigger surprise was to find that the tcp architecture wins slightly over the socket-based architectures.

This could be due to the way ECS Fargate uses the Firecracker microVM, with a network stack tuned to outperform a shared socket on a volume when two containers communicate on the same host. The best part is that, as a consumer of a utility, I don’t care – as long as it’s performing well!

ARM/Graviton Remote Test

The Copilot manifest defaults to the Intel x86_64 platform, so let’s also test performance on the linux/arm64 platform (Graviton2, most likely).

For this to work, I had to rebuild the nginx sidecars manually, as Copilot doesn’t yet build & push sidecar images. I also had to update the manifest.yml to set the desired platform and deploy the service with copilot svc deploy .... (The combined version needed some Dockerfile surgery too…)
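
The manifest change itself is small – a sketch of the relevant line:

# manifest.yml – request the ARM64 (Graviton) platform for the service.
platform: linux/arm64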

Results:

arch                     reqs per sec (mean)   ms per req (mean)   99th pctile (ms)
tcp (aws/arm)            475.03                2.105               3
sharedvolume (aws/arm)   451.71                2.214               4
combined (aws/arm)       433.94                2.304               4

We can see that the stats improve almost across the board on the Graviton architecture, lending some more credibility to the benchmarks published in other posts and papers.

Aside: the linux/arm64-based container images were tens of megabytes smaller, so if image size and network pull time are concerns, they’ll also shave a bit of time off image pulls and container startup.

Other Testing Tools

If you’re interested in performing longer tests, or emulating different user types, check out some of these other benchmark tools I considered and didn’t use for this experiment:

  • Python – https://locust.io/ https://molotov.readthedocs.io/
  • JavaScript – https://k6.io/
  • Golang – https://github.com/rakyll/hey
  • C – https://github.com/wg/wrk

There are also plenty of vendors that have built out extensive load-testing platforms – I’m not covering any of them here. If you run a test with one of these, I’d definitely like to see your results!

Conclusions 💡

Using the Copilot CLI wasn’t without some missteps – the team is hard at work improving the documentation, and they’re pretty responsive in their GitHub Issues and Discussions, as well as their Gitter chat room, which is always helpful when learning a new framework. Once I had the basics down, being able to establish a reproducible stack was valuable to the experimentation process: I could provision and tear down the stack easily, and update it with changes relatively quickly.

Remember: these are micro-benchmarks, run on environments that aren’t highly tuned and against workloads that aren’t real-world. This test was designed for a very specific type of workload, and the results may change as more concurrency is introduced, CPU or memory saturation is reached, auto-scaling of application instances comes into play, and more.

Your mileage may vary.

When I started this experiment, I assumed the winner would be a socket-based communication architecture (sharedvolume or combined) – the existing literature pointed that way, and it made sense to me: eliminating the overhead of building TCP packets between the processes should improve performance.

However, in these benchmarks, I found that using the TCP communication architecture performs best, possibly due to optimizations beyond our view in the underlying stack. This is precisely what I want from an infrastructure vendor – for them to figure out how to optimize performance without having to re-architect an application to perform better in a given deployment scenario.

The main conclusion I’ve drawn is: Using TCP to communicate between containers is best, as it affords the most flexibility, follows established patterns, and performs slightly better than the alternatives in a real(ish) world scenario. And if you can, use Graviton2 (ARM) CPU architecture.

Go forth, test your own scenarios, and let me know what you come up with. (Don’t forget to delete your resources when done!! 💸 )
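
With Copilot, a single command tears down the application and everything it provisioned:

# Remove the Copilot application and all of its resources.
copilot app delete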