Fixing unintended consequences of the past

In the age of technology, everyone races forward to get the win. Anything that can provide you the competitive edge is considering important.
This is especially true in the realm of web media, where optimizing for page load times, providing secure transport, adhering to standards can make a difference in how a site is handled by client browsers, ranked by search engines, and most importantly how it is seen by viewers.

To this end, there are many sites, services and companies that will provide methods to audit a site and point out what could be problematic – count broken links, produce reports of actionable corrections, and more.
Some are better than others, and occasionally, you’ll come across something you’ve never seen before.

Recently, I was pinged about pages on a site that is hosted on an Amazon Simple Storage Service (S3) website-enabled bucket.
Since S3 is an object store only, this means that the pages in this site are statically generated and there is no associated web server, backend database, or other components to serve the pages.

This model is becoming more common for sites that can be simplified to run with no dynamic loading of data from a database, withstand heavy bursts of requests, as well as run cheaply (there’s even a free tier, beyond which pricing still remains affordable).

The idea is that you create your content in one format, run a compiler process to generate all the rendered files containing the links and content, and then upload the the compiled files to the S3 location to be requested by browsers. There are many guides on the web on how to do this – I’m not going to link to any now, search and ye shall find.

This particular site had been deployed since 2011 – and the mechanism to copy compiled files to S3 has been using the popular open source command line tool s3cmd – deployment basically looked like this (and still does!):

 s3cmd sync output/ s3://

where output/ contains the compiled files, ready for deployment.

This has worked very well for over 4 years – until it came to my attention that when uploading to S3, the s3cmd tool was adding some metadata to each file as it uploaded it, as part of the design to support website hosting on S3.

For instance, when uploading a .css file to S3, s3cmd attempts to determine extra details about the file, and set the correct metadata for browsers to understand, such as Content-Type: text/css.
This is a critical function, as it would be difficult to take the time to determine each file’s content type, set that manually, across many files.
You can read more about content media types on Wikipedia.

Since this project was set up a long time ago, the version of s3cmd used as still in alpha stage – and it was used because it performed well enough, and nothing broke, so we were happy to continue running the with same version since early 2013.

The problem reported to me was that many files on the site were returning an invalid Content-Encoding value, something that has been typically not a problem, as the client’s browser will send an Accept-Encoding header when making a request, typically something along the lines of:

Browser: Hi there! Can I have this resource, and I'll accept a response encoded in the following formats: a, b, or c
Server: No problem! Here's the resource you're looking for, with a content encoded in XYZZY

Now, the XYZZY in this example was being set by the s3cmd upload process, and it was determined to be a bug and fixed in late 2013, but since we never knew about the problem, and the site loads just fine, we never addressed it.
There have been even more stability fixes and releases of s3cmd since – as recently as February 2015.

The particular invalid encodings being set were UTF-8 and ANSI_X3.4-1968. While these are valid encodings for files, they are invalid values for the Content-Encoding field.

Here’s an example of how to show the headers of a particular remote file:

$ curl -sI | grep Content
Content-Encoding: ANSI_X3.4-1968
Content-Type: text/css
Content-Length: 7073

Many modern browsers will send something along the lines of ‘Accept-Encoding: gzip, deflate, sdch‘ in their request header, in hopes that the server can respond with one that matches, and then save on overall bytes sent over the wire, to speed up pages.

It’s the responsibility of the client (browser) to handle the response. I looked into the source code of Chromium (the basis for Google Chrome), and can see from here that in my example above, at Content-Encoding type of XYZZY will pretty much be ignored, which in this case, is fine, since we’re sending an invalid type.

So there’s no direct user impact, why should we care? Well, according to some popular ranking engines:

Using non-HTML content types for landing pages results in significantly reduced SEO ranking.

So all of this is fine, cool – update s3cmd tool to a newer version, and upload the output files again? Well, it’s not that simple.

Since during a sync operation, s3cmd determines what files might have changed, and only uploads the changed ones, it doesn’t reset the object metadata, as this is basically a new object, and the file itself hasn’t been changed.

One solution might be to edit every file, add an extra space somewhere – maybe an extra blank line at the end – then compile, deploy the changed files – however this might take too long.

Instead, I decided to solve the problem of iterating over every object in a bucket, and checking to see if it had the incorrect Content-Encoding set, and create a new copy of the file without the heading set.

This was pretty straightforward, once I understood the concept of object immutability – once written, you can’t change it, rather what feels like a change from a user interface actually creates a new version of the object with the new settings/metadata.

I also didn’t want to have to download each file locally and then upload it back to S3 – that it a slow operation, and could result in extra network traffic and disk space consumption.

Instead, I used the AWS SDK for Ruby gem, and came up with a short-and-sweet solution:

The code aims to be short and sweet, and sure enough, post-execution, we get the response without the offending header:

$ curl -sI | grep Content
Content-Type: text/css
Content-Length: 7073

This swift diagnosis and resolution would not have been possible had the tooling being used not been open source, as many times I was trying to figure out why something behaved the way it did, and while not being familiar with the code, I could reason enough about how things work in general to apply that reasoning on how I should implement my resolution.

Support open source where possible, and happy hunting!

Read more on the standards RFC2616.