Better Cache Busters (aka Asset Timestamps)

If you read any article on web performance, it will almost certainly mention Expires headers. They effectively tell the browser not to bother asking the server whether an asset (e.g. an image or stylesheet) has changed until the expiration time. Ideally they should be set for many years in the future so that client browsers aren’t spending tons of time waiting for 304 Not Modified responses. The problem with Expires headers is that when the file changes you need some way to signal the browser to fetch the updated file. The general trick is to change something in the URL, which makes the browser think it is a new file and hence fetch it from the server. This is often called a cache buster. That is why when you pop open Firebug you will often see files named puppies_3123141.png or styles.css?12345678.
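To make the two URL shapes above concrete, here is a minimal sketch of how a token might be attached to an asset path. The helper names (bust_with_query, bust_with_filename) are made up for illustration and are not part of Rails or any library.

```ruby
# Hypothetical helpers showing the two common cache-buster URL shapes.

# Appends the token as a query string: styles.css?12345678
def bust_with_query(path, token)
  "#{path}?#{token}"
end

# Embeds the token in the file name: puppies_3123141.png
def bust_with_filename(path, token)
  ext  = File.extname(path)   # e.g. ".png"
  base = path.chomp(ext)      # path without the extension
  "#{base}_#{token}#{ext}"
end
```

Either form works as long as the server maps the busted URL back to the real file; the query-string form needs no rewrite rules, which is probably why it is so common.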

Now that we know something needs to be added to the URL, the question becomes what should we add? There are three properties that we are really interested in:

1. The value should change when, and only when, the contents of the file change
2. The value should be consistent across different machines
3. It should be fast to compute

The first possibility is to take the approach used by Ruby on Rails, which is to use the modified time of the file. While this satisfies property 3, it does not work for 1 or 2. It breaks property 1 because appending an empty string to a file causes its modified time to update without the actual contents of the file changing. Property 2 is a much larger issue which many people have to face. Static assets are generally served from multiple machines, which means the modified time needs to be consistent across all of them. This is very difficult to achieve, particularly when the files are under version control, since a checkout typically stamps each file with the time it was written on that machine. While modified time isn’t a great solution, it does bring us to another possibility: version control ids.
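The property 1 failure is easy to reproduce. The sketch below is my own illustration (mtime_token and touch_demo are hypothetical names, not the Rails implementation): it bumps a file's modified time without touching its bytes and shows that the token changes anyway.

```ruby
require "tmpdir"

# Modified-time cache buster: seconds since the epoch, as a string.
def mtime_token(path)
  File.mtime(path).to_i.to_s
end

# Returns [token_before, token_after, contents_unchanged] for a file whose
# modified time is bumped while its bytes stay exactly the same.
def touch_demo
  Dir.mktmpdir do |dir|
    path = File.join(dir, "styles.css")
    File.write(path, "body { color: red; }\n")

    # Pin the timestamps so the demo does not depend on the clock.
    File.utime(Time.at(1_000_000), Time.at(1_000_000), path)
    before_token = mtime_token(path)
    before_bytes = File.binread(path)

    # Simulate a fresh checkout or a no-op write: same bytes, newer mtime.
    File.utime(Time.at(2_000_000), Time.at(2_000_000), path)
    [before_token, mtime_token(path), File.binread(path) == before_bytes]
  end
end
```

Two servers that checked out the same file a second apart would disagree in exactly the same way, which is the property 2 problem.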

One of the great things about version control systems is that in order to give you a reference to a particular commit, unique ids have to be generated for each one. Given that is the case, we can simply use the current commit id (or hash for you git’ers) as the cache buster. We are currently on commit X, I update a file and commit it, we are now on commit Y. Since all machines will be updating their files from version control, everyone should be on the same commit id. In terms of speed, it’s not necessarily the fastest (especially if you are using SVN), but it only needs to be retrieved once since it is the same for all files. The problem with commit ids is that they violate property 1. When a single file changes in the repo, every file’s id changes. That means that every time you deploy new code for your webapp, each client is going to fetch all the files again, even if all you changed was a README. Getting better, but still leaving something to be desired.
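As a rough sketch of the commit id approach, assuming a git checkout: the CommitToken class below is a hypothetical helper of mine, not anything Rails ships. It looks the id up once, caches it, and hands the same token to every asset, which is exactly why it fails property 1.

```ruby
# Hypothetical helper: one repository-wide token shared by every asset.
class CommitToken
  # An optional block can supply the id (handy for SVN or for testing);
  # by default we ask git for the current commit hash.
  def initialize(&lookup)
    @lookup = lookup || -> { `git rev-parse HEAD`.strip }
  end

  # The lookup runs at most once; later calls reuse the cached value.
  def value
    @value ||= @lookup.call
  end

  def url_for(path)
    "#{path}?#{value}"
  end
end
```

Note that url_for("styles.css") and url_for("puppies.png") always carry the same token, so a commit that only touches the README still changes both URLs.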

The last possibility I am going to talk about is an oldie but goodie, the MD5 hash. The cache buster of an asset is simply the MD5 hash of the asset itself. It satisfies property 1, and as an added bonus, if the file is ever changed and then rolled back, the MD5 hash will roll back with it (git reset anyone?). Property 2 is no problem: the contents of the file are the same across the different machines, hence so is the MD5. The only thing left is property 3, speed. Clearly computing the MD5 is going to be more time consuming than fetching the modified time. However, just about every language has a standard hash library written in C for computing MD5s, and it’s pretty darn fast. The only place I could see this being an issue is if you have very large files or an extremely large number of them. Even then, you can just write a deploy script that precomputes all the hashes beforehand.
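Here is a minimal sketch of the MD5 approach together with the precompute step a deploy script could run. The names md5_token and build_manifest are my own, and the manifest layout (relative path mapped to digest) is just one assumed way to store the results.

```ruby
require "digest"

# Content-based cache buster: the hex MD5 digest of the file's bytes.
def md5_token(path)
  Digest::MD5.file(path).hexdigest
end

# Precomputes a token for every file under asset_root, keyed by the path
# relative to asset_root, so nothing is hashed at request time.
def build_manifest(asset_root)
  manifest = {}
  Dir.glob("**/*", base: asset_root).sort.each do |relative|
    full = File.join(asset_root, relative)
    next unless File.file?(full)   # skip directories
    manifest[relative] = md5_token(full)
  end
  manifest
end
```

Because the token depends only on the bytes, every machine that deploys the same files builds the same manifest, and rolling a file back brings its old token back too.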

Overall, using MD5s as a cache buster gives you all of the nice properties you could want in a cache buster with very little drawback. I went ahead and wrote a monkey patch for Rails that changes the asset id method to use MD5s; the source code is available here (http://pastie.org/1164279). Enjoy busting caches.

2 comments:

Bryce Boe said...

"The problem with commit ids is that it violates property 1. When a single file changes in the repo, every files’ id changes."

It need not be that way: Using `svn info` on the individual files, you can obtain the revision number that the file was last changed at.

Kerrick Long said...

If you want to use version control for cache busting, at least with git, it's quite possible to do without violating property 1. You can get the commit hash for the last change to a single file by running the following, replacing the filename:

git log --format="%H" --max-count=1 README.md

So, using this technique, your modified cache busting source code is in this gist: https://gist.github.com/2839458

Thoughts?
