JEREMY WAGNER'S

WEB DEVELOPMENT BLOG

WITH OCCASIONAL RAMBLING DIATRIBES ABOUT OTHER STUFF.

Bust Caches Like a Pro

September 12, 2016

Hey, guys! I talk about this topic (and many others) in my upcoming book from Manning Publications. You can get 39% off with the code wagnerdz. I hope you like this post!

Browser caches are great. With a good cache-control policy, you can take advantage of the browser cache to seriously increase the speed of return visits to your site. On a multi-page site where many resources are shared across pages, a primed browser cache can reduce load times when users navigate to subsequent pages.

What sucks, though, is when you have to bust a cache for an asset that changes. Cache busting (or cache invalidation, as it is commonly known) is what we must do when we change an asset, but want to make sure the browser downloads the new version of it.

Let's say you have a CSS file named global.css, and the response headers for it look something like this:


HTTP/2 200
date: Tue, 06 Sep 2016 20:13:28 GMT
last-modified: Thu, 01 Sep 2016 18:33:18 GMT
content-type: text/css
cache-control: public, max-age=2592000

See that cache-control header at the bottom? In this example, it tells the browser two things:

  1. public says this resource can be cached by anyone, including any proxy services such as a CDN edge server.
  2. max-age=2592000 says this resource is good to keep for 2,592,000 seconds from the time it was first fetched (that's 30 days for you humans reading this).

This all works wonderfully until you have to make a change to that asset. Because what ends up happening is if a user who has previously visited your site returns with a primed cache, the browser looks at the cache-control policy for that asset. If the time specified in the max-age directive hasn't elapsed, the asset is still considered "fresh" as far as the browser is concerned. The browser doesn't know (or particularly care) that you have a newer version of it on the server. It will happily chug along with the outdated version in its cache until the expiration time has been reached.
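To make that freshness check concrete, here's a minimal sketch of roughly what the browser is deciding. The isFresh function and its parameters are made up for illustration, and real browsers account for more than this (the Age header, heuristics, and so on), so treat it as a simplified model rather than the actual algorithm.

```php
<?php
// Hypothetical helper: is a cached response still fresh under max-age?
// All three arguments are Unix timestamps or durations in seconds.
function isFresh(int $fetchedAt, int $maxAge, int $now): bool
{
    // Age of the cached copy in seconds.
    $age = $now - $fetchedAt;
    return $age < $maxAge;
}

$fetchedAt = strtotime("2016-09-06 20:13:28 UTC");
$maxAge = 2592000; // 30 days, from the cache-control header above

// Ten days later: still fresh, so the browser never asks the server.
var_dump(isFresh($fetchedAt, $maxAge, $fetchedAt + 10 * 86400)); // bool(true)

// Thirty-one days later: stale, so the browser goes back to the server.
var_dump(isFresh($fetchedAt, $maxAge, $fetchedAt + 31 * 86400)); // bool(false)
```

Notice that nothing in this check involves the server. That's the whole problem: until the asset goes stale, the server never gets a chance to say "hey, I have something newer."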

So that's no good, and there needs to be a solution. "Just reload the page!" you might hear someone say. Indeed, that does work. On a side note, tell that one to your client. They love hearing that one. For real. Try it and report back to me.

Hey, valued customer! How do you feel about telling your users to reload?

Anywho, you might have run into this problem, and you probably already know about the solution. In case you don't, the way we force the browser to grab a copy of a changed asset without regard to its cache-control policy (or depending on users to reload the page) is to version it using a query string. Below is an example of doing this in a CSS include via the <link> tag:


<link rel="stylesheet" type="text/css" href="css/global.css?v=1">

See that little query string in the href attribute? That part that says ?v=1? The browser makes a distinction between an asset URI of css/global.css and css/global.css?v=1. This is great, because as long as your site's HTML isn't being aggressively cached (which you shouldn't do for this very reason), you can update the reference to an asset with a query string. When you do this, the new asset will get picked up by visitors, even those who are returning with a primed cache containing an old version of it. Neat!

For those that loathe busywork

"Yeah, whatever. Thanks buddy. I get to muck around with query strings every time I change something on my site, now. A real prince you are, pal." I understand. I hate busywork, too. This is where a back end language like PHP comes in handy. By using the language's md5_file function, we can create an MD5 checksum of the asset we want to version, and use that as the value in the query string parameter. This works great because the checksum is generated based on the contents of the file, and not some other arbitrary aspect of it. If you push a new version of your site to production, and your assets never change, the checksum value stays the same. It only changes if the file contents do. This behavior is preferred to using something like a file's last modified time, where accidental modifications of a file in production can force a download of an asset that might not actually have changed at all. Let's look at an example. I use the following approach on this site, and it works well for my purposes. It starts off with an array of assets I want to version like so:


$versions = []; // Creates an empty array
$versions["global.css"] = cacheString("/css/global.css", $pathPrefix);
$versions["fonts-loaded.css"] = cacheString("/css/fonts-loaded.css", $pathPrefix);
$versions["debounce.js"] = cacheString("/js/debounce.js", $pathPrefix);
$versions["lazyload.js"] = cacheString("/js/lazyload.js", $pathPrefix);
$versions["nav.js"] = cacheString("/js/nav.js", $pathPrefix);
$versions["attach-nav.js"] = cacheString("/js/attach-nav.js", $pathPrefix);
$versions["load-fonts.js"] = cacheString("/js/load-fonts.js", $pathPrefix);

This code depends on a function called cacheString, which is shown below:


function cacheString($string, $pathPrefix){
    return substr(md5_file($pathPrefix . $string), 0, 8);
}

This little function uses substr to return a truncated version of the checksum returned from md5_file. This was a judgment call on my part, because MD5 checksums are 32 characters long. That seems like a rather excessive amount of data to just plop into a bunch of query strings. 8 seems like a reasonable string length. Might there be checksum collisions in a space this small? Maybe. It hasn't been a problem yet, and I rather doubt it would be. You can always expand the space a bit if this concerns you.
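For a rough sense of the odds, the arithmetic is easy to sketch. Keep in mind that a collision only hurts if a changed file happens to produce the same truncated checksum as its previous version, since that's the only case where the URL wouldn't change. The snippet below assumes MD5 output is spread more or less uniformly, and the variable names are just for illustration.

```php
<?php
// Each hex character encodes 4 bits, so 8 characters cover 16^8 values.
$length = 8;
$space = 16 ** $length;

echo $space, "\n"; // 4294967296

// Rough chance that a changed file lands on the same 8-character
// checksum as its previous version (assuming uniform output).
$chance = 1 / $space;
printf("%.2e\n", $chance);
```

That's about one in 4.3 billion per change. Bumping the length to 12 or 16 characters shrinks the odds further if that still bothers you.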

From there, it's a simple matter of appending the checksums to the asset references in my application HTML like so:


<link rel="stylesheet" href="/css/global.css?v=<?php echo($versions["global.css"]); ?>" type="text/css">

Pretty slick, if I do say so myself. Well, okay. I'm sure I wasn't the first person to think of this, but I did think of it on my own. So now I get to bug you about it in a stupid blog post like this.
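If typing out that echo for every asset gets old, the query string can be built by a small helper. This is only a sketch: assetUrl is a hypothetical function (it's not something I showed above), and it assumes the $versions array is keyed by file name like the one earlier in this post.

```php
<?php
// Hypothetical helper: build a versioned, HTML-safe asset URL.
// $versions maps file names to checksum strings.
function assetUrl(string $path, array $versions): string
{
    $key = basename($path);

    // Fall back to the bare path if the asset has no checksum.
    if (!isset($versions[$key])) {
        return htmlspecialchars($path, ENT_QUOTES);
    }

    return htmlspecialchars($path . "?v=" . rawurlencode($versions[$key]), ENT_QUOTES);
}

$versions = ["global.css" => "3f2a9c1e"]; // made-up checksum for the demo
echo assetUrl("/css/global.css", $versions), "\n"; // /css/global.css?v=3f2a9c1e
```

In a template, the href attribute would then contain something like <?php echo assetUrl("/css/global.css", $versions); ?> instead of the hand-assembled string.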

What if I just revalidate assets on the server instead?

What I assume you mean is that you're using a cache-control directive like no-cache, which, paradoxically, doesn't mean the asset isn't cached. It means the browser can store the asset, but has to check with the server (via some validation mechanism) that its copy is still current before using it. My answer to this? Sure. You can go that way. Revalidation achieves a performance increase when compared to no caching whatsoever. Personally, I feel it's worth the effort to version assets, which prevents unnecessary trips to the server for stuff you know hasn't changed.

Fewer requests are particularly important for HTTP/1 client/server interactions. Even though this site runs on HTTP/2 and requests are less expensive in that context, a segment of my users are on browsers that don't support HTTP/2. When these people come to my site, they end up communicating with my server over HTTP/1. So I want to ease their pain, so to speak.
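For comparison, here's roughly what the revalidation route looks like from the server's side, using an ETag as the validation mechanism. This is a simplified sketch: revalidationStatus is a made-up function, and it ignores weak validators and comma-separated lists of ETags, both of which a real implementation would have to handle.

```php
<?php
// Hypothetical helper: decide the status code for a conditional request.
// $ifNoneMatch is the raw If-None-Match request header, or null if absent.
function revalidationStatus(?string $ifNoneMatch, string $etag): int
{
    if ($ifNoneMatch !== null && trim($ifNoneMatch) === $etag) {
        return 304; // Not Modified: the browser reuses its cached copy.
    }

    return 200; // Send the full asset along with the current ETag.
}

$etag = '"' . md5("body { margin: 0; }") . '"';

echo revalidationStatus($etag, $etag), "\n";     // 304
echo revalidationStatus('"stale"', $etag), "\n"; // 200
echo revalidationStatus(null, $etag), "\n";      // 200
```

In a real script the header would come from $_SERVER["HTTP_IF_NONE_MATCH"]. Either way, even the 304 case costs a round trip to the server, which is exactly the trip that versioned URLs with a long max-age avoid.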

But my TTFBs!

Getting the checksum of multiple files on every single request regardless of whether any of them have changed can probably impact your site's time to first byte. So let's try to fix that with a little thing called tmpfs.

tmpfs is a nifty feature on Unix-like systems that's analogous to what you might call a RAM drive. You mount a storage volume that functions like a hard drive, but rather than using a segment of available hard drive space, you use a small amount of RAM. SSDs are pretty fast these days, but RAM is still faster. Plus, it takes time for md5_file to churn out checksums, so if we can cache these as files in a stupidly fast tmpfs volume, we can avoid generating checksums every time a page is accessed.

On my web server's file system, I went ahead and created a cache directory at /var/www like so:


mkdir -p /var/www/caches

Then in my /etc/fstab configuration where disk mounts are defined, I added this line:


tmpfs /var/www/caches tmpfs size=4m,uid=1001,gid=1001,mode=0755 0 0

For those of you uninitiated to how this works, we're mounting a tmpfs volume at /var/www/caches. We set a size value of 4m, which means we're allocating 4 megabytes to the volume. For the purposes of caching checksum strings, that's way more than what's needed, but it's good to have room to grow. The uid and gid values are the IDs for the user and group that own the volume. In this case, the user and group are the owner of the httpd server process specified in my server configuration.

mode is the chmod value that defines the permissions for the volume. In this case, I specified 0755, which means I want everybody to be able to read and execute contents on the volume, but I only want the httpd process owner to be able to read, execute and write to the tmpfs volume. The last two 0s tell the system to skip this volume for dump backups and for boot-time integrity (fsck) checks.

After adding this new volume, I'll run mount -a, which will mount all drives mentioned in /etc/fstab (at least in CentOS). Because this definition is in fstab, it'll get remounted if the system reboots.

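All that's left to do now is modify the cacheString function to read from/write to the tmpfs volume for checksum strings. Note that the function picks up a new first argument, $cacheKey, which names the cache file for each asset, so the earlier calls to cacheString need to pass it too: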


function cacheString($cacheKey, $string, $pathPrefix){
    $checksumCacheDir = "/var/www/caches/cache-keys/";
    $checksumCache = $checksumCacheDir . $cacheKey;

    // Check if the checksum string is in the cache.
    if(file_exists($checksumCache)){
        // Return the checksum string from the cache.
        return file_get_contents($checksumCache);
    }
    else{
        // Make checksum cache directory if it doesn't exist
        if(is_dir($checksumCacheDir) === false){
            mkdir($checksumCacheDir, 0755);
        }

        // Generate checksum if it doesn't exist, place it in the cache, and return the checksum string
        $checksum = substr(md5_file($pathPrefix . $string), 0, 8);
        file_put_contents($checksumCache, $checksum, LOCK_EX);
        return $checksum;
    }
}
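Since the function now takes a cache key, the $versions array has to be built with three arguments per call. One way to do that without repeating yourself is a loop, sketched below. Everything here is an assumption on my part for illustration: buildVersions is a hypothetical helper, it uses each asset's file name as the cache key, and the demo swaps in a stand-in hasher so it runs without touching the file system.

```php
<?php
// Hypothetical helper: build the $versions array from a list of asset paths.
// $hasher is any callable with the three-argument cacheString signature.
function buildVersions(array $assetPaths, string $pathPrefix, callable $hasher): array
{
    $versions = [];

    foreach ($assetPaths as $path) {
        // Assumption: the asset's file name doubles as its cache key.
        $key = basename($path);
        $versions[$key] = $hasher($key, $path, $pathPrefix);
    }

    return $versions;
}

// Stand-in hasher so this demo doesn't need real files on disk.
$fakeHasher = function (string $cacheKey, string $string, string $pathPrefix): string {
    return substr(md5($pathPrefix . $string), 0, 8);
};

$versions = buildVersions(["/css/global.css", "/js/nav.js"], "/var/www/html", $fakeHasher);
print_r(array_keys($versions)); // keys: global.css, nav.js
```

On a real site you'd pass "cacheString" as the third argument instead of the stand-in. One caveat of keying by file name: two assets with the same name in different directories would share a cache file, so use the full path as the key if that applies to you.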

Boom. My site's global assets are now automatically versioned. Whenever I'm ready to push updated assets to production on this site, I run a deploy script. Part of the deploy script empties the /var/www/caches/ directory, which allows new checksums to be generated. These caches stay primed until the next site update. Did I overdo it? Yeah, probably, but the good news is that I don't have to worry about versioning my assets ever again. I just write code, check it into my Git repository, and deploy.
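The cache-clearing part of a deploy can be tiny. Here's a sketch of one way to do it in PHP. To be clear, this isn't my actual deploy script: clearChecksumCache is a hypothetical function, and the demo runs against a throwaway temp directory rather than the real cache.

```php
<?php
// Hypothetical deploy step: delete the cached checksum files in a directory.
// Returns the number of files removed.
function clearChecksumCache(string $cacheDir): int
{
    $removed = 0;

    foreach (glob(rtrim($cacheDir, "/") . "/*") ?: [] as $file) {
        if (is_file($file) && unlink($file)) {
            $removed++;
        }
    }

    return $removed;
}

// Demo against a throwaway directory with two made-up checksum files.
$dir = sys_get_temp_dir() . "/cache-keys-demo-" . uniqid();
mkdir($dir);
file_put_contents($dir . "/global.css", "3f2a9c1e");
file_put_contents($dir . "/nav.js", "a41b77d0");

echo clearChecksumCache($dir), "\n"; // 2
rmdir($dir);
```

In production you'd point it at /var/www/caches/cache-keys/. The next request for each page then regenerates whatever checksums it needs.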

How do you invalidate caches on your site? Any ideas for improvements? Is my process awful? Chime in below!