
Joe Dog Software

Proudly serving the Internets since 1999

Improve Performance with Cache-Control Headers

Most web sites contain static elements that are shared by several pages. Each page on a template-driven site will likely contain common elements such as style sheets, JavaScript files, and images. As a browser parses HTML, it looks for the items required to construct the page. If the browser has to download the same elements again for every page, a lot of bandwidth is wasted in just a few short clicks.

The problem is compounded in periods of peak activity. Flash crowds keep web systems administrators awake at night. If a hypothetical browser is one of thousands grabbing a breaking news item or following a link from Slashdot, our administrator will want to conserve all possible bandwidth. For this reason, the HTTP protocol contains cache directives that can be used to limit bandwidth usage and improve web site performance.

Overview

When a browser downloads a file, it may store a copy locally. In other words, it may “cache” the document. If a stored element is required again, then the browser may simply pull a copy from its cache. This saves time and bandwidth. Your customer is happy and so is your boss. A browser may download documents once and use them repeatedly. Our tireless administrator might get a sound sleep after all. Of course, there is no free lunch. There are problems associated with this solution. For one thing, most humans can’t afford an infinite cache. For another, documents change. The browser overcomes the first limitation by expiring things from cache. It overcomes the second by checking with the server to see if its copy is up-to-date.

The HTTP protocol provides several directives to facilitate document caching. When a user clicks a link, the browser or its proxy may store elements for reuse. During this transaction, the web server may affix an expiration date to each document. This designates its “freshness” period. Just as you may safely eat refrigerated sour cream before its expiration date, a browser may use cached items without checking with the server as long as those items are still “fresh.” If a cached document is still available after its expiration, after it has become “stale,” then the browser must ask the web server whether or not it may still be used. This is known as “cache revalidation.”

The most frequently used mechanism to revalidate a cached document is the If-Modified-Since header. If the browser has a stale document in its cache, it may request that document on the condition that its version is no longer valid. In this situation, the web server has two options. It may tell the browser to use the copy from its cache, or it may deliver a new document. To illustrate this interaction between a browser and server, let’s examine an HTTP header exchange.

In the first scenario, the web server sends HTTP code 304 to indicate the cached copy is still valid. The browser may use the one it has. Here’s the transaction:

 GET /images/limey.gif HTTP/1.1
 Host: www.joedog.org:80
 Cookie:
 Accept: */*
 Accept-Encoding: *
 User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
 If-Modified-Since: Tue, 01 Jun 2004 05:10:39 GMT
 Connection: close

 HTTP/1.1 304 Not Modified
 Date: Thu, 09 Sep 2004 13:11:52 GMT
 Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
 Connection: close
 ETag: "f13b-51bd-40b21ef5"

In a second scenario, the server sends HTTP code 200 along with a new copy of the document:

 GET /images/limey.gif HTTP/1.1
 Host: www.joedog.org:80
 Cookie:
 Accept: */*
 Accept-Encoding: *
 User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
 If-Modified-Since: Mon, 24 May 2004 16:12:36 GMT
 Connection: close

 HTTP/1.1 200 OK
 Date: Thu, 09 Sep 2004 13:15:08 GMT
 Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
 Last-Modified: Mon, 24 May 2004 16:12:37 GMT
 ETag: "f13b-51bd-40b21ef5"
 Accept-Ranges: bytes
 Content-Length: 20925
 Connection: close
 Content-Type: image/gif

In both situations, our browser opened a connection to the web server. But in the first exchange, the only thing that crossed the wire was HTTP header information. In the second, the browser pulled down a 21K file. It doesn’t take much imagination to see how HTTP cache control can save bandwidth, especially on a web site with a lot of common templates.
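The server's side of that exchange can be sketched in a few lines of Python. This is a minimal illustration, not Apache's actual logic: the function name `conditional_status` is hypothetical, and it assumes the server already knows the file's modification time. The sample dates are taken from the two transactions above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's copy is still current, otherwise 200.

    if_modified_since: raw header value sent by the client, or None.
    last_modified: timezone-aware modification time of the file.
    """
    if if_modified_since is None:
        return 200  # unconditional request: always send the document
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the condition
    if client_time.tzinfo is None:
        # Assumption: treat a date without a usable zone as GMT.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


modified = datetime(2004, 5, 24, 16, 12, 37, tzinfo=timezone.utc)
print(conditional_status("Tue, 01 Jun 2004 05:10:39 GMT", modified))  # 304
print(conditional_status("Mon, 24 May 2004 16:12:36 GMT", modified))  # 200
```

In the second call the client's copy is one second older than the file on disk, so the condition fails and the full document is sent.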

As human beings, we understand our web site better than a browser or a web server. We know which elements will be modified and which ones will not. Some elements aren’t going to change between now and the end of time, e.g., spacer.gif. It would be nice if we could provide some input to influence cache control.

Fortunately, the HTTP protocol provides mechanisms that allow us to specify document expirations. The Cache-Control header can be used to set the maximum age, in seconds, of a cached document. This is the time that may elapse from when the response was generated until the cached copy can no longer be served without revalidation. For example, this directive tells the cache to expire the document two hours from now:

 Cache-Control: max-age=7200

The Expires header allows us to set an absolute expiration date. Once that moment has passed, the document is considered stale. Here is another example:

 Expires: Mon, 13 Sep 2004 16:00:00 GMT

Of the two, Cache-Control is preferable. The Expires directive depends on clock synchronization. The proliferation of appliance clocks that blink “12:00” should suggest a problem with that dependency.
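To make the relationship between the two headers concrete, here is a rough Python sketch of how a cache might work out a document's freshness period. The function name `freshness_lifetime` and the dictionary of lowercase header names are assumptions made for illustration. The sketch follows the HTTP/1.1 rule that max-age takes precedence over Expires when both are present.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers, response_time):
    """Return the freshness period in seconds, or None if none is given.

    headers: dict mapping lowercase header names to their values.
    response_time: timezone-aware time at which the response arrived.
    """
    cache_control = headers.get("cache-control", "")
    match = re.search(r"max-age=(\d+)", cache_control, re.IGNORECASE)
    if match:
        return int(match.group(1))  # max-age wins over Expires
    if "expires" in headers:
        expires = parsedate_to_datetime(headers["expires"])
        # A date in the past means the document is already stale.
        return max(0, int((expires - response_time).total_seconds()))
    return None


received = datetime(2004, 9, 13, 14, 0, 0, tzinfo=timezone.utc)
print(freshness_lifetime({"cache-control": "max-age=7200"}, received))
print(freshness_lifetime({"expires": "Mon, 13 Sep 2004 16:00:00 GMT"}, received))
```

Both calls describe the same two-hour window (7200 seconds). The difference is that the second result depends on the cache's clock agreeing with the server's.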

The Apache web server provides several mechanisms that allow us to explicitly set cache directives. The Expires module, a.k.a. mod_expires, is bundled with Apache 1.2 and higher. It allows the administrator to set the Expires header and the max-age directive of the Cache-Control header. Expirations may be set relative to either a file’s modification time or the time at which the client accessed it. We can configure documents to expire immediately or well into the distant future.

The module contains two directives for setting expirations. ExpiresDefault sets the expiration time for an entire server configuration, a virtual host or a directory. ExpiresByType allows you to set expirations by MIME type, e.g., expire the cache for every jpeg in /images/maps one hour from now. The syntax for both directives looks like this:

 ExpiresDefault "<base> [plus] {<num> <type>}*"
 ExpiresByType type/encoding "<base> [plus] {<num> <type>}*"

Now let’s consider the key components in the directives above: base, plus, num and type. <base> is a reference time. It may be set to either “now” or “modification.” “Now” refers to the access time and “modification” refers to the file’s modification time. The second component is an optional keyword. [plus] makes the configuration easier to understand; “now plus time” makes more sense than “now time.” The final two components are coupled together. <num> is an integer and <type> is its unit. For example, “now plus 1 day” expires twenty-four hours after access, whereas “now plus 0 seconds” expires immediately. The module supports the following types: years, months, weeks, days, hours, minutes and seconds.
If this seems confusing, don’t worry; it’s not. Let’s consider some sample configurations:

<Directory "/data/www/public_html/images/">
  <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresDefault "now plus 2 weeks"
  </IfModule>
</Directory>

In the preceding example, we introduced the ExpiresActive directive. It takes a single argument that is either “on” or “off.” This enables or disables generation of the Cache-Control and Expires headers. The configuration is applied at the directory level. We used the ExpiresDefault directive to expire everything in the directory two weeks after it’s accessed.

We could also expire items at different times and by different types with the ExpiresByType directive:

<Directory "/data/www/public_html/common/">
  <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png  "now plus 24 hours"
    ExpiresByType image/gif  "now plus 0 minutes"
    ExpiresByType text/css   "now plus 2 hours"
  </IfModule>
</Directory>
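To check what a given expiration string amounts to, the interval syntax described above can be sketched as a small Python parser. This is an illustration rather than the module's own code. The name `parse_expiry` is hypothetical. Treating a month as 30 days and a year as 365 days is an assumption, and so is accepting “access” as a synonym for “now.”

```python
# Assumed unit lengths: a month is taken as 30 days, a year as 365 days.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 604800,
    "month": 2592000,
    "year": 31536000,
}


def parse_expiry(spec):
    """Split '<base> [plus] {<num> <type>}*' into (base, seconds)."""
    tokens = spec.lower().split()
    if not tokens or tokens[0] not in ("now", "access", "modification"):
        raise ValueError("bad base in %r" % spec)
    base, rest = tokens[0], tokens[1:]
    if rest and rest[0] == "plus":
        rest = rest[1:]  # the keyword is optional and carries no meaning
    if len(rest) % 2 != 0:
        raise ValueError("expected <num> <type> pairs in %r" % spec)
    seconds = 0
    for num, unit in zip(rest[0::2], rest[1::2]):
        unit = unit.rstrip("s")  # accept both 'hour' and 'hours'
        if unit not in SECONDS_PER_UNIT:
            raise ValueError("unknown type %r" % unit)
        seconds += int(num) * SECONDS_PER_UNIT[unit]
    return base, seconds


print(parse_expiry("now plus 2 weeks"))    # ('now', 1209600)
print(parse_expiry("now plus 24 hours"))   # ('now', 86400)
print(parse_expiry("now plus 0 minutes"))  # ('now', 0)
```

The resulting number of seconds is the value we would expect to see in the max-age directive of the response.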

Apache provides another mechanism to send cache control instructions to the client. The Headers module, mod_headers, lets administrators customize HTTP response headers. If we can write our own headers, then why not write Cache-Control or Expires headers? The module contains two directives that allow us to customize the HTTP response. Header allows us to write headers on 1xx and 2xx responses and ErrorHeader allows us to write headers on 3xx, 4xx and 5xx responses. The syntax looks like this:

 Header set|append|add|unset <header> <value>
 ErrorHeader set|append|add|unset <header> <value>

For our purpose, set and unset are the most important arguments. The former sets the response header and replaces any existing ones with the same name. The latter removes the header; if several headers with the same name exist, then it unsets them all. Consider the following example:

<FilesMatch "\.gif$">
  <IfModule mod_headers.c>
    Header set Cache-control max-age=9200
  </IfModule>
</FilesMatch>

We can send multiple Cache-Control directives to the client in a single header. If we want the client to revalidate the document and not store it in cache, we can send this combination:

<FilesMatch "\.gif$">
  <IfModule mod_headers.c>
    Header set Cache-control "no-cache, no-store"
  </IfModule>
</FilesMatch>

In the example above, no-cache tells the client to revalidate the document before using it and no-store instructs it not to place the document in cache. Here are some other Cache-Control directives to consider: max-age=num sets the freshness period in seconds, and must-revalidate forbids the client from using a stale copy without first revalidating it. For a complete list of Cache-Control directives and their meanings, see RFC 2616.
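On the receiving end, a client has to split the header value back into its individual directives. The following Python sketch shows one simple way to do that. The function name `parse_cache_control` is hypothetical, and the sketch assumes no directive carries a quoted argument that itself contains a comma.

```python
def parse_cache_control(value):
    """Return a dict of directive names to arguments (None if absent)."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray or trailing commas
        name, sep, arg = part.partition("=")
        # Directive names are case-insensitive, so normalize them.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


print(parse_cache_control("no-cache, no-store"))
# {'no-cache': None, 'no-store': None}
print(parse_cache_control("max-age=9200, must-revalidate"))
# {'max-age': '9200', 'must-revalidate': None}
```

Arguments are left as strings; a caller that needs the freshness period would convert the max-age value with int().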

From the examples above, it’s obvious that we can set cache directives by manipulating HTTP response headers. While those modules make it easy to set cache controls for large portions of a web site, as long as we can write response headers, we can send those instructions. Consider this CGI script:

 #! /bin/sh
 echo Content-type: text/plain
 echo Cache-control: must-revalidate
 echo
 echo Hello, world.

We can also embed these directives in HTML 2.0 and higher:

<html>
  <head>
  <title>Hello, World</title>
    <meta http-equiv="Cache-control" content="must-revalidate">
  </head>
  <body><b>Hello, World</b></body>
</html>

See RFC 1866 for more information.

The techniques we’ve discussed provide a web systems administrator with a means to reduce latency, save bandwidth and decrease server load. Best of all, they require no out-of-pocket expenses. A log analysis tool can demonstrate improvement from one month to the next. The perfect time to present your boss with such reports is one month before raises are determined. Enjoy.