Most websites contain static elements shared across multiple pages. Each page on a template-driven site will likely contain common elements such as style sheets, JavaScript, and images. As a browser parses HTML, it looks for items required to construct the page. If each request requires the browser to repeatedly download the same elements, a lot of unnecessary bandwidth will be consumed in just a few short clicks.
The problem is compounded in periods of peak activity. Flash crowds keep web systems administrators awake at night. If a hypothetical browser is one of thousands grabbing a breaking news item or following a link from Slashdot, our intrepid administrator will want to conserve all possible bandwidth. For this reason, the HTTP protocol includes cache directives that can limit bandwidth usage and improve website performance.
Overview
When a browser downloads a file, it may store a copy locally. In other words, it may “cache” the document. If a stored element is needed again, the browser may simply retrieve a copy from its cache. This saves time and bandwidth. Your customer is happy, and so is your boss. A browser may download documents once and use them repeatedly. Our tireless administrator might get a sound sleep after all. Of course, there is no free lunch. There are problems associated with this solution.
For one thing, most humans can't afford an infinite cache. For another, documents change. The browser overcomes the first limitation by expiring cached items. It overcomes the second by checking the server to see if its copy is up to date.
The HTTP protocol provides several directives to facilitate document caching. When a user clicks a link, the browser or its proxy may store elements for reuse. During this transaction, the Web server may affix an expiration date to each document. This designates its “freshness” period. Just as you may safely refrigerate sour cream before its expiration date, a browser may use cached items without checking with the server as long as those items are still “fresh.” If a cached document is available after its expiration, after it has become “stale,” then the browser must ask the web server whether or not it may still be used. This is known as “cache revalidation.”
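The freshness decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular browser implements it; the function names and the two-hour freshness period are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(stored_at: datetime, max_age_seconds: int, now: datetime) -> bool:
    """Return True while the cached copy is younger than its freshness period."""
    age = now - stored_at
    return age < timedelta(seconds=max_age_seconds)


def next_action(stored_at: datetime, max_age_seconds: int, now: datetime) -> str:
    """Decide what a cache does when a stored document is requested again."""
    if is_fresh(stored_at, max_age_seconds, now):
        return "serve from cache"
    # The copy is stale: ask the server whether it may still be used.
    return "revalidate with server"


# Hypothetical document cached at 13:00 GMT with a two-hour freshness period.
stored = datetime(2004, 9, 9, 13, 0, 0, tzinfo=timezone.utc)
print(next_action(stored, 7200, stored + timedelta(hours=1)))  # serve from cache
print(next_action(stored, 7200, stored + timedelta(hours=3)))  # revalidate with server
```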
The most commonly used mechanism for revalidating a cached document is the If-Modified-Since header. If the browser has a stale document in its cache, it may request that document on the condition that its version is no longer valid. In this situation, the web server has two options. It may instruct the browser to use the cached copy, or it may deliver a new document. To illustrate this interaction between a browser and server, let’s examine an HTTP header exchange.
In the first scenario, the web server returns HTTP 304 to indicate that the cached copy is still valid. The browser may use the one it has. Here's the transaction:
GET /images/limey.gif HTTP/1.1
Host: www.joedog.org:80
Cookie:
Accept: */*
Accept-Encoding: *
User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
If-Modified-Since: Tue, 01 Jun 2004 05:10:39 GMT
Connection: close

HTTP/1.1 304 Not Modified
Date: Thu, 09 Sep 2004 13:11:52 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
Connection: close
ETag: "f13b-51bd-40b21ef5"
In a second scenario, the server sends HTTP code 200 along with a new copy of the document:
GET /images/limey.gif HTTP/1.1
Host: www.joedog.org:80
Cookie:
Accept: */*
Accept-Encoding: *
User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
If-Modified-Since: Mon, 24 May 2004 16:12:36 GMT
Connection: close

HTTP/1.1 200 OK
Date: Thu, 09 Sep 2004 13:15:08 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
Last-Modified: Mon, 24 May 2004 16:12:37 GMT
ETag: "f13b-51bd-40b21ef5"
Accept-Ranges: bytes
Content-Length: 20925
Connection: close
Content-Type: image/gif
In both situations, our browser opened a connection to the web server. But in the first request, the only information exchanged was HTTP headers. In the second case, the browser pulled down a 21K file. It doesn’t take much imagination to see how HTTP cache control can save bandwidth, especially on a website with many standard templates.
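The server's side of that decision can be sketched as a date comparison. The following is a simplified illustration, not Apache's actual logic: the function name is hypothetical, and it assumes the file's Last-Modified time is the one shown in the 200 response for both scenarios.

```python
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since: str, last_modified: str) -> int:
    """Return 200 if the file changed after the client's date, else 304."""
    client_time = parsedate_to_datetime(if_modified_since)
    file_time = parsedate_to_datetime(last_modified)
    return 200 if file_time > client_time else 304


# Assumption: limey.gif was last modified at the time shown in the 200 response.
LAST_MODIFIED = "Mon, 24 May 2004 16:12:37 GMT"

# First scenario: the client's copy is newer than the file, so headers only.
print(conditional_status("Tue, 01 Jun 2004 05:10:39 GMT", LAST_MODIFIED))  # 304

# Second scenario: the file changed one second after the client's date.
print(conditional_status("Mon, 24 May 2004 16:12:36 GMT", LAST_MODIFIED))  # 200
```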
As human beings, we understand our website better than a browser or a web server. We know which elements will be modified and which ones will not. Some elements won’t change between now and the end of time, e.g., spacer.gif. It would be nice if we could provide some input to influence cache control.
Fortunately, the HTTP protocol provides mechanisms to specify document expiration. The Cache-Control header can be used to set the maximum age of a cached document, in seconds. This is the elapsed time from generation until it can no longer be served without revalidation. For example, this directive tells the cache to expire the document two hours from now: Cache-Control: max-age=7200
The Expires header allows us to set an absolute expiration date. Once that moment has passed, the document is considered stale. Here is another example: Expires: Mon, 13 Sep 2004 16:00:00 GMT. Of the two, Cache-Control is preferable. The Expires directive depends on clock synchronization. The proliferation of appliance clocks that blink “12:00” should suggest a problem with that dependency.
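To see how the two headers relate, here is a small Python sketch that derives both from the same lifetime. The function name and the starting time are assumptions chosen for illustration. Two hours is 7200 seconds, and the Expires value is simply the current time plus that lifetime, formatted as an HTTP date.

```python
import calendar
from email.utils import formatdate


def expiration_headers(now_epoch: float, lifetime_seconds: int) -> dict:
    """Build equivalent Cache-Control and Expires headers for one lifetime."""
    return {
        # Relative: seconds of freshness, independent of the client's clock.
        "Cache-Control": f"max-age={lifetime_seconds}",
        # Absolute: only meaningful if client and server clocks agree.
        "Expires": formatdate(now_epoch + lifetime_seconds, usegmt=True),
    }


# Hypothetical "now": 13 Sep 2004, 14:00:00 GMT.
now = calendar.timegm((2004, 9, 13, 14, 0, 0))
for name, value in expiration_headers(now, 2 * 60 * 60).items():
    print(f"{name}: {value}")
# Cache-Control: max-age=7200
# Expires: Mon, 13 Sep 2004 16:00:00 GMT
```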
The Apache web server provides several mechanisms for explicitly setting cache directives. The Expires module, also known as mod_expires, is bundled with Apache 1.2 and later. It allows the administrator to set the Cache-Control and Expires HTTP headers. Expirations may be set according to either a file’s modification time or the last time the client accessed it. We can configure documents to expire immediately or well into the distant future.
The module contains two directives for setting expirations. ExpiresDefault sets the expiration time for an entire server configuration, a virtual host, or a directory. ExpiresByType allows you to set expiration by MIME type, e.g., expire the cache for every JPEG in /images/maps one hour from now. The syntax for both directives looks like this:
ExpiresDefault "<base> [plus] {<num> <type>}*"
ExpiresByType type/encoding "<base> [plus] {<num> <type>}*"
Now let’s consider the key components in the directives above: base, plus, num, and type. <base> is a reference time. It may be set to either “now” or “modification.” “Now” refers to the access time, and “modification” refers to the file’s modification time. The second component is an optional keyword. [plus] makes the configuration easier to understand. “Now plus time” makes more sense than “now time.” The final two components are coupled together. <num> is an integer and <type> describes it. For example, “now plus 1 day” expires twenty-four hours after access, whereas “now plus 0 seconds” expires immediately. The module supports the following types: years, months, weeks, days, hours, minutes, and seconds.
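As a rough illustration of how such an expression breaks down, the sketch below parses the string into a base and a number of seconds. It is not mod_expires code. The function name is hypothetical, and the lengths used for months (30 days) and years (365 days) are assumptions made for the example.

```python
# Seconds per unit. Month and year lengths are assumptions for this sketch.
UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 60 * 60,
    "day": 24 * 60 * 60,
    "week": 7 * 24 * 60 * 60,
    "month": 30 * 24 * 60 * 60,
    "year": 365 * 24 * 60 * 60,
}


def parse_expires(spec: str) -> tuple:
    """Split '<base> [plus] {<num> <type>}*' into (base, total_seconds)."""
    tokens = spec.lower().split()
    if not tokens or tokens[0] not in ("now", "modification"):
        raise ValueError(f"unknown base in {spec!r}")
    base, rest = tokens[0], tokens[1:]
    if rest and rest[0] == "plus":  # the keyword is optional
        rest = rest[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"expected <num> <type> pairs in {spec!r}")
    total = 0
    for num, unit in zip(rest[0::2], rest[1::2]):
        unit = unit.rstrip("s")  # accept both "hour" and "hours"
        if unit not in UNIT_SECONDS:
            raise ValueError(f"unknown type {unit!r} in {spec!r}")
        total += int(num) * UNIT_SECONDS[unit]
    return base, total


print(parse_expires("now plus 1 day"))                        # ('now', 86400)
print(parse_expires("now plus 0 seconds"))                    # ('now', 0)
print(parse_expires("modification plus 2 hours 30 minutes"))  # ('modification', 9000)
```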
If this seems confusing, don't worry; it's simpler than it looks. Let’s consider some sample configurations:
<Directory "/data/www/public_html/images/">
  <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresDefault "now plus 2 weeks"
  </IfModule>
</Directory>
In the preceding example, we introduced the ExpiresActive directive. It takes a single argument that is either “on” or “off.” This enables or disables the Cache-Control/Expires header. The configuration is applied at the directory level. We used the ExpiresDefault directive to expire everything in the directory two weeks after it’s accessed.
We could also expire items at different times and by different types with the ExpiresByType directive:
<Directory "/data/www/public_html/common/">
  <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png "now plus 24 hours"
    ExpiresByType image/gif "now plus 0 minutes"
    ExpiresByType text/css "now plus 2 hours"
  </IfModule>
</Directory>
Apache provides another mechanism for sending cache-control directives to the client. The Headers module, mod_headers, lets administrators customize HTTP response headers. If we can write our own headers, then why not write Cache-Control or Expires headers? The module contains two directives that allow us to customize the HTTP response. Header lets us write headers on 1xx and 2xx responses, and ErrorHeader lets us write headers on 3xx, 4xx, and 5xx responses. The syntax looks like this:
Header set|append|add|unset <header> <value>
ErrorHeader set|append|add|unset <header> <value>
For our purpose, set and unset are the most important arguments. The former sets the response header and replaces any existing ones with the same name. The latter removes the header; if several headers with the same name exist, then it unsets them all. Consider the following example:
<FilesMatch "\.gif$">
  <IfModule mod_headers.c>
    Header set Cache-Control max-age=9200
  </IfModule>
</FilesMatch>
We can send multiple Cache-Control directives to the client in a single header. If we want the client to revalidate the document and not store it in cache, we can send this combination:
<FilesMatch "\.gif$">
  <IfModule mod_headers.c>
    Header set Cache-Control "no-cache, no-store"
  </IfModule>
</FilesMatch>
In the example above, no-cache tells the client to revalidate the document, and no-store instructs it not to place the document in cache. Here are some other Cache-Control directives to consider: max-age=num sets the freshness period in seconds, and must-revalidate requires the cache to revalidate a stale document with the server before reusing it. For a complete list of Cache-Control directives and their meanings, see RFC 2616.
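To make the structure of the header concrete, here is a minimal Python sketch that splits a Cache-Control value into its directives. The function name is hypothetical. The parsing is deliberately naive: it splits on commas and would mishandle a quoted argument that itself contains a comma.

```python
def parse_cache_control(value: str) -> dict:
    """Map each directive name to its argument, or True if it has none."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Directive names are case-insensitive, so normalize to lowercase.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else True
    return directives


print(parse_cache_control("no-cache, no-store"))
# {'no-cache': True, 'no-store': True}
print(parse_cache_control("max-age=9200, must-revalidate"))
# {'max-age': '9200', 'must-revalidate': True}
```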
From the examples above, it’s evident that we can set cache directives by manipulating HTTP response headers. The modules make it easy to set cache controls for large portions of a website, but anything that can write response headers can send the same instructions. Consider this CGI script:
#!/bin/sh
echo Content-type: text/plain
echo Cache-control: must-revalidate
echo
echo Hello, world.
We can also embed these directives in HTML 2.0 and higher:
<html>
  <head>
    <title>Hello, World</title>
    <meta http-equiv="Cache-control" content="must-revalidate">
  </head>
  <body><b>Hello, World</b></body>
</html>
See RFC 1866 for more information.
The techniques we’ve discussed enable a web systems administrator to reduce latency, save bandwidth, and decrease server load. Best of all, they require no out-of-pocket expenses. A log analysis tool can demonstrate improvement from one month to the next. The perfect time to present your boss with such reports is one month before raises are determined. Enjoy.