
Joe Dog Software

Proudly serving the Internets since 1999

Fido 1.1.4 And Google’s Last Crawl

Your JoeDog can now announce the release of fido-1.1.4. To illustrate its awesome new feature we’re going to do an exercise.

“Wait. Homework? This blog sucks!”

Yeah, but this homework is useful. We’re going to capture the time of Google’s last crawl so you can be a hero with the SEO nerds in your company.

“Okay, fine!”

We can identify the googlebot by its User-agent. Google conveniently refers to itself as Googlebot. Here’s an entry in Your JoeDog’s logs:

208.78.85.241 - - [23/Nov/2014:06:16:24 -0500] "GET /blog/ HTTP/1.1" 
   200 57787 "-" "Googlebot/2.X (+http://www.googlebot.com/bot.html)"

Unfortunately, anybody can masquerade as a googlebot. The only way we can be certain this agent is authentic is to check its IP address.

Pom $ dig -x 208.78.85.241
;; ANSWER SECTION:
241.85.78.208.in-addr.arpa. 1316 IN PTR host241.subnet-208-78-85.gigavenue.com.

Wait a second! That’s not Google, that’s a fraud. Let’s check another entry:

Pom $ dig -x 66.249.65.47
;; ANSWER SECTION:
47.65.249.66.in-addr.arpa. 11310 IN PTR crawl-66-249-65-47.googlebot.com.

Okay, that’s Google. So in order to validate and record the time of Google’s last crawl, we have to check the IP address. How do we achieve this?
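Before we get to fido, a quick aside on what dig -x is actually asking. It reverses the octets of the address, appends in-addr.arpa, and requests the PTR record for that name, which is why the ANSWER SECTION lines above look the way they do. Here is a minimal Python sketch of that name construction; the helper name ptr_name is made up for illustration, and the sketch only builds the query name without touching the network.

```python
import ipaddress


def ptr_name(addr: str) -> str:
    """Return the DNS name a reverse lookup (dig -x) queries for addr.

    ptr_name is a hypothetical helper; it performs no DNS lookup itself.
    """
    return ipaddress.ip_address(addr).reverse_pointer


if __name__ == "__main__":
    # The two addresses checked with dig in this post
    for addr in ("208.78.85.241", "66.249.65.47"):
        print(addr, "->", ptr_name(addr))
```

The PTR record returned for that name is the hostname we need to inspect.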

We’ll use fido to watch our logs for instances of Googlebot, but fido itself can’t validate the IP address. Our action program can do that, but how do we pass it the address? Prior to fido-1.1.4, that would have been impossible. Starting with 1.1.4, fido can capture parts of a regex match and pass them to the action program as variables.

To set this up, you’ll need a file block in fido.conf which points to your access_log.

/var/log/httpd/joedog-access_log {
 rules = ^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*Googlebot
 action = /home/jeff/bin/googler $1
}
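If you want to try the capture outside of fido first, here is a small Python sketch. It assumes Python's re module behaves like fido's regex engine for this pattern, which may not hold in every detail. The pattern is the one from the rules line, written with the dots escaped and with the Googlebot spelling from the log entry shown earlier. The function name capture_address is made up for illustration.

```python
import re
from typing import Optional

# Same idea as the rules line in fido.conf: capture a dotted-quad address at
# the start of the line, then require "Googlebot" somewhere after it.
# Assumption: Python's re stands in for whatever regex engine fido uses.
RULE = re.compile(r"^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*Googlebot")

# The log entry from earlier in this post, joined onto one line
LOG_LINE = (
    '208.78.85.241 - - [23/Nov/2014:06:16:24 -0500] "GET /blog/ HTTP/1.1" '
    '200 57787 "-" "Googlebot/2.X (+http://www.googlebot.com/bot.html)"'
)


def capture_address(line: str) -> Optional[str]:
    """Return the captured address (fido's $1), or None when the rule misses."""
    match = RULE.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(capture_address(LOG_LINE))
```

The value this returns is what the action program receives as its first argument.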

When fido locates a match, it will capture everything inside the parentheses and send that to the googler script as $1. Here’s the googler script:

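#!/usr/bin/perl
use strict;
use warnings;
use Socket;
use vars qw($LOCK_EX $LOCK_UN);
$LOCK_EX = 2;
$LOCK_UN = 8;
# fido passes the captured IP address as the first argument
my $addr = $ARGV[0];
# reverse lookup; either step can fail, so guard against undef
my $packed = empty($addr) ? undef : inet_aton($addr);
my $host = defined $packed ? gethostbyaddr($packed, AF_INET) : undef;
$host = '' if empty($host);
# the dots are escaped and the match is anchored, so the
# hostname has to END in .googlebot.com
if ($host !~ /\.googlebot\.com$/) {
  print "ERROR: Forged User-agent ($host)\n";
  exit;
}
my $file = "/path/to/joedog.org/google.txt";
open (FILE, ">>$file") or die "Unable to open file: $file\n";
# hold an exclusive lock while we append the entry
flock(FILE, $LOCK_EX);
print FILE timestamp()." | $addr\n";
flock(FILE, $LOCK_UN);
close(FILE);
exit;
# returns a string in the following format:
# MM/DD/YYYY HH:MM:SS | YYYYMMDDHHMMSS
sub timestamp {
  my $now = time;
  my @date = localtime $now;
  $date[5] += 1900;  # localtime years count from 1900
  $date[4] += 1;     # localtime months count from 0
  my $stamp = sprintf(
    "%02d/%02d/%04d %02d:%02d:%02d",
    $date[4], $date[3], $date[5], $date[2], $date[1], $date[0]
  );
  $stamp .= " | ";
  $stamp .= sprintf(
    "%04d%02d%02d%02d%02d%02d",
    $date[5], $date[4], $date[3], $date[2], $date[1], $date[0]
  );
  return $stamp;
}
# true if the argument is undefined or a zero-length string
sub empty { ! defined $_[0] || ! length $_[0] }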

As you can imagine, there are lots of creative things you can do with the googler script. Your JoeDog hopes to compare crawl frequencies against the site’s freshness to see if there’s a correlation.
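As a starting point for that kind of comparison, here is a rough Python sketch that reads lines in the format the googler script appends to google.txt and works out the gaps between crawls. It only parses the first field (MM/DD/YYYY HH:MM:SS). The function names and the SAMPLE lines are made up for illustration; they are not real crawl data.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def crawl_times(lines: Iterable[str]) -> List[datetime]:
    """Parse the leading MM/DD/YYYY HH:MM:SS field of each google.txt line."""
    times = []
    for line in lines:
        first_field = line.split("|")[0].strip()
        if first_field:  # skip blank lines
            times.append(datetime.strptime(first_field, "%m/%d/%Y %H:%M:%S"))
    return sorted(times)


def crawl_gaps(times: List[datetime]) -> List[timedelta]:
    """Return the time elapsed between each pair of consecutive crawls."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Invented sample lines in the googler output format (not real data)
SAMPLE = [
    "11/23/2014 06:00:00 | 20141123060000 | 66.249.65.47",
    "11/23/2014 12:00:00 | 20141123120000 | 66.249.65.47",
    "11/24/2014 00:00:00 | 20141124000000 | 66.249.65.47",
]

if __name__ == "__main__":
    for gap in crawl_gaps(crawl_times(SAMPLE)):
        print(gap)
```

To run it against real entries, pass an open file handle for google.txt to crawl_times instead of SAMPLE.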

[JoeDog: Last Google Crawl]

UPDATE:  As Tim notes in the comments below, the Internets are full of jerks.

Consider this hostname: haha.googlebot.com.fooledyou.ru

The regex has been changed so that the fully qualified hostname must end in .googlebot.com
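To see why the anchoring matters, here is a small Python sketch of the same suffix test, run against the hostnames that appear in this post. The function name is_googlebot_host is made up for illustration, and this is a sketch of the idea rather than the googler script itself.

```python
from typing import Optional


def is_googlebot_host(host: Optional[str]) -> bool:
    """True only when the hostname ends in .googlebot.com.

    A failed reverse lookup (None or an empty string) counts as not Google.
    """
    if not host:
        return False
    # Ignore case and a trailing root dot, then require the exact suffix
    return host.lower().rstrip(".").endswith(".googlebot.com")


if __name__ == "__main__":
    for name in (
        "crawl-66-249-65-47.googlebot.com",
        "host241.subnet-208-78-85.gigavenue.com",
        "haha.googlebot.com.fooledyou.ru",
    ):
        print(name, is_googlebot_host(name))
```

Because the test looks only at the end of the name, a jerk who merely embeds googlebot.com somewhere in the middle of a hostname gets rejected.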