Detecting Bots and Spiders with Plack Middleware


If you analyze the requests made to your web site, you have to deal with enormous numbers of bots, spiders, and other automated requests for your resources which don't represent measurable users. As promised in Annotating User Events for Cohort Analysis, here's how I handle them.

I wrote a tiny piece of Plack middleware, enabled in the .psgi file which bundles my application:

package MyApp::Plack::Middleware::BotDetector;
# ABSTRACT: Plack middleware to identify bots and spiders

use Modern::Perl;
use Plack::Request;
use Regexp::Assemble;

use parent 'Plack::Middleware';

my $bot_regex = make_bot_regex();

sub call
{
    my ($self, $env) = @_;
    my $req          = Plack::Request->new( $env );
    my $user_agent   = $req->user_agent;

    if ($user_agent)
    {
        # flag the request for anything downstream to see
        $env->{'BotDetector.looks-like-bot'} = 1 if $user_agent =~ $bot_regex;
    }

    return $self->app->( $env );
}

sub make_bot_regex
{
    my $ra = Regexp::Assemble->new;
    while (<DATA>)
    {
        chomp;
        next unless /\S/;   # guard against stray blank lines in the list
        $ra->add( '\b' . quotemeta( $_ ) . '\b' );
    }

    return $ra->re;
}

1;
__DATA__
Baiduspider
Googlebot
YandexBot
AdsBot-Google
AdsBot-Google-Mobile
bingbot
facebookexternalhit
libwww-perl
aiHitBot
Baiduspider+
aiHitBot-BP
NetcraftSurveyAgent
Google-Site-Verification
W3C_Validator
ia_archiver
Nessus
UnwindFetchor
Butterfly
Netcraft Web Server Survey
Twitterbot
PaperLiBot
Add Catalog
1PasswordThumbs
MJ12bot
SmartLinksAddon
YahooCacheSystem
TweetmemeBot
CJNetworkQuality
YandexImages
StatusNet
Untiny
Feedfetcher-Google
DCPbot
AppEngine-Google
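
Enabling the middleware from the .psgi file takes only a couple of lines with Plack::Builder. A minimal sketch, assuming the application exposes a psgi_app() method (the leading + tells Plack::Builder to use the class name as-is rather than prepending Plack::Middleware::):

# app.psgi
use Plack::Builder;
use MyApp;

builder
{
    enable '+MyApp::Plack::Middleware::BotDetector';
    MyApp->psgi_app;
};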

Plack middleware wraps around the application to examine and possibly modify the incoming request, to call the application (or the next piece of middleware), and to examine and possibly modify the outgoing response. Plack conforms to the PSGI specification to make this possible.
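
A PSGI application itself is nothing more than a code reference which takes the environment hash reference and returns a three-element array reference of status, headers, and body:

my $app = sub
{
    my $env = shift;
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ 'Hello!' ] ];
};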

Update: This middleware is now available as Plack::Middleware::BotDetector from the CPAN. Thanks to Big Blue Marble and Trendshare for sponsoring its development and release.

All of that means that any piece of middleware gets activated by something which calls its call() method, passing in the incoming request's environment as the first parameter. This environment is a hash reference with keys specified by PSGI. The application, or at least the next piece of middleware to call, is available from the object's accessor method app().

(I'm lazy. I use Plack::Request to turn $env into an object. This is not necessary.)
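
Skipping the object wrapper, the same lookup reads the raw PSGI environment directly:

# HTTP_USER_AGENT is the CGI-style key PSGI uses for the User-Agent header
my $user_agent = $env->{HTTP_USER_AGENT};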

The rest of the code is really simple. I have a list of unique segments of the user agent strings I've seen in this application. I use Regexp::Assemble to turn these words into a single (efficient) regex. If the incoming request's user agent string matches anything in the regex, I add a new entry to the environment hash.
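
A small demonstration of what Regexp::Assemble produces (the pattern in the comment is only illustrative; its exact form is an internal detail of the module):

use Modern::Perl;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( '\b' . quotemeta( $_ ) . '\b' ) for qw( Googlebot bingbot YandexBot );

# a single combined alternation, something like \b(?:(?:Google|bing)bot|YandexBot)\b
my $re = $ra->re;

say 'looks like a bot' if 'Mozilla/5.0 (compatible; Googlebot/2.1)' =~ $re;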

With that in place, any other piece of middleware executed after this point in the request (or the application itself) can examine the environment and choose different behavior based on the bot-looking-ness of any request. My cohort event logger method looks like:

=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event)  = @_;
    return if $self->request->env->{'BotDetector.looks-like-bot'};
    $event{usertoken} ||= $self->sessionid || 'unknownuser';

    push @{ $self->cohort_events }, \%event;
}

That single guard clause (the return if line) is all it took in my application to stop logging cohort events for spiders. If and when I see a new spider in the logs, I can exclude it by adding a line to the middleware's DATA section and restarting the server.

(You might rather store this information in a database, but I'd rather build the regex once than loop through a database with a LIKE query. I haven't found an ideal alternate solution, which is why I haven't put this on the CPAN. Perhaps this is two modules, one for the middleware and one which exports a regex to identify spider user agents.)
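
That hypothetical split might look something like this (the module name and interface are invented for illustration), with the middleware importing bot_regex() instead of assembling its own:

package Regexp::BotUserAgents;
# ABSTRACT: exports a regex to identify spider user agent strings (hypothetical)

use Modern::Perl;
use Regexp::Assemble;
use Exporter 'import';

our @EXPORT_OK = qw( bot_regex );

my $bot_regex = do
{
    my $ra = Regexp::Assemble->new;

    while (my $fragment = <DATA>)
    {
        chomp $fragment;
        next unless $fragment =~ /\S/;
        $ra->add( '\b' . quotemeta( $fragment ) . '\b' );
    }

    $ra->re;
};

sub bot_regex { $bot_regex }

1;

__DATA__
Googlebot
bingbot
YandexBot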

There's one more trick to this cohort event logging: traceability. That's the topic for next time.

7 Comments

My takeaway is that there is probably a need for an "Acme::RobotsRegex" module on the CPAN? Perhaps better named.

I am also sitting here thinking about how interesting it would be to do monitoring functions via Plack plugins, like latency monitoring, or perhaps cross-site scripting detection. Functions like those in Introscope, but without involving CA.

There should be a space between "package" and "MyApp" on line 1 of the package source. :)

Nice use of Regexp::Assemble and a __DATA__ section! Although I have yet to play with Plack, I like this idea of using middleware for some intermediary logic.

I like the idea of a middleware module for this. But wouldn't it be better to use (and perhaps update) one of the existing modules for identifying bots?

Neil Bowers wrote this nice article
http://neilb.org/reviews/user-agent.html

I didn't do this for two reasons.

First, I didn't find these modules (or the review) on my initial search of the CPAN. Second, Neil's conclusions show that the tradeoff between accuracy and speed is awful for the cases he saw. What I have now is definitely not optimal from a code and reuse standpoint, but it doesn't slow down every request dramatically.

I tested your code against more than 10,000 user agent strings (from http://useragentstring.com/pages/All/). On my laptop it took about a quarter of a second to filter out 80 bot strings:

$ time perl test__bot_detector.pl
     80
real    0m0.255s
user    0m0.236s
sys     0m0.020s

Not bad at all, thank you!
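
(A test harness along those lines might look like the following reconstruction; the input filename and its format, one user agent string per line, are assumptions, and this is not the commenter's actual script:)

# test__bot_detector.pl: a reconstruction, not the original script
use Modern::Perl;
use MyApp::Plack::Middleware::BotDetector;

# a do-nothing PSGI app gives the middleware something to call
my $mw = MyApp::Plack::Middleware::BotDetector->new(
    app => sub { [ 200, [], [] ] }
);

my $bots = 0;

open my $fh, '<', 'useragents.txt' or die "Cannot open user agent list: $!";

while (my $ua = <$fh>)
{
    chomp $ua;
    my $env = { HTTP_USER_AGENT => $ua };
    $mw->call( $env );
    $bots++ if $env->{'BotDetector.looks-like-bot'};
}

say $bots;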

This sounds way too complex and heavy.

Need a lightweight way to detect NEW bots ... from their behavior.

If it has to run on every page load it would need to be EXTREMELY lightweight ... i.e. no heavier than hitting one or two small files ... no MySQL even ... so as not to add significant latency to page loads, and to allow it to quickly detect and block some of those really nasty ones with insane request rates!

So the detection methods will need to cope with pretty hardcore request rates for long enough to detect and block!

Michael, you could try including three hidden links somewhere on your homepage a la:

<a href="/bot-detect/1"></a>
<a href="/bot-detect/2"></a>
<a href="/bot-detect/3"></a>

An IP address that hits more than one of those links is flagged as a suspected bot.
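
A sketch of middleware implementing that trap (entirely hypothetical; the in-memory %hits hash only works within a single server process, so a real deployment would want shared storage):

package MyApp::Plack::Middleware::BotTrap;
# hypothetical companion middleware: flags IPs which follow hidden trap links

use Modern::Perl;
use parent 'Plack::Middleware';

my %hits;   # REMOTE_ADDR => { trap link number => hit count }

sub call
{
    my ($self, $env) = @_;
    my $ip           = $env->{REMOTE_ADDR};

    # record hits on the hidden links and return an empty response
    if ($env->{PATH_INFO} =~ m{^/bot-detect/(\d+)$})
    {
        $hits{$ip}{$1}++;
        return [ 204, [], [] ];
    }

    # an IP which has followed more than one distinct trap link looks like a bot
    $env->{'BotDetector.looks-like-bot'} = 1
        if $hits{$ip} && keys %{ $hits{$ip} } > 1;

    return $self->app->( $env );
}

1;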
