Annotating User Events for Cohort Analysis

The heart of every successful agile or iterative process is thoughtful refinement. Refinement requires measurement. In software development terms, you might ask "Can we deploy features more quickly?" or "Can we provide more accurate estimates?" or "Can we improve quality and reduce defects?"

In business terms, especially in startups and other small businesses searching for niches, customers, and revenue, you might ask "How can we improve customer engagement?" and "How can we improve our rate of conversion from visitor to paying customer?"

I've been experimenting with something called cohort analysis lately. The results are heartening. In short, you instrument your code to record notable user events. Then you analyze them.

I started by adding a single table to my database:

CREATE TABLE cohort_log
(
    id         INTEGER      PRIMARY KEY AUTOINCREMENT,
    usertoken  VARCHAR(255) NOT NULL,
    day        INTEGER      NOT NULL,
    month      INTEGER      NOT NULL,
    year       INTEGER      NOT NULL,
    event      TEXT(25)     NOT NULL,
    notes      VARCHAR(255) DEFAULT ''
);

A user may generate multiple events. Every event has a canonical name. I haven't made these into a formal enumeration yet, but that's on my list. Every event has a token I'll explain soon. Every event also has a notes field for additional information, such as the user-agent string for the "new visitor has appeared" event or the name of the offending template for the "wow, there's a bug in the template and the system had to bail out this request!" event.

(Separating the timestamp into discrete components is a deliberate denormalization I don't necessarily recommend for your uses. There's a reason for it, but I won't tell you which side of the argument I argued.)

I use DBIx::Class to help manage our data layer, so I have a CohortLog class. The resultset includes several methods to help generate reports, but it also has a special method to insert a new event into the table:

=head2 log_event

Given a hash reference containing key/value pairs of C<usertoken>, C<event>,
and optionally C<notes>, logs a new cohort event. Throws an exception without
both required keys.

=cut

sub log_event
{
    my ($self, $args) = @_;

    do { die "Missing cohort event parameter '$_'\n" unless $args->{$_} }
        for qw( usertoken event );

    my $dt    = DateTime->now;
    $args->{$_} = $dt->$_ for qw( year month day );

    $self->create( $args );
}

This automatically inserts the current date values (in UTC, which is what DateTime->now returns unless you pass a time_zone argument) into the appropriate columns. (Again, a good default value in the database would do this job as well, but we're sticking with this tradeoff for now.)

I added a few methods to the Catalyst context object to log these events:

=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event)  = @_;
    $event{usertoken} ||= $self->sessionid || 'unknownuser';

    push @{ $self->cohort_events }, \%event;
}

=head2 log_cohort_template_error

Turns the previous cohort event into a template error.

=cut

sub log_cohort_template_error
{
    my $self     = shift;
    my $template = $self->stash->{template} || '';
    my $page     = $self->stash->{page}     || '';

    # nothing to amend if this request has not logged an event yet
    my $event    = $self->cohort_events->[-1] or return;

    $event->{event}  = 'TEMPLATEERROR';
    $event->{notes} .= $template . ' ' . $page;
}

=head2 record_cohort_events

Inserts all queued cohort events into the database, then clears the queue.

=cut

sub record_cohort_events
{
    my $self          = shift;
    my $events        = $self->cohort_events;
    my $cohort_log_rs = $self->model( 'DB::CohortLog' );

    for my $event (@$events)
    {
        $cohort_log_rs->log_event( $event );
    }

    @$events = ();
}

The most important method is log_cohort_event(), which takes named parameters corresponding to the cohort's data. The token associated with each event comes from the user's session id. (You can see a couple of flaws to work around, namely that some requests have no session information, such as those from bots and spiders, and that session ids may change over time. There are ways to work around these.)

The log_cohort_template_error() method is more diagnostic in nature. It modifies the previous event to record an error in the template, as there's no sense in recording that a user performed an event when that event never occurred successfully. (Another part of the system detects these catastrophic events and calls this method. Hopefully it never gets called.)

Finally, record_cohort_events() inserts these events into the database. This method gets called at the end of the request, after everything has rendered properly and has been sent to the user. This prevents any error in the event system from causing the request to fail and it reduces the apparent user latency.
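
The queue-then-flush pattern described above can be sketched without Catalyst or DBIx::Class. This is a minimal illustration, not the application's code: the EventQueue package and its sink callback are hypothetical stand-ins for the context object and for the resultset's log_event(). Wrapping each insert in eval is one assumed way to keep a logging failure from affecting the request.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the per-request event buffer.
package EventQueue {
    sub new
    {
        my ($class, %args) = @_;
        return bless { events => [], sink => $args{sink} }, $class;
    }

    # Queue an event; nothing touches storage yet.
    sub log_event
    {
        my ($self, %event) = @_;
        $event{usertoken} ||= 'unknownuser';
        push @{ $self->{events} }, \%event;
    }

    # Send queued events to the sink, then clear the queue.
    # Returns the number of events stored successfully.
    sub flush
    {
        my $self   = shift;
        my $stored = 0;

        for my $event (@{ $self->{events} })
        {
            # a failing sink must not propagate to the caller
            $stored++ if eval { $self->{sink}->( $event ); 1 };
        }

        @{ $self->{events} } = ();
        return $stored;
    }
}

my @table;    # pretend database table
my $queue = EventQueue->new( sink => sub { push @table, shift } );

$queue->log_event( event => 'VIEWEDHOMEPAGE' );
$queue->log_event( event => 'SENTFEEDBACK', usertoken => 'abc123' );

# ... render and send the response here ...

my $stored = $queue->flush;
print "stored $stored events\n";
```

Because the sink is only called from flush(), a slow or broken store cannot delay or break anything that happens before the flush.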

How does it look to use this logging? It's almost trivial:

=head2 index

The root page (/)

=cut

sub index :Path :Args(0)
{
    my ( $self, $c ) = @_;

    $c->log_cohort_event( event => 'VIEWEDHOMEPAGE' );
    $c->stash( template => 'index.tt' );
}

=head2 send_feedback

Allows the user to send feedback about what just happened.

=cut

sub send_feedback :Path('/send_feedback') :Args(0)
{
    my ($self, $c) = @_;
    my $method     = lc $c->req->method;

    return $c->res->redirect( '/users' ) unless $method eq 'post';

    my $params     = $self->get_params_for( $c, 'feedback' );
    $c->model( 'UserMail' )->send_feedback( $c, $params );

    $c->add_message( 'Feedback received! '.
                     'Thanks for helping us make things better!' );

    $c->log_cohort_event( event => 'SENTFEEDBACK' );
    return $c->res->redirect( $params->{path} || '/users' );
}

These two controller actions each call $c->log_cohort_event with a specific event string. (Again, these could easily be constants generated from an enumeration in the database, but we haven't needed to formalize them yet.) While I considered making a Catalyst method attribute (like :Local or :Args) to enforce this logging with an annotation, we decided that the flexibility of logging an event selectively outweighed the syntactic concerns of adding a line of code. Only after a user has actually sent feedback, for example, does the SENTFEEDBACK event get logged.
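
Until the event names become a real enumeration, a small whitelist can catch typos in event strings. A sketch under that assumption: the package name CohortEvents and its functions are hypothetical, and the list contains only the three event names that appear in this article.

```perl
use strict;
use warnings;

package CohortEvents {
    # Only the event names shown in this article; a real list would
    # come from wherever the application defines its events.
    my %KNOWN = map { $_ => 1 } qw(
        VIEWEDHOMEPAGE
        SENTFEEDBACK
        TEMPLATEERROR
    );

    # True if $name is one of the known event names.
    sub is_known_event
    {
        my $name = shift;
        return defined $name && exists $KNOWN{$name} ? 1 : 0;
    }

    # All known event names, sorted (call in list context).
    sub known_events { return sort keys %KNOWN }
}

# The second name contains a deliberate typo.
for my $name (qw( SENTFEEDBACK SENTFEEDBACKK ))
{
    my $status = CohortEvents::is_known_event($name) ? 'ok' : 'unknown';
    print "$name: $status\n";
}
```

A check like this could run in development only, so that a misspelled event name fails loudly in tests without adding risk to production requests.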

Testing for this logging is almost trivial.
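
As one illustration of how small such a test can be, here is a sketch using Test::More and a hand-rolled mock in place of the Catalyst context. MockContext and index_action are hypothetical names invented for this example, and the mock's fallback token skips the session lookup; the real test suite may look quite different.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical mock of the parts of the context the action uses.
package MockContext {
    sub new { return bless { cohort_events => [], stash => {} }, shift }

    sub cohort_events { return $_[0]{cohort_events} }

    # Like Catalyst's stash: merge any pairs given, return the hash ref.
    sub stash
    {
        my $self = shift;
        my %new  = @_;
        @{ $self->{stash} }{ keys %new } = values %new;
        return $self->{stash};
    }

    # Simplified copy of log_cohort_event(), minus the session lookup.
    sub log_cohort_event
    {
        my ($self, %event) = @_;
        $event{usertoken} ||= 'unknownuser';
        push @{ $self->{cohort_events} }, \%event;
    }
}

# Same body as the index action, without the Catalyst attributes.
sub index_action
{
    my ( $self, $c ) = @_;

    $c->log_cohort_event( event => 'VIEWEDHOMEPAGE' );
    $c->stash( template => 'index.tt' );
}

my $c = MockContext->new;
index_action( undef, $c );

my $events = $c->cohort_events;
is scalar @$events, 1, 'index should log exactly one event';
is $events->[0]{event}, 'VIEWEDHOMEPAGE', '... with the right name';
is $events->[0]{usertoken}, 'unknownuser', '... and a fallback token';
is $c->stash->{template}, 'index.tt', '... and should set the template';

done_testing;
```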

Reporting is slightly more interesting, but how you do that depends on how you divide your users into distinct cohorts.
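
As one hypothetical way to slice the data, the sketch below assigns each user token to a cohort by the month of its first recorded event, then counts how many distinct users in each cohort ever logged a given event. The cohort definition, the function name cohort_report, and the sample rows are all assumptions for illustration; the rows are hash references shaped like cohort_log records.

```perl
use strict;
use warnings;

# Count, per first-seen month, how many distinct users exist and how
# many of them ever logged $target_event.
sub cohort_report
{
    my ($rows, $target_event) = @_;
    my (%first_seen, %did_event);

    for my $row (@$rows)
    {
        my $token = $row->{usertoken};
        my $month = sprintf '%04d-%02d', @{$row}{qw( year month )};

        # zero-padded keys compare correctly as strings
        $first_seen{$token} = $month
            if !exists $first_seen{$token} || $month lt $first_seen{$token};

        $did_event{$token} = 1 if $row->{event} eq $target_event;
    }

    my %report;
    for my $token (keys %first_seen)
    {
        my $cohort = $report{ $first_seen{$token} }
                 ||= { users => 0, converted => 0 };
        $cohort->{users}++;
        $cohort->{converted}++ if $did_event{$token};
    }

    return \%report;
}

# Made-up sample data, not real results
my @rows = (
    { usertoken => 'a', year => 2011, month => 9,  day => 1,  event => 'VIEWEDHOMEPAGE' },
    { usertoken => 'a', year => 2011, month => 10, day => 2,  event => 'SENTFEEDBACK'   },
    { usertoken => 'b', year => 2011, month => 9,  day => 15, event => 'VIEWEDHOMEPAGE' },
    { usertoken => 'c', year => 2011, month => 10, day => 4,  event => 'SENTFEEDBACK'   },
);

my $report = cohort_report( \@rows, 'SENTFEEDBACK' );

for my $month (sort keys %$report)
{
    printf "%s: %d of %d users sent feedback\n",
        $month, @{ $report->{$month} }{qw( converted users )};
}
```

In a real report the grouping would more likely happen in SQL; doing it in Perl here only keeps the example free of database dependencies.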

The last exciting problem is how to detect spiders, bots, and other non-human user agents to exclude them from this analysis. Optimizing the sales and conversion and retention and engagement funnels for automated processes makes little sense. I have some ideas—some of them amazing failures—but that's a story for another time.
