August 2012 Archives

Speeding Up My Test Suite by 25%


One of my work projects has a test suite that's grown a little too slow for my liking. With only a few hundred tests, I've seen our test runs take over 30 seconds in a couple of cases. We're diligent about separating tests into individual logical files, so that when we work on a feature we can run only its tests as we develop and save running the whole suite for the last verification before checkin, but 30 seconds is still too long.

(Before you ask, our test suite is fully parallelizable, so that 30 seconds keeps all of the cores on a development box at full usage.)

I realize that one fad in software of the last few years is continuous integration, where you submit your commits to a box somewhere which runs the full test suite and emails you if something goes wrong, sometime in the future, hopefully less than an hour from now. Hopefully.

We can't afford to wait that long to know if our code is right. (I like to push to our production box as soon as we have well-tested fixes that we know work reliably. That's often more than once per day. I'm the only one working on the site today, and I've already pushed three changes, with at least one more coming.)

When I worked on Parrot, I spent a lot of time profiling the VM to sneak out small performance improvements here and there, using less memory and far fewer CPU cycles. I'm at home in a profiler, so I picked a very small test program and ran it through Devel::NYTProf.

I've mentioned before that this application has a Catalyst frontend which takes full advantage of Moose throughout. It uses DBIx::Class as its persistence layer.

As you might expect, startup time dominates most of the tests. We have 21 individual test files, and startup took close to two seconds for each of them. No matter how much we parallelized, we were always going to spend no less than 40 total seconds running tests. (The non-parallel test suite runs in ~67 seconds, so parallelization already gives a 55% improvement.)

When I looked at the profile, I saw that most of the startup time came from a couple of dependencies we didn't need.

In fact, even though I use HTML::FormHandler quite happily on another project, we don't use it on this project at all. We did use a Catalyst plugin which brought it in as a dependency, but we never used that feature.

I spent an hour revising that part of our application to remove that plugin (which we used to use, then customized our code so that the plugin was almost vestigial) and the parallel test run went down to 23 seconds.

We trimmed 10% of our dependencies, just by removing one plugin and the code it brought in.

(Tip: use Devel::TraceUse to find out which modules load which modules. It's a good way to get the conceptual weight of a dependency.)
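For example, here's roughly how I invoke it against a single test file (the file name here is a placeholder, and the report goes to STDERR):

perl -d:TraceUse t/some_feature.t 2>&1 | less

The indented tree it prints shows which module loaded which other module, so heavyweight dependency chains stand out quickly.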

I don't shy away from dependencies when they're useful, but revising that one file left us with less code overall than we had before I started.

With that said, there's clearly room for improvement. I traced through some Moose code which handles initialization (especially of attributes) and it seems like there should be some efficiency gains there. Similarly, both Catalyst and DBIC go through some contortions to load plugins and components that in my case were unnecessary. (I have a patch to consider for the former, but not the latter.)

Two things I learned do bother me somewhat. If you use Devel::TraceUse on a project of any size, you'll see a lot of inconsistency and duplication, thanks to years of TIMTOWTDI. Catalyst brings in Moose, but DBIC eventually ends up bringing in Moo. No one can decide whether to use @ISA or use base '...'; or use parent '...';. A lot of larger projects (DateTime, Moose extensions) have lots of helper modules spread out in lots of tiny .pm files which each need to be located, loaded, and compiled.

Worse, all of the nice declarative syntax you get with some of these modern libraries has a cost behind the scenes every time you run the program even if nothing has changed.

If I were to pick an optimization target for future versions of Perl 5, I'd want a good way for Moose or Catalyst or whatever to precompile my nice sugared code into a cached representation that could then load faster on subsequent invocations. (I know you're not getting an order of magnitude improvement here: you still have to set up stubs for metaclasses and the like, but even getting attributes and methods in place would be an improvement!)

I'd settle today for a preforking test suite that doesn't require me to rewrite all of my tests.

Then again, that ~25% improvement in test execution from removing one dependency isn't bad for the week either.

In the persistent and recurring debate over rudeness, harassment, and other exclusionary events in technical circles, there's always that one persnickety little argument. You know the one.

You shouldn't feel offended. That's just silly.

Yet Another Bad Example

Take, for example, a cartoon comparing the Moose object system to elective breast enhancement surgery and the attractiveness of an actress before and after such surgery. (Hmm, it's even less amusing when I describe it that way.)

As you might rightly imagine, several people winced at yet another incident of the sexualization of something that's not sexual at all.

As you might sadly expect, the responses to that incident were attempts to wave away any disagreement by dismissing the possibility anyone might feel any justified offense or exclusion or discomfort.

If you're playing derailment bingo, grab an extra card.

This isn't difficult, my friends. It's actually very simple.

The Unnecessary Sexualization of Technology

People have sex. Yes, everyone understands that. Most of us in the world of technology are adults, or at least have gone through puberty. Everyone understands that too.

Yet when I as a heterosexual adult male go to a conference or a user group meeting or participate in a mailing list or a web forum or an IRC channel devoted to technology, I have no desire to talk about sex. Even though at least 90% of the other participants are likely straight adult males like me, the purpose of the group has nothing to do with sex or sexuality.

Now imagine that you're not in that 90% group and you want to talk about technology, but you have to wade through discussions of things that are completely unrelated, sensitive, and which continually point out to you that you are very different from that group. Constantly.

Imagine going to an event where your nametag says "I don't belong" and every conversation you try to have starts with the other person saying "Hi! You don't belong here! How are you today?"

(I went to a charity auction earlier this year where the bids started at multiple thousands of dollars for packages like a week's vacation on a private island and a chartered flight to one of the world's top golf spots. They had a wine wall where some bottles cost more than my suit. I left before dessert. I didn't belong.)

If you try to justify the sexualization of technology with "It's just a joke" or "It's a beautiful, normal part of human existence" or "Lighten up", you're behaving rudely. Please stop.

If you're thinking of complaining "But this means we can't talk about other things we like, like Dr. Who or comic books! Why are you censoring us? How boring it would be to talk about technology alone all the time!" then you're committing the slippery slope fallacy and demonstrating a false equivalence fallacy. Please instead discuss things like a rational adult with listening and reasoning skills.

Only I Get to Define "Offensiveness"

If telling other people they don't belong isn't bad enough, my favorite exhibit of bad behavior is telling people their feelings don't matter because only you get to decide what they can find offensive or exclusionary or mean-spirited or inappropriate.

This happens in many ways. (This list is not exhaustive.)

  • "But I asked a (insert token member of the group you supposedly "offended" here) and they said it was okay.
  • "It's just a joke, lighten up."
  • "You're misunderstanding the joke."
  • "You don't have a sense of humor."
  • "There's no logical reason for you to feel offended."
  • "You're making too big a deal out of this."
  • "You're making (insert the group you supposedly "offended" here) out to be humorless prigs, all too ready to jump on honest, good-hearted, handsome people like me, and you're shrill and probably no fun to hang out with in person too."
  • "I'm not as bad as (Hitler). Get some perspective."
  • "(Insert some other bad thing here which you perceive to be a bigger problem) is a bigger problem in the world. Why don't you solve that first?"
  • "Everyone I know in person says I'm not (insert some epithet), therefore I can never exhibit this bad behavior you ascribe to me. Thus it never happened. QED."
  • "Everyone's offended by something, so if we always sought to avoid offense, we'd never do anything. Therefore doing anything is going to offend, so you really need to get over it."
  • "Everyone else is stupid except me. Everyone else is an unenlightened boor except me. Bow down before your cultural and moral superior."
  • "If you interpreted it that way, it's your mind with the problem. Takes one to know one."
  • "You're the only one complaining. Silence implies consent. Therefore it's not really a big deal!"
  • "(Insert some group underrepresented in the community) don't like (the focus of the community) anyway. So what's the problem?"
  • "You're acting like a patronizing nanny, and we reasonable adults should be able to have a conversation without some humorless scold like you coming around to nag us all the time! Who are you to tell us how you feel?"
  • "Talking about this only makes the (insert community) community look bad! Why are you airing our dirty laundry? You're just doing this to get attention!"

In effect, what you are saying is "I refuse to take responsibility for what I said or did, and I'm going to place the blame on you the audience for how you feel about it. I believe I am a good person, and I can't reconcile the idea that I might have done something wrong, so my cognitive dissonance will instead claim that everyone else in the world is wrong, if necessary, just to save face."

You're better off saying "I didn't mean to cause any offense," but even that is a weak apology, because it's not your intention that decides how other people will react. Yes, you get some credit for not telling other people how to feel, but it's still a weak response.

The proper response is "I'm sorry. I didn't think about how other people might take this. I don't want to exclude other people unnecessarily, so I'll be more cautious about what I say and do in the future."

That's what a reasonable adult would do when caught in a mistake. That's how to defuse an honest mistake and not turn yourself into a pariah.

(Yes, I assume you're a reasonable and thoughtful adult. You're capable of empathizing with other people even if they're not exactly the same as you. Even if you have self-diagnosed yourself with some sort of social interaction disorder, I expect you to live up to this standard.)

The Bottom Line

No one's saying "Don't have fun" or "Be bland and non-offensive." (I shouldn't have to write this, but I know some of you are still sliding down that slippery slope argument. Stop it.)

What we are saying is this: not everyone is like you. Not everyone likes what you like. Not everyone grew up the same way you grew up. Not everyone had the same opportunities or skills or goals or education or experiences you did, so think about what you say and do and how it might affect other people before you say or do it.

If you make a mistake (everyone makes mistakes), at least make honest mistakes, and then own up to them quickly and honestly.

That's it. That's not too onerous a burden for intelligent, capable, mature adults. It's just empathy after all.

Writing good tests for code you care about is a creative process as much as it is an engineering discipline. You sometimes must find a balance between doing things the right way and doing things the effective way. The closer your tests are to the real usage patterns of your code, the more accurate they are; the faster and easier they are to run, the better they help you find defects.

Slow but accurate tests won't get run, while fast but buggy tests won't produce any trust.

This dichotomy is often obvious in tests which must interact with multiple parts of a system, such as a database and a business logic layer. While many people say "Pull out your favorite mock object library and get to work!", I find mock objects are scalpels, not hammers, in that you need them for only a few specialized and precise operations, not a tool you'd use every day.

Consider databases.

One of my projects at work is a database-backed B2C web site. The schema is under fairly active development (I deployed a new version of the site with a big change to the schema an hour ago and have at least two more updates this week). I have a version of the live database handy for testing purposes (with user information scrubbed out, or where that's not possible, salted and hashed), but I almost never use it. (I only use it when verifying that a deployment will succeed before performing the deployment.)

Instead, I use DBICx::TestDatabase and a hand-selected subset of data for all of the tests. DBICx::TD takes my existing schema and creates a SQLite database in memory (not on the disk at all) and returns a DBIx::Class schema object connected to that database. My initialization code dumps in some fixture data, and then it's ready to use in all of my tests.

Again, that database is not on disk. This has at least two benefits. First, it's a lot faster than a database on disk, because it has to do less IO. Even if I'm very careful about wrapping operations in transactions, waiting for data to hit a platter is still a lot slower than waiting for data to hit RAM. Second, every process has its own unique database in memory and there's no contention for the disk or a single database file. My tests are trivially parallelizable in this respect.

All of the setup code is in my test library, t::DB. (I could use lib 't/lib'; instead, but I haven't needed that yet.) The relevant code looks like this:

package t::DB;

use strict;
use warnings;

use Try::Tiny;
use Test::More;    # provides BAIL_OUT
use DBICx::TestDatabase;
use MyApp ();
use Test::WWW::Mechanize::Catalyst 'MyApp';
use File::Touch;

make_cached_components();

my $schema;
sub make_schema { $schema ||= DBICx::TestDatabase->new( shift ) }

sub install_test_database
{
    my ($app, $schema) = @_;
    MyApp->model( 'DB' )->schema( $schema );
    MyApp->log->disable( 'warn' );
}

sub import
{
    my $self        = shift;
    my $appname     = 'MyApp';
    my $schema_name = $appname . '::Schema';
    my $schema      = make_schema( $schema_name );
    install_test_database( $appname, $schema );

    try
    {
        create_security_questions( $schema->resultset( 'SecurityQuestion' ) );
        create_user_roles(         $schema->resultset( 'Role'       ) );
        create_users(              $schema->resultset( 'User'       ) );
        create_invitations(        $schema->resultset( 'Invitation' ) );
        create_sectors(            $schema->resultset( 'Sector'     ) );
        create_industries(         $schema->resultset( 'Industry'   ) );
        create_stocks(             $schema->resultset( 'Stock'      ) );
    }
    catch
    {
        my $exception = $_;
        BAIL_OUT( 'Fixture creation failed: ' . $exception );
    };

    # ...
}

When a test file uses this library, the import() method creates a new test database for my application's schema. (The schema classes are already in memory thanks to the use lines which bring them in as a dependency of the app.) The make_schema() call initializes a closure variable containing the schema object.

Before today, this code went to some contortions using Catalyst::Test to swap the existing application's schema with the generated schema, but now all it has to do is to call install_test_database(), and Catalyst behaves as if it had connected to the schema using the in-memory database from the start.

As a bonus, I removed a couple of dependencies and deleted a few lines of ugly code.

You can take this approach and use it on your own (unless you have complex database logic SQLite can't handle, and even then I'd work with this design for a long time before I gave it up voluntarily). You can take a more general principle from it as well: make it easy to swap components in and out of your application. The schema layer adds business logic to my data storage, but it's also valuable because it presents an interface between a database and my application. Anything polymorphic (or even allomorphic) to that interface should suffice to give me confidence between tests and deployed application.

(Of course the deployed database on disk has to match the schema in memory, but I did mention we have another mechanism to test that.)

Scraping the HTML output of a web application to see if your actions produced the right results is messy. It's also the most accurate way I know of to verify that your application behaves correctly from a user-initiated request to the server to a user-visible response.

I've used modules like Test::WWW::Mechanize and Test::WWW::Mechanize::Catalyst with a fair degree of satisfaction. I appreciate how they simplify the business of setting up a local server, making requests, filling out forms, and following links. I'm less satisfied with the methods content_contains() and content_like() for testing the presence of substrings within the HTML output. When the tests pass, these methods work pretty well. When the tests fail, debugging is often tedious. I find myself writing code like:

sub test_index
{
    my $ua = get_ua();
    $ua->get( '/stocks' );

    exit diag $ua->content
        unless $ua->content_contains( 'Become a Great Investor',
            '/stocks should redirect to main page' );
}

... and then removing those statements before I check in the passing code. That's tedious.

Besides improving the diagnostic messages, I'd like to check my substrings against only a subset of the produced HTML. There's no reason I need to worry about the navigation of the site (which is always the same and tested elsewhere) or the chrome of the particular page (also repeated).

I could cut off the UI layer and test that the values passed into the templates are appropriate, but that couples the tests to the templates and means I have to test the templates on their own anyhow. That's a mess.

I could instrument the application to render only a fragment of the whole template when given a special parameter, but that's extra code in the application I have to maintain and test.

What I'd rather do is give the test method some sort of selector (XPath, CSS) to grab a single HTML element out of the DOM and run the comparison against the contents of that element and its children.

You can accomplish this in multiple ways. I wanted to try out this approach, so I hacked up a little test. This is not clean. You should probably not do this unless you want to maintain your own code. I might change this API. With that said, I like the results.

I have a small test library which handles the busy work of setting up a SQLite database in memory with DBICx::TestDatabase. It also loads Test::WWW::Mechanize::Catalyst and swaps its schema for the test schema. (I could do this from a separate initialization file, but I haven't done that yet.)

This test library now monkeypatches Test::WWW::Mechanize::Catalyst:

package Test::WWW::Mechanize::Catalyst;

use Mojo::DOM;

sub dom_id_contains
{
    my ($self, $id, $string, $desc) = @_;

    my $dom  = Mojo::DOM->new( $self->content );
    my $text = $dom->at( $id )->content_xml;

    local $Test::Builder::Level = $Test::Builder::Level + 1;
    my $status = Test::WWW::Mechanize::contains_string( $text, $string, $desc );
    ::diag $text unless $status;
    return $status;
}

It's a little messy, but it works. (In particular, I dislike the ::diag() call, but it's fine for a proof of concept.)

This code takes the current context of the most recent request, creates a Mojo::DOM object, then uses the provided identifier as a CSS selector to find a node within the resulting DOM. It stringifies that node and its contents, then matches the provided substring against that stringified content.

The rest of the code makes this test method conform to the expected interface of other test methods.

Using this code has simplified both my testing and my debugging:


    $stock->update({ fcf_average => 0.20 });
    $ua->get( '/stocks/AA/view' );
    $ua->dom_id_contains( '#textual_pros', 'making money at a great rate',
        '... good results should be in the pros column' );

    $stock->update({ fcf_average => -0.02 });
    $ua->get( '/stocks/AA/view' );
    $ua->dom_id_contains( '#textual_cons', 'not making money',
        '... poor results should be in the cons column' );

The 15 minutes I spent coding this (I first tried XPath, but CSS selectors are so much nicer) were worth it to prove the merit of this idea. What's left is implementation.

This code would be even easier if the contains_string() method provided diagnostics, but I can understand why it doesn't.

Ideally this could be a role on Test::WWW::Mechanize::Catalyst which maintains its own Mojo::DOM object and clears it whenever a request occurs. It should also be more aware of Test::Builder to manage diagnostics in a cleaner way.

The biggest drawback of course is monkeypatching. The inheritance relationship which eventually lets Mech and Mech::Catalyst work is incompatible with things like Mech::PSGI. Monkeypatching one of these modules doesn't necessarily work for monkeypatching another.

This is a problem of object orientation and components that needs much more thought to solve well, but for now my tests are simpler and easier to maintain, and I'm comfortable enough with this little bit of mess until I find a better way to clean it up.

When you're collecting data from the wild and wooly Internet at large, you never know exactly what you're going to find.

This article is the third in a series which started with Annotating User Events for Cohort Analysis and continued in Detecting Bots and Spiders with Plack Middleware.

By the end of the previous article, I had a pretty good solution to the problem of detecting spiders and bots I'd seen before. I have confidence in the test suite I wrote to prove that my event logger ignores those user agents, so that any events they trigger in the web application do not get logged.

What I don't know is the identifying information of bots and spiders I haven't seen yet.

Any well behaved bot or spider should request robots.txt before beginning to crawl the site. Not all do. Worse, not all requests for that file belong to bots or spiders. I've looked at that file on various servers in a normal web browser.

I pondered the idea of logging requests for favicon.ico as evidence that a real user agent made a request, but that feels unreliable too.

If you saw either of my talks this summer (When Wrong is Better at YAPC::NA 2012 and How and When to Do It Wrong at Open Source Bridge 2012), you may have heard me say that I try to design my software so that it's idempotent. If you run it a second time on the same dataset, you should get the same results. There's no shame in running it a second time. Nothing goes wrong.

(I go further: it should be quick to run again, and if you've improved the data, you should get better results. This should be obvious, but too much software destroys the incoming data when the transform-filter-report approach is safer.)

Analysing the event log for patterns and reports is relatively fast: fast enough that re-running the past week or two of reports is no burden. (It could be faster, but it's fine for now.) I don't mind a few bots getting through and having their events logged because I can remove them and regenerate the reports.

I can identify them because I have a notes field in the event log database table:

CREATE TABLE cohort_log
(
    id         INTEGER      PRIMARY KEY AUTOINCREMENT,
    usertoken  VARCHAR(255) NOT NULL,
    day        INTEGER      NOT NULL,
    month      INTEGER      NOT NULL,
    year       INTEGER      NOT NULL,
    event      TEXT(25)     NOT NULL,
    notes      VARCHAR(255) DEFAULT ''
);

When the application processes a request and that request triggers the logging of an event, the log_cohort_event() method gets called. The previous article showed an earlier version of this code. The deployed code, as of this writing, is actually:

=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event)  = @_;
    my $env             = $self->request->env;
    return if $env->{'BotDetector.looks-like-bot'};

    unless ($self->sessionid)
    {
        $self->create_session_id;
        $event{notes} = $env->{HTTP_USER_AGENT};
    }

    $event{usertoken} ||= $self->sessionid;

    push @{ $self->cohort_events }, \%event;
}

You've seen half of this code before. If the custom Plack middleware has detected that the request came from a spider or bot, it's set the appropriate flag in the PSGI environment hash.

If the middleware hasn't set that flag, the request may have come from a human.

If the request has no active session—whether it's the first time a user has appeared on the site or whether the user refuses session tracking—the system needs a session id. This token is part of the cookie sent to users and the index into the server-side storage of session data. It's also the unique identifier for individuals within the event log. (I could track something like a user id, but users who don't have accounts on the system won't have those, and it's far too easy to correlate user ids with real people, and all of the owners of the company have agreed that we'll only ever do data mining on anonymized data. If it's too easy to correlate user actions with session ids—if someone ever finds their way into the server-side session storage in the small window when there's any identifying information in there—I'll hash the session ids when I use them as tokens. It hasn't been a problem yet, and I don't foresee it as a problem.)
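If that day comes, the hashing step itself is tiny. A minimal sketch, assuming Digest::SHA (which is my choice for illustration, not something the application currently uses):

use Digest::SHA 'sha256_hex';

# derive a stable, anonymous token from the session id
my $token = sha256_hex( $self->sessionid );

The same session id always produces the same token, so cohort grouping still works, but the token no longer serves as a direct index into the server-side session storage.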

Oh, there's one more reason why a request without a session ID might exist: it might come from a bot or spider. In my experience, most automated processes ignore session cookies.

In any case, at the point of identifying a request which looks like it may have come from a real user, the system adds the request's user agent to the event's notes field.

I haven't yet written the code to grab unique user agents out of this field every day, but that's trivial. (It's just one more report, and you can probably write the SQL for it in your head, much less the DBIC code.) If that report orders by the number of occurrences of that field, it's almost trivial to pick out more user agents that look like likely bots. Then we can do two things: filter out those requests from the event logs and re-run the reports and add those user agent strings to the Plack middleware that detects bots.
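Here's a hedged sketch of what that report might look like with DBIx::Class, assuming a CohortLog result class for the cohort_log table above; the names are illustrative rather than the deployed code:

my $agents_rs = $schema->resultset( 'CohortLog' )->search(
    { notes => { '!=' => '' } },
    {
        select   => [ 'notes', { count => 'id', -as => 'hits' } ],
        as       => [ 'user_agent', 'hits' ],
        group_by => 'notes',
        order_by => { -desc => 'hits' },
    }
);

# the most frequent user agent strings float to the top for review
while (my $row = $agents_rs->next)
{
    printf "%6d  %s\n", $row->get_column( 'hits' ),
                        $row->get_column( 'user_agent' );
}

Anything near the top with a suspicious name goes straight into the middleware's list.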

The process isn't completely automated, but it's automated enough that only a little bit of human interaction can polish the system such that it gets better every day. We can't prevent exceptional or undesired events from happening, but we can identify them, then remove them from the system.

In the absence of a perfect system with perfect knowledge, I'll take a robust system we know how to improve.

If you analyze the user requests of your web site, you'll have to deal with enormous numbers of bots and spiders and other automated requests for your resources which don't represent measurable users. As promised in Annotating User Events for Cohort Analysis, here's how I handle them.

I wrote a tiny piece of Plack middleware which I enabled in the .psgi file which bundles my application:

package MyApp::Plack::Middleware::BotDetector;
# ABSTRACT: Plack middleware to identify bots and spiders

use Modern::Perl;
use Plack::Request;
use Regexp::Assemble;

use parent 'Plack::Middleware';

my $bot_regex = make_bot_regex();

sub call
{
    my ($self, $env) = @_;
    my $req          = Plack::Request->new( $env );
    my $user_agent   = $req->user_agent;

    if ($user_agent)
    {
        $env->{'BotDetector.looks-like-bot'} = 1 if $user_agent =~ qr/$bot_regex/;
    }

    return $self->app->( $env );
}

sub make_bot_regex
{
    my $ra = Regexp::Assemble->new;
    while (<DATA>)
    {
        chomp;
        $ra->add( '\b' . quotemeta( $_ ) . '\b' );
    }

    return $ra->re;
}

1;
__DATA__
Baiduspider
Googlebot
YandexBot
AdsBot-Google
AdsBot-Google-Mobile
bingbot
facebookexternalhit
libwww-perl
aiHitBot
Baiduspider+
aiHitBot
aiHitBot-BP
NetcraftSurveyAgent
Google-Site-Verification
W3C_Validator
ia_archiver
Nessus
UnwindFetchor
Butterfly
Netcraft Web Server Survey
Twitterbot
PaperLiBot
Add Catalog
1PasswordThumbs
MJ12bot
SmartLinksAddon
YahooCacheSystem
TweetmemeBot
CJNetworkQuality
YandexImages
StatusNet
Untiny
Feedfetcher-Google
DCPbot
AppEngine-Google

Plack middleware wraps around the application to examine and possibly modify the incoming request, to call the application (or the next piece of middleware), and to examine and possibly modify the outgoing response. Plack conforms to the PSGI specification to make this possible.

Update: This middleware is now available as Plack::Middleware::BotDetector from the CPAN. Thanks to Big Blue Marble and Trendshare for sponsoring its development and release.

All of that means that any piece of middleware gets activated by something which calls its call() method, passing in the incoming request as the first parameter. This request is a hash with specified keys. The application, or at least the next piece of middleware to call, is available from object's accessor method app().

(I'm lazy. I use Plack::Request to turn $env into an object. This is not necessary.)

The rest of the code is really simple. I have a list of unique segments of the user agent strings I've seen in this application. I use Regexp::Assemble to turn these words into a single (efficient) regex. If the incoming request's user agent string matches anything in the regex, I add a new entry to the environment hash.
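To make that concrete, here's a tiny, hedged illustration of what Regexp::Assemble does; the exact assembled pattern varies, so treat the comment as approximate:

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( quotemeta $_ ) for qw( Googlebot bingbot YandexBot );

my $re = $ra->re;    # roughly (?:(?:Google|bing)bot|YandexBot)

print "looks like a bot\n"
    if 'Mozilla/5.0 (compatible; Googlebot/2.1)' =~ $re;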

With that in place, any other piece of middleware executed after this point in the request—or the application itself—can examine the environment and choose different behavior based on the bot-looking-ness of any request. My cohort event logger method looks like:

=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event)  = @_;
    return if $self->request->env->{'BotDetector.looks-like-bot'};
    $event{usertoken} ||= $self->sessionid || 'unknownuser';

    push @{ $self->cohort_events }, \%event;
}

The added return statement (the check of the BotDetector flag) is all it took in my application to stop logging cohort events for spiders. If and when I see a new spider in the logs, I can exclude it by adding a line to the middleware's DATA section and restarting the server.

(You might rather store this information in a database, but I'd rather build the regex once than loop through a database with a LIKE query. I haven't found an ideal alternate solution, which is why I haven't put this on the CPAN. Perhaps this is two modules, one for the middleware and one which exports a regex to identify spider user agents.)

There's one more trick to this cohort event logging: traceability. That's the topic for next time.

The heart of every successful agile or iterative process is thoughtful measurement and refinement, and refinement requires measurement. In software development terms, you might ask "Can we deploy features more quickly?" or "Can we provide more accurate estimates?" or "Can we improve quality and reduce defects?"

In business terms—especially in startups and other small businesses searching for niches and customers and revenue—you might ask "How can we improve customer engagement?" and "How can we improve our rates of visitor to paying customer conversion?"

I've been experimenting with something called cohort analysis lately. The results are heartening. In short, you instrument your code to record notable user events. Then you analyze them.

I started by adding a single table to my database:

CREATE TABLE cohort_log
(
    id         INTEGER      PRIMARY KEY AUTOINCREMENT,
    usertoken  VARCHAR(255) NOT NULL,
    day        INTEGER      NOT NULL,
    month      INTEGER      NOT NULL,
    year       INTEGER      NOT NULL,
    event      TEXT(25)     NOT NULL,
    notes      VARCHAR(255) DEFAULT ''
);

A user may generate multiple events. Every event has a canonical name. I haven't made these into a formal enumeration yet, but that's on my list. Every event has a token I'll explain soon. Every event also has a notes field for additional information, such as the user-agent string for the "new visitor has appeared" event or the name of the offending template for the "wow, there's a bug in the template and the system had to bail out this request!" event.

(Separating the timestamp into discrete components is a deliberate denormalization I don't necessarily recommend for your uses. There's a reason for it, but I won't tell you which side of the argument I argued.)

I use DBIx::Class to help manage our data layer, so I have a CohortLog class. The resultset includes several methods to help generate reports, but it also has a special method to insert a new event into the table:

=head2 log_event

Given a hash reference containing key/value pairs of C<usertoken>, C<event>,
and optionally C<notes>, logs a new cohort event. Throws an exception without
both required keys.

=cut

sub log_event
{
    my ($self, $args) = @_;

    do { die "Missing cohort event parameter '$_'\n" unless $args->{$_} }
        for qw( usertoken event );

    my $dt    = DateTime->now;
    $args->{$_} = $dt->$_ for qw( year month day );

    $self->create( $args );
}

This automatically inserts the current (timezone-adjusted) time values into the appropriate columns. (Again, a good default value in the database would make this work correctly, but we're sticking with this tradeoff for now.)

I added a couple of methods to the Catalyst context object so as to log these events:

=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event)  = @_;
    $event{usertoken} ||= $self->sessionid || 'unknownuser';

    push @{ $self->cohort_events }, \%event;
}

=head2 log_cohort_template_error

Turns the previous cohort event into a template error.

=cut

sub log_cohort_template_error
{
    my $self     = shift;
    my $template = $self->stash->{template};
    my $page     = $self->stash->{page} || '';
    my $event    = @{ $self->cohort_events }[-1];

    $event->{event}  = 'TEMPLATEERROR';
    $event->{notes} .= $template . ' ' . $page;
}

=head2 record_cohort_events

=cut

sub record_cohort_events
{
    my $self          = shift;
    my $events        = $self->cohort_events;
    my $cohort_log_rs = $self->model( 'DB::CohortLog' );

    for my $event (@$events)
    {
        $cohort_log_rs->log_event( $event );
    }

    @$events = ();
}

The most important method is log_cohort_event(), which takes named parameters corresponding to the cohort's data. The token associated with each event comes from the user's session id. (You can see a couple of flaws to work around, namely that some requests have no session information, such as those from bots and spiders, and that session ids may change over time. There are ways to work around these.)

The log_cohort_template_error() method is more diagnostic in nature. It modifies the previous event to record an error in the template, as there's no sense in recording that a user performed an event when that event never occurred successfully. (Another part of the system detects these catastrophic events and calls this method. Hopefully it never gets called.)

Finally, record_cohort_events() inserts these events into the database. This method gets called at the end of the request, after everything has rendered properly and has been sent to the user. This prevents any error in the event system from causing the request to fail and it reduces the apparent user latency.

How does it look to use this logging? It's almost trivial:

=head2 index

The root page (/)

=cut

sub index :Path :Args(0)
{
    my ( $self, $c ) = @_;

    $c->log_cohort_event( event => 'VIEWEDHOMEPAGE' );
    $c->stash( template => 'index.tt' );
}

=head2 send_feedback

Allows the user to send feedback about what just happened.

=cut

sub send_feedback :Path('/send_feedback') :Args(0)
{
    my ($self, $c) = @_;
    my $method     = lc $c->req->method;

    return $c->res->redirect( '/users' ) unless $method eq 'post';

    my $params     = $self->get_params_for( $c, 'feedback' );
    $c->model( 'UserMail' )->send_feedback( $c, $params );

    $c->add_message( 'Feedback received! '.
                     'Thanks for helping us make things better!' );

    $c->log_cohort_event( event => 'SENTFEEDBACK' );
    return $c->res->redirect( $params->{path} || '/users' );
}

These two controller actions each call $c->log_cohort_event with a specific event string. (Again, these could easily be constants generated from an enumeration in the database, but we haven't needed to formalize them yet.) While I considered making a Catalyst method attribute (like :Local or :Args) to enforce this logging with an annotation, we decided that the flexibility of logging an event selectively outweighed the syntactic concerns of adding a line of code. Only after a user has actually sent feedback, for example, does the SENTFEEDBACK event get logged.

Testing for this logging is almost trivial.
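Here's a hedged sketch of what such a test might look like, borrowing the in-memory database setup described elsewhere in these posts; the t::DB calls and names are illustrative, not the deployed suite:

use Test::More;
use Test::WWW::Mechanize::Catalyst 'MyApp';
use t::DB;

my $ua     = Test::WWW::Mechanize::Catalyst->new;
my $schema = t::DB::make_schema( 'MyApp::Schema' );

$ua->get_ok( '/', 'home page should load' );

# record_cohort_events() runs at the end of the request, so the row exists now
ok $schema->resultset( 'CohortLog' )
          ->search({ event => 'VIEWEDHOMEPAGE' })->count,
    '... and visiting it should log a VIEWEDHOMEPAGE event';

done_testing();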

Reporting is slightly more interesting, but how you do that depends on how you divide your user base into distinct cohorts.

The last exciting problem is how to detect spiders, bots, and other non-human user agents to exclude them from this analysis. Optimizing the sales and conversion and retention and engagement funnels for automated processes makes little sense. I have some ideas—some of them amazing failures—but that's a story for another time.

One of my web projects has several directories which contain, essentially, static files with a little bit of dynamic content in the wrapper. This dynamic content is specific to each user—a change in navigation, some customization.

The prototype of this project split the responsibility for serving the site between the frontend web server and the backend Plack server. As you can imagine, it worked well enough for a prototype, but it clearly needed replacement once the project became serious.

I didn't want to invest a lot of effort into writing a bunch of unique controllers for each directory with templated files (/about, /contact, /resources, et cetera), but I want to share behavior between all of these routes.

Essentially I have several subdirectories in my templates/ directory which correspond to these routes. Each subdirectory contains one or more templates for each resource within that route, such as /about/jobs and /contact/press. Something in my Catalyst application needs to be able to resolve those requests to those templates.

That's easy enough.

I ended up factoring this code into a parametric role to apply to controllers because I want to reuse this code elsewhere, and because it needs a bit of customization. Here's all it is:

package Stockalyzer::Controller::Role::HasStaticPath;
# ABSTRACT: maps controller methods to static routes

use Modern::Perl;
use MooseX::Role::Parameterized;

parameter 'routes', required => 1;
parameter 'for',    required => 1;

role
{
    my $p     = shift;
    my $class = $p->for;

    for my $route (@{ $p->routes })
    {
        my $template = $route . '.tt';

        method $route => sub
        {
            my ($self, $c) = @_;
            my $args       = $c->req->args;
            my $page       = $args->[0] || 'index';

            return $c->res->redirect( "/$route/$page" ) if $page =~ s/\..+$//;

            $c->stash(
                template => $template,
                page     => $page,
            );
        };

        $class->config( action => { $route => { Local => '' } } );
    }
};

1;

A few things stand out. This role has two required parameters, an array reference of routes to add to the composing controller and the name of the controller. The latter will be obvious shortly.

Each route corresponds to a single template file which itself has a page attribute. I do this so that each subdirectory can have its own layout. (You can accomplish this in many ways with Template Toolkit, but this approach works best for me right now.)

This action doesn't use Catalyst's own path handling to handle any arguments, so it effectively takes as many path components as you can provide. If there's no path component beyond the name of the route, this request defaults to the index page. (The redirection is a temporary measure because the system has a few links to static pages such as /about/jobs.html we're changing to /about/jobs.)

The only remaining interesting part of the code is the call to the controller's config() method. This is the reason for the for parameter to this role. Because these controller methods come from the role, and because the actual body of the method is a closure, Catalyst can't easily process the function attributes you'd normally use to select special behaviors. I want to define these methods like:

sub about   :Local { ... }
sub contact :Local { ... }

... and so the alternate approach is to set configuration parameters for each method. That's all the final line of code in the role application block does.

Using this code is easy. My root controller contains a single line:

with 'Stockalyzer::Controller::Role::HasStaticPath'
    => { routes => [qw( about strategy )], for => __PACKAGE__ };

The best part is that if I want to add a new subdirectory, mapping it in Catalyst means adding a single entry to this list. Better yet, if I want to add features to these subdirectories (and I do, per How Would You Track User Behavior with Plack and Catalyst?, for the purpose of cohort logging), I can add it in one place.

All of this demonstrates one of my favorite features of Modern Perl: well-designed abstractions are available when they're necessary. (One of my other favorite features of Modern Perl is that someone's probably already done this, and I just have to find the right plugin on the CPAN to make it happen.)

Why I Use Perl: Testing


Of all the languages I've used for pay and for hobby, none compare to Perl in terms of testing culture and ecosystem.

Sure, with a few seconds and your favorite search engine anyone can find countless examples of awful code written by people who had no concern for writing good code. (That's not a language problem.) Sure, you can find countless examples of Perl code written to the standards of 1992 with little regard for documentation or formatting or robustness or even the minimum effort at basic procedural programming. (That's not a language problem.)

Shameless plug: I wrote a book called Modern Perl. You can buy Modern Perl: the book in paperback from Amazon (and other booksellers) or buy Modern Perl: the book in Kindle format—or read it online or download it as PDF or ePub for free. Furthermore, if you'd like to talk about how to improve the testing of your product or project, I am available for consulting.

One of the reasons we can talk about such things as Modern Perl is due to the quality of Perl's testing culture. I had a seat in the front row for almost all of the Perl testing and quality revolution, starting in 2000.

Actually it starts in 1987. If you find and download a Perl 1.0.x tarball, you'll see that it includes a tiny test harness and a series of language tests. This predates the notion of Test-Driven Development. It predates even Test-First development. (As far as I can tell, it even predates the invention of SUnit, the Smalltalk test framework that inspired xUnit, arguably the most popular style of testing in most languages.)

Update: As Larry himself said in a 1997 Larry Wall interview with Linux Journal:

You can restructure all your code into modules and unit test it in a jiffy because the Perl interpreter is so handy to invoke.

In 2000 and 2001, Perl 5 started taking testing more seriously. Even though Perl 5 has no formal specification outside of "Whatever the implementation does, as long as the documentation agrees," a group calling itself Perl QA took up the banner of developing tests, a testing system, and a testing culture to help Perl grow and evolve safely and on purpose through the next phase of its life.

As part of that process, Michael Schwern and I developed a library called Test::Builder to unify the internals of multiple test libraries and to allow future test libraries to share the same backend.

It's been wildly successful.

It's been so successful that you can download from the CPAN today hundreds of testing libraries which are all composable and play together nicely in the same process in the same file. They all work together with the standardized test harness and reporting tools because the Perl world does agree on formal standards. See TAP.
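A tiny example of that composability, using two Test::Builder-backed libraries in the same file (Test::More and Test::Exception here, but dozens of others mix just as freely):

use Test::More;
use Test::Exception;

ok 1, 'plain assertions emit TAP';

lives_ok { my $sum = 1 + 1 } 'exception assertions share the same TAP stream';

done_testing();

Both libraries report through the same Test::Builder singleton, so the harness sees one coherent stream of results.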

(You don't even have to use Perl to take advantage of TAP. I've written TAP emitters and Test::Builder libraries in multiple languages.)

That's one area of success. Another area of success is the adoption of testing and testing tools by people who don't write testing tools. (Of course people in Perl QA would use these tools, but if they never reach anyone else, what's the point?)

After Schwern and I made Test::Builder, I started to work on the test coverage for Perl 5.8 and its core library. So did other people. The number of tests of the core language and its libraries quadrupled. So did the quality of those tests, as the adoption of newer, better test libraries improved. So did the quality of those tests and those libraries as we gained experience writing good tests and understanding how to write better tests.

The quality and test coverage of CPAN and deployed Perl applications improved, too.

As I wrote in Why I Use Perl: Reliability, it's reasonable to expect that you can install a new version of Perl 5, run all of the tests for CPAN dependencies, run all of the tests for your application, and everything will just work. This isn't magic. It's science.

As part of the process of developing Perl 5, a few people run automated test runs of the CPAN against commits to the new version of Perl 5 in progress. Read that sentence again carefully. Automated processes can tell you when a commit to Perl 5 in progress causes a test to fail on the CPAN—not merely one of Perl 5's core language tests or a test in the Perl 5 core library, but a test in a CPAN distribution. An automated process will notify the maintainer of that CPAN distribution as well as the developers of Perl 5 with a link to the offending commit.

The collective test suite of the CPAN (as of this writing, 108889 modules in 25473 distributions, for a collection of millions of tests) is the continuous integration test suite of the Perl 5 language itself.

Similarly, a larger army of automated test runs reports the test results of new CPAN uploads against a huge array of platforms and Perl 5 versions. This is CPAN Testers. Within a few minutes of uploading a new distribution to the CPAN, you may get back a test result. Within a couple of days, you will have plenty of test results.

The infrastructure is there. The will to quality is there. The history of encouragement with code and documentation and tools and community expectation is there. The onus is on people writing Perl to take advantage of the testing ecosystem to write the right code.

(Did I mention that other great tools exist to test your code coverage, to test the coverage of your documentation, and even to add tests for coding standard and awkward semantic violations?)

I've written code in a lot of languages. I've debugged code in all of those languages. I've used TDD in most of those languages (excluding Postscript and minicomputer BASIC). While I've seen a focus on good testing in many of those language communities, Perl's is the only language community I've seen that takes testing really seriously.

(Addendum: the existing Perl Testing book is still decent, but I have this recurring notion of writing a new one. If I started a Kickstarter project to gauge interest, would you pay for early access to electronic versions of the book in progress?)

Most of the interesting problems of programming are a combination of applied theory and the messy details of your specific problem.

In From ODT to POD with Perl I demonstrated the use of open classes to solve a transliteration problem of walking an existing tree of objects. That's the applied theory. When you already have parent and child relationships expressed in an existing object model, representation and traversal are simple. Similarly, when you know specific details of what you want to emit when you visit each object in such a tree, emitting the right things is simple.

The messy details in this problem come from figuring out what to emit. The code I posted has something called "style mappings" which translate the names of styles defined in the header of the ODT to POD styles.

(Remember that this code is a one-off conversion program, so it can be a little bit messy and it can live in a single file. It also doesn't have to be 100% perfect. We can edit out a few nits if it gets a couple of small things wrong, because the goal of the program is to save us many hours of work.)

Remember that the styles in the ODT file look like:

<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
    <style:paragraph-properties fo:background-color="#666699">
        <style:background-image />
    </style:paragraph-properties>
    <style:text-properties fo:color="#ffffff" style:font-name="Calibri" fo:font-size="14pt" fo:font-weight="bold" style:font-size-asian="14pt" style:font-weight-asian="bold" style:font-size-complex="14pt" style:font-weight-complex="bold" />
</style:style>

This XML is reasonably sane; it's reasonably regular. I went through the XML for a file formatted with our document template and made a list of the formatting characteristics of every unique markup element we want to express in our PseudoPod output. For example, every element we'll express as a header has a specific background color. Every element we'll express with the code tag (C<>) has a font name of "Courier New". Of two tags (style:paragraph-properties and style:text-properties, siblings under a style:style tag), I care about six attributes:

  • fo:background-color, the background color of the paragraph
  • fo:color, the color of the font itself
  • fo:font-weight, any boldness of the font
  • style:font-name, the name of the font
  • fo:font-style, any italicizing of the font
  • fo:font-size, any change in the font's size

The various allowed combinations of those six attributes for any of the defined styles produce the names of the methods to call to emit the appropriate POD for text nodes which have those styles applied.

This is easier to see with code:

sub get_xml_style_methods
{
    my $xpath   = shift;
    my $nodeset = $xpath->find( '//style:style' );
    my %styles  =
    (
        Standard => 'toPodForPlain',
        Empty    => 'toPodForEmpty'
    );

    for my $node ($nodeset->get_nodelist)
    {
        my $style_name = $node->getAttribute( 'style:name' );
        my $paraprops  = $node->find( './style:paragraph-properties' );
        my $bgcolor    = $paraprops
                       ? $paraprops->shift
                                   ->getAttribute('fo:background-color')
                       : '';
        my $textprops  = $node->find( './style:text-properties' );

        my ($color, $weight, $name, $style, $size) = ('') x 5;

        if ($textprops)
        {
            my $text_node = $textprops->shift;
            my $maybe     = sub
            {
                $text_node->getAttribute( shift ) || ''
            };

            $color        = $maybe->( 'fo:color'       );
            $weight       = $maybe->( 'fo:font-weight' );
            $name         = $maybe->( 'style:font-name');
            $style        = $maybe->( 'fo:font-style'  );
            $size         = $maybe->( 'fo:font-size'   );
        }

        my @properties;

        if ($bgcolor)
        {
            @properties     = 'Head0'  if $bgcolor eq '#666699';
            @properties     = 'Head2'  if $bgcolor eq '#9999cc';
        }
        else
        {
            push @properties, 'Code'   if $name    eq 'Courier New';
            push @properties, 'Bold'   if $weight  eq 'bold';
            push @properties, 'Italic' if $style   eq 'italic';
            @properties     = 'Head1'  if $size    eq '14pt';
        }

        @properties          = 'Plain' unless @properties;
        my $type             =  $style_name    =~ /^P/
                             && $properties[0] !~ /^Head/
                             ? 'Para'
                             : '';
        $styles{$style_name} = 'toPodFor'
                             . join( '', sort @properties ) . $type;
    }

    return \%styles;
}

Even though in theory any style could have any combination of those six attributes, in practice our documents are very similar and consistent. Any background color immediately and exclusively marks a style as a toPodForHead\d style.

(I like the use of the closure $maybe to abstract away the || '' repetition. Learn Scheme!)

I didn't mention that the names of styles have the form P\d+ or T\d+ in the ODT file. These names seem to signify whether the style applies to a block-level element (a paragraph as a whole) or a snippet of inline text (a word in monospace or italicized text, for example.) Rather than dealing with those individually, I glom them all together and append the word Para to those styles which apply to paragraphs as a whole. It's a shortcut. It's a little messy.

I did offer a small apologia for this code not being fully abstracted and factored as far as you might like, but note that this code is fully encapsulated into its own function. It takes an XML::XPath object and returns a hash of style names to transliteration method names. It doesn't mess with global state.

That's important because:

#!/usr/bin/env perl

use Modern::Perl;
use autodie;

use Archive::Zip;
use HTML::Entities;
use Regexp::Assemble;

use XML::XPath;
use XML::XPath::XMLParser;

exit main( @ARGV );

sub main
{
    my @files = @_;
    @files    = get_all_files() unless @files;

    for my $file (@files)
    {
        my $xml = get_xml_contents( $file );
        my $pod = rewrite_xml( $xml );
        write_as_pod( $file, $pod );
    }

    return 0;
}

sub get_all_files { <*.odt> }

sub get_xml_contents
{
    my $file    = shift;
    my $zip     = Archive::Zip->new( $file );
    my $content = $zip->memberNamed( 'content.xml' );
    return $content->contents;
}

...

I could let this program work on one file at a time, but I'd rather loop within main() over any and all files given as arguments than write a loop in shell. I do want to make that hash of style names to transliteration methods available somehow globally, but only while processing each file individually, so closures again come to the rescue:

{
    my $style_methods;

    sub set_methods_for_styles   { $style_methods = shift    }
    sub clear_methods_for_styles { undef $style_methods      }
    sub get_method_for_style     { $style_methods->{ $_[0] } }
}

sub rewrite_xml
{
    my $contents = shift;
    my $xpath    = XML::XPath->new( xml => $contents );

    set_methods_for_styles( get_xml_style_methods( $xpath ) );
    my $pod           = xml_to_pod( $xpath );
    clear_methods_for_styles();

    return $pod;
}

Hiding that global state behind an accessor with a writer and clearer means that only one place has to reset that global state between working on files. If I were to write this program with more robustness, I'd add an exception handler around the call to xml_to_pod() so that the clearer always gets called (unless something catastrophic happens, like a segfault or my server falling into a micro black hole), but the two types of problems I've had in this code are compilation-killing typos and missing transliteration methods. Both kill the program effectively enough that they need immediate fixing.
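A hedged sketch of that more robust version of rewrite_xml(), using Try::Tiny's finally block (the program would also need to load Try::Tiny):

use Try::Tiny;

sub rewrite_xml
{
    my $contents = shift;
    my $xpath    = XML::XPath->new( xml => $contents );

    set_methods_for_styles( get_xml_style_methods( $xpath ) );

    my $pod;
    try     { $pod = xml_to_pod( $xpath ) }
    catch   { warn "Conversion failed: $_" }
    finally { clear_methods_for_styles() };

    return $pod;
}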

All in all, this program was a lot easier to write than I thought it would be. I'd written a regex-based converter from Google Docs HTML output to PseudoPod before and it was awful. This approach was a lot easier. Much credit goes to ODT for being relatively sane in its XML, but much credit goes to the ClubCompy documentation for having a regular (if informal) template. Perl deserves a lot of credit for great XML processing modules, open classes, dynamic dispatch, and other higher level programming techniques that let clever people make little messes to make difficult problems tractable.

If you want to solve this problem with less monkeypatching, Robin Smidsrød's XML::Rabbit looks very effective. The essential messiness of the problem doesn't go away, but creating an effective object model from an XML document with XPath will get you halfway to the right solution.

From ODT to POD with Perl

| 2 Comments

When I have to convert data between formats, I reach for Perl. While many people think Perl's built-in regular expressions make data munging easy, my experience is that Perl's multi-paradigm nature and dynamic programming flexibility are more important.

The Problem

I help run ClubCompy, a retro-inspired, zero-installation, browser-based programming environment designed to help children learn about computing. One of the reasons they recruited me was to design the educational components, including documentation. (I also know a few things about compilers and business.)

While ClubCompy has a surprising amount of power in its underlying virtual machine, that power is currently exposed in a programming language called Tasty—a mixture of 8-bit BASIC and Logo.

As with most systems which evolve from a simple idea into something else, following the law of opportunism, the project's structure, organization, and tooling have accreted organically instead of following a rigid design. (Startup hackers: your job is to prune things when necessary until you discover the core of your business.) In particular, the documentation for the Tasty language exists in a series of OpenOffice files, one per language keyword.

The good news is that documentation exists. It's mostly complete, too: every keyword has documentation, and most of it is comprehensive. (Maybe 15 or 20% needs expansion, but we'll get there.)

The bad news is that the documentation exists in .odt files. They're not binary blobs, but they don't fit with our publishing system: they're too difficult to convert to clean PDF or very clean HTML for use throughout the system. They're also a mess when checked into source control.

Monday I decided to convert them to POD. (ClubCompy uses the Onyx Neon publishing toolchain designed for things like Modern Perl: the book. Everything not yet available on the CPAN is available from my Github account.)

Inside ODT Files

An OpenOffice .odt file is a zipped archive of several other files. Fortunately, there's only one file I care about and very fortunately, it's a reasonably self-contained XML file. Getting the contents of content.xml is easy with a little bit of Archive::Zip code:

use Archive::Zip;

sub get_xml_contents
{
    my $file    = shift;
    my $zip     = Archive::Zip->new( $file );
    my $content = $zip->memberNamed( 'content.xml' );
    return $content->contents;
}

All of the Tasty keywords follow a standard template for documentation. This is both good and bad. It's good that figuring out how OpenOffice represents each unique element in XML is relatively easy: figure it out once and that representation should apply to all files. It's bad that the documentation template didn't use custom semantic styles, like "Top-level Header" and "Program Code".

That means all of the styles are ad hoc:

<office:automatic-styles>
        <style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
            <style:paragraph-properties fo:background-color="#666699">
                <style:background-image />
            </style:paragraph-properties>
            <style:text-properties fo:color="#ffffff" style:font-name="Calibri" fo:font-size="14pt" fo:font-weight="bold" style:font-size-asian="14pt" style:font-weight-asian="bold" style:font-size-complex="14pt" style:font-weight-complex="bold" />
        </style:style>
        <style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard" style:master-page-name="">
            <style:paragraph-properties fo:margin-left="0.2602in" fo:margin-right="0in" fo:text-indent="0in" style:auto-text-indent="false" style:page-number="auto" fo:background-color="#9999cc">
                <style:background-image />
            </style:paragraph-properties>
            <style:text-properties fo:color="#ffffff" style:font-name="Calibri" fo:font-size="12pt" fo:font-weight="bold" style:font-size-asian="12pt" style:font-weight-asian="bold" style:font-size-complex="12pt" style:font-weight-complex="bold" />
        </style:style>
        ...
</office:automatic-styles>

I'll explain that more later.

The actual text of each file resembles:

<office:body>
        <office:text>
            <text:sequence-decls>
                <text:sequence-decl text:display-outline-level="0" text:name="Illustration" />
                <text:sequence-decl text:display-outline-level="0" text:name="Table" />
                <text:sequence-decl text:display-outline-level="0" text:name="Text" />
                <text:sequence-decl text:display-outline-level="0" text:name="Drawing" />
            </text:sequence-decls>
            <text:p text:style-name="P1">Keyword</text:p>
            <text:p text:style-name="P9">WHILE<text:span text:style-name="T1">-DO</text:span>
            </text:p>
            <text:p text:style-name="P8">
                <text:span text:style-name="T2">END</text:span>
            </text:p>
            ...
        </office:text>
   </office:body>

All of the text of the documentation is available under <text:p> tags.

Extracting Text

Extracting this text is a job for XPath. While I could get more specific with the XPath expression (find all direct children of <office:text>), I went for the simple solution at first:

use XML::XPath;
use XML::XPath::XMLParser;

sub rewrite_xml
{
    my $contents = shift;
    my $xpath    = XML::XPath->new( xml => $contents );

    set_methods_for_styles( get_xml_style_methods( $xpath ) );
    my $pod           = xml_to_pod( $xpath );
    clear_methods_for_styles();

    return $pod;
}

sub xml_to_pod
{
    my $xpath   = shift;
    my $nodeset = $xpath->find( '//text:p' );

    my $pod;

    for my $node ($nodeset->get_nodelist)
    {
        my $style  = $node->getAttribute( 'text:style-name' );
        $style     = 'Empty' if @{ $node->getChildNodes } == 0;
        my $method = get_method_for_style( $style );
        $pod      .= $node->$method;
    }

    return $pod;
}
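
For comparison, the more specific expression mentioned above would restrict the search to direct children of <office:text>. This is only a sketch; the simpler query has worked fine so far:

    my $nodeset = $xpath->find( '//office:text/text:p' );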

Ignore the get_method_for_style() calls for now. The important part of xml_to_pod is that it finds these tags in the XML and performs an action on each of them.

What's that action? Transforming it to POD, of course.

Look in the sample XML again. Each of the paragraphs has an associated style tag. That style refers to one of the styles declared earlier in that file. Given the name of a style, the body of the loop finds the name of a method and calls that method to transliterate the contents of that tag to POD.

Transliterating to POD

Here's where the power of Perl really shines. Every node in that nodeset is an instance of XML::XPath::Node::Element. That class knows nothing about POD. At least, it knows nothing about POD until I declared some methods in it:

package XML::XPath::Node::Element;

sub kidsToPod { join '', map { $_->toPod } shift->getChildNodes }

sub toPod
{
    my $self   = shift;
    my ($name) = $self->getName =~ /text:(\w+)/;
    my $method = 'toPodFor' . ucfirst $name;
    return $self->$method;
}

sub toPodForEmpty { ''     }
sub toPodForS     { ' '    }
sub toPodForTab   { '    ' }

sub toPodForSpan
{
    my $self   = shift;
    my $style  = $self->getAttribute( 'text:style-name' ) // '';
    $style     = 'Empty' if @{ $self->getChildNodes } == 0;
    my $method = main::get_method_for_style( $style );

    return $self->$method;
}

sub toPodForBold         { 'B<' . shift->kidsToPod . '>' }
sub toPodForCode         { 'C<' . shift->kidsToPod . '>' }
sub toPodForCodePara     { '    ' . shift->kidsToPod . "\n" }
sub toPodForItalic       { 'I<' . shift->kidsToPod . '>' }
sub toPodForPlain        { shift->wrapKids( '', '' ) }
sub toPodForPlainPara    { shift->wrapKids( '', '' ) . "\n\n" }
sub toPodForBoldCode     { 'C<B<' . shift->kidsToPod . '>>' }
sub toPodForBoldCodePara { 'C<B<' . shift->kidsToPod . ">>\n" }

sub toPodForHead0 { shift->wrapKids( '=head0 ', "\n\n" ) }
sub toPodForHead1 { shift->wrapKids( '=head1 ', "\n\n" ) }
sub toPodForHead2 { shift->wrapKids( '=head2 ', "\n\n" ) }

sub wrapKids
{
    my ($self, $pre, $post) = @_;
    my $kid_text            = $self->kidsToPod;
    return '' unless $kid_text;

    return $pre . $kid_text . $post;
}

Because Perl has open classes, you can add methods to classes (or redefine methods) any time you want. Because Perl has dynamic method dispatch, you can use a string as the name of a method to call.
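
Both features show up in miniature here. These few lines aren't part of the converter at all; the method name is a throwaway illustration, and $node stands in for any element node:

# open classes: add a method to someone else's class, from my own file
package XML::XPath::Node::Element;
sub elementNameInCaps { uc shift->getName }

package main;

# dynamic dispatch: the method name is just a string in a variable
my $method = 'elementNameInCaps';
print $node->$method, "\n";    # assumes $node is an XML::XPath::Node::Element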

You can see that the transliteration code gets a little bit messy here. That's part and parcel of the tree transformation technique central to compilers; the real world is messy, and that mess has to go somewhere.

The wrapKids() method handles the case where one of these nodes has no textual content but does have a specific style. Given a snippet of documentation like:

Example 1:
    10 x = 0
    20 WHILE x LT 26 DO
    30  PRINT TOCHAR x + 65
    40  x = x + 1
    50 END
    RUN

    (prints ABCDEFGHIJKLMNOPQRSTUVWXYZ)

... the blank line between RUN and the output is a unique paragraph with the monospace font applied. A naïve output from one of these methods might produce the POD C<> for that line. wrapKids() prevents that.

This open class approach works very well. It scales well too in terms of complexity. Even if this code eventually migrates to build a POD document model (see Pod::PseudoPod::DOM), giving individual nodes the responsibility of emitting a tree or text moves the custom behavior to where it most belongs.

(The benefit of a DOM is that basic tree transformation rules can take care of pruning out unnecessary elements, such as the blank code line.)

The Little Details

The XML::XPath::Node::Elements may nest, but you can see how that nesting works just fine through the toPod() method. Those ::Element classes may themselves also contain XML::XPath::Node::Text instances as children. These objects represent plain text.

So far, I've only found one situation where this plain text needs any manipulation. Adding one method fixes this:

package XML::XPath::Node::Text;

sub toPod
{
    my $raw_text = HTML::Entities::decode_entities( shift->toString );
    return main::encode_pod( $raw_text );
}

The encode_pod() function (it's in main so as not to make it available as a method inadvertently) is:

use Regexp::Assemble;

my %escapes =
(
    '<' => 'E<lt>',
    '>' => 'E<gt>',
);

sub encode_pod
{
    state $replace = make_regexp( \%escapes );

    my $text = shift;
    $text =~ s/($replace)/$escapes{$1}/g;

    return $text;
}

sub make_regexp
{
    my $escapes = shift;
    my $ra      = Regexp::Assemble->new;
    $ra->add( $_ ) for keys %$escapes;
    return $ra->re;
}

More robust solutions exist, but so far this is all I've needed.
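
For instance, if the documentation ever grows L<> links or literal pipes, the same table could pick up the other standard POD escapes. This is only a sketch of how %escapes might grow; the converter doesn't need it yet:

my %escapes =
(
    '<' => 'E<lt>',
    '>' => 'E<gt>',
    '|' => 'E<verbar>',    # a literal pipe, significant inside L<...>
    '/' => 'E<sol>',       # a literal slash, significant inside L<...>
);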

I do admit that the implementation is a little messy in places. That's one of the problems with this compiler technique: sometimes you have data that needs to be available everywhere, but you don't want to pass it as an argument everywhere, and you don't want to wrap everything in intermediary objects when you're already using perfectly good objects from elsewhere.

I haven't shown the code which identifies styles and makes the hash of style name to output method yet; that's for the next post. I'm sure you can start to figure out how it works already.

Inertia + Velocity = Momentum

During the discussions which led to Parrot and Rakudo having monthly releases (and which helped lead Perl 5 to monthly bleadperl releases and yearly major releases), I focused on predictability. This concern has two parts. Users can plan their upgrades around the release schedule and can see how the project makes and meets public commitments. Developers can find a rhythm in the schedule to help them plan testing and merging and scheduling of changes.

Another overlooked aspect of a reliable (and relatively short) release cadence comes from physics. A project at rest tends to stay at rest. A project in motion tends to stay in motion. (Dividing the viability and utility of a project between its users and its developers seems useful. Think of "motion" as both "used productively by users" and "under active development".)

That's inertia. What's more important than inertia? Momentum.

A project at rest takes effort to put in motion again. When it begins to move, its momentum is the product of its inertia (how much it tends to stay in motion) and its velocity (the progress it makes). Similarly, a project already in motion has a momentum that is the product of its inertia (how likely it is to stay in motion) and its velocity.

We want to increase momentum for our projects in progress.

Obviously the direction of the motion is important. A project picking up momentum in the wrong direction is in trouble, just as a project losing momentum is in trouble even when it's headed in the right direction.

Momentum as a metaphor has another good quality going for it. Because it's a combination of velocity and inertia, a project can have good momentum from a little velocity and a lot of inertia just as much as from a lot of velocity and a little inertia. A small library everyone uses doesn't need a lot of velocity to remain viable. A new project used nowhere needs a lot of velocity to make it to the point where it is viable.

This is why it's difficult to watch a project slowly lose developers. If a project is mature—if it's reaching a point where maintenance is more important than new development—then inertia can take over from velocity. If a project hasn't reached that point, losing velocity almost certainly means momentum will slow or even stop.

One of the worst things you can do in your project is to impede momentum. You've probably heard of Joel Spolsky's excoriation of the Big Rewrite, but you may not have read Adam Turoff's Long Now take on rewriting software. Spolsky's point is about the momentum of the short term. If you reduce your project's inertia to zero, you'll face a superhuman effort to get it back. Turoff's point is about the long term. If you can survive the short term, you may be better off.

(Mozilla is the tofu of project management. It takes on the flavor of any argument you pair with it.)

I wish I could give more practical advice than throwing out a metaphor, but as I look over the history of Parrot and P6, this metaphor fits better than most. Parrot has, over its history, had a lot of velocity but it never had much momentum. One of its biggest failures is not running a working P6 implementation from the start. (Sure, it had some attempts in that direction, but Pugs was the first attempt at anything actually serious, and that wasn't a Parrot project at all.) One of Parrot's worst failures is that we tried to give it inertia beyond P6 and failed miserably.

One of P6's failures is that it's never achieved much in the way of moving inertia. In the past couple of years Rakudo has improved its velocity greatly, but that's been at the expense (at least in opportunity costs) of Parrot, and thus it's undercut its inertia. Years of work have gone into getting Rakudo to more or less the same partially-implemented state on multiple backends, and that state is still not ready for general use.

Certainly the cost of development is important, but so is the likelihood of success. Can you analyze the state of a project in terms of its inertia (for its developers and its users) as well as its velocity? I believe a project's viability hinges on its momentum. You can draw the four quadrants and plot your project to see the risks involved.
