API as Documentation

In all of the silly kerfuffle about how awesome rock star pirate ninjas all write their own domain-specific languages, people sometimes can't see past all of the bravada falsa for a serious point. (That's excusable: when your reason for doing something is to show off how awesome you want people to think you are, you don't always recognize the useful things that you incidentally happen to create.)

When "writing a DSL" becomes less about "Look how awesome I am too" and more about "I want to simplify this code" or "I can make this easier to use" or "There's an abstraction here that removes a lot of boilerplate" or "It's safer to write code this way", you can sort of edge sideways into realizing that the right API can describe your problem and the solution in a way that the wrong or at least the naïve API cannot.

I wrote some code to migrate data from a SQLite database to a PostgreSQL database. (DBIx::Class::Migration helped, but DBIx::Class::Fixtures turned out not to work for uninteresting technical reasons. I've filed a couple of bugs on this before and the maintainers have fixed them, but the constraints of this particular project are way outside what that module can reasonably handle.)

The easiest solution that would work in the allotted time and space was to write my own importer from CSV files dumped out of SQLite into PostgreSQL. The only problem was matching foreign keys. (Yes, I know about deferred constraints and bulk loading. Unfortunately, SQLite's laxity made the dataset a little less robust than I wanted, hence the move to PostgreSQL.)

The CSV files contain a primary key for most tables: an articles table might have an article_id column, and a references table might refer to an article by its article_id. Inserting tables in dependency order makes foreign key resolution much easier... unless you let the database remap primary keys. Once the database hands out new ids, every foreign key in the CSV files still holds the old one.

I used Text::CSV_XS to fetch each row from the CSV files. That gives a $row which holds a reference to an array of the values for that row of the table. Some of those values are foreign keys which the code must map to the new ids.

You probably already know what's coming:

    while (my $row = $csv->getline( $fh ))
    {
        # remove the old id; the database assigns a new one
        my $prev_id = shift @$row;
        $sth->execute( @$row )
            or die( $sth->errstr . " (@$row)\n" );

        # remember which new id replaced the old one
        $ids{$table}{$prev_id} =
            $dbh->last_insert_id( undef, undef, $table, 'id' );
    }
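The same bookkeeping can be tried without a database. This is a minimal sketch, not the importer itself: the fake_insert() helper, the %next_id counter, the images table name, and the sample rows are all made up for illustration. Only the shape of %ids follows the loop above.

```perl
use strict;
use warnings;

my %ids;        # $ids{table}{old id} = new id
my %next_id;    # hypothetical stand-in for the database's id sequence

# Hypothetical replacement for $sth->execute plus $dbh->last_insert_id:
# pretend to insert the row and return the id the database would assign.
sub fake_insert
{
    my ($table, @values) = @_;
    return ++$next_id{$table};
}

# Sample rows shaped like parsed CSV rows: old id first, then the data.
my @rows = ( [ 17, 'cat.png' ], [ 42, 'dog.png' ] );

for my $row (@rows)
{
    # remove the old id, insert the rest, and record the mapping
    my $prev_id = shift @$row;
    $ids{images}{$prev_id} = fake_insert( images => @$row );
}

printf "%d => %d\n", $_, $ids{images}{$_}
    for sort { $a <=> $b } keys %{ $ids{images} };
```

With these two rows, the old ids 17 and 42 end up mapped to 1 and 2.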

As you probably guessed, a hash maps existing IDs from rows to their new IDs as inserted into the database. To make this work, the code has to perform a fixup (see Everything is a Compiler!) which fixes the foreign keys for every row in every table. For example:

sub fix_entry_images
{
    my ($ids, $row) = @_;
    swap( $row, $ids, images  => 0 );
    swap( $row, $ids, entries => 1 );
}

This is all really boring code, except for swap(), which is exceedingly boring code:

sub swap
{
    my ($row, $ids, $name, $pos) = @_;
    return unless $row->[$pos];
    $row->[$pos] = $ids->{$name}{ $row->[$pos] };
}

That would be easy to write inline in each of these functions, but look again at its use:

    swap( $row, $ids, images  => 0 );
    swap( $row, $ids, entries => 1 );

Yes, that's shorter than writing it inline, but it's also a lot clearer. I had to fix a couple of bugs (I wrote at least two when I first wrote this code), and it was immediately obvious what they were when I saw what I'd written wrong. (I had the pluralization of a table name wrong, because I had repeated table names in multiple places.)
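For a concrete picture, here is a minimal, self-contained run of swap() against made-up data. The id maps, the row contents, and the caption column are hypothetical. The last call sketches what a wrong table name (singular where the map uses the plural) does under these assumptions: the lookup finds nothing and the column quietly becomes undef.

```perl
use strict;
use warnings;

# Same swap() as above: replace the old id at $pos with its new id.
sub swap
{
    my ($row, $ids, $name, $pos) = @_;
    return unless $row->[$pos];
    $row->[$pos] = $ids->{$name}{ $row->[$pos] };
}

# Hypothetical id maps built while loading the images and entries tables.
my $ids = {
    images  => { 17 => 1, 42 => 2 },
    entries => { 5  => 9 },
};

# A hypothetical entry_images row: old image id, old entry id, caption.
my $row = [ 42, 5, 'front page' ];

swap( $row, $ids, images  => 0 );
swap( $row, $ids, entries => 1 );

print "@$row\n";    # prints "2 9 front page"

# The same row with the table name misspelled.
my $bad = [ 42, 5, 'front page' ];
swap( $bad, $ids, image => 0 );    # typo: the map key is "images"

print defined $bad->[0] ? "mapped\n" : "lost the image id\n";
```

Because the table name sits right next to the column position in each call, a typo like this one is easy to spot when reading the fixup functions.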

I'm not silly enough to claim that a single function definition makes a DSL or pidgin or embedded language. Not at all! But writing a function here, and making it at least somewhat obvious what's going on and why, makes it practical to see a bug and fix it everywhere it occurs.

I've seen a lot of novice code that packs functions full of as much code as the coder can keep in his or her mind at a time. My functions and methods are as small as possible—sometimes as small as I can imagine them to be while still giving them names. I've learned this the hard way: even if I don't intend to reuse a function from multiple places, the discipline of giving it and its arguments distinctive and sensible names forces me to understand what's really going on.

As in this case, if I'm careful about all of this information, it also can help make what's happening—and why—clear.

2 Comments

Ron Savage | November 12, 2012 1:05 PM

Would DBIx::Admin::BackupRestore have worked?

Ron

Christian Walde | November 13, 2012 3:33 AM

This whole blog post seems to describe a guideline i've been following for a long while now:

If any paragraph of code deserves a comment to describe what it does, then it's very likely that it should be a sub, named after the comment.
