November 2012 Archives

Tests Have APIs Too

Good code tells a story.

Some code describes entities—perhaps you're designing objects and classes, or perhaps you're declaring data structures and types. Your code describes what you're working with, its essential attributes, and what you expect to do with it.

Other code describes rules and interactions. Perhaps you're writing methods or business rules or even grammars. Your code demonstrates what you're doing to your entities and, if you're very good, why you're doing it. Great programmers aspire to writing descriptive code, where what happens and why is so obvious that even novices can understand the details at a high level.

If you're like me, you might prefer to let high-level design emerge from the interaction of smaller components. You may have heard of the rule "Once, twice, refactor", which suggests that when you notice you're writing similar code for the third time, you unify all three elements into a single abstraction you can reuse. I try to follow that rule, because usually the requirements are solid enough at that point that I can extract a useful and usable abstraction from the concrete implementations.

I try to let that rule guide my designs in the large—not that I ignore large designs (I have lots of experience from which to draw, of course)—but that every problem is a little bit different. In a sense, my experience is a catalog of ways I've developed the APIs I use to solve each individual problem. Software design at that level is an exercise in producing usable APIs tuned to the specific problem domain.

(Software patterns, in the original sense, are a way of identifying the similar elements while allowing for local differences because of individual details.)

The same strategy applies to tests.

Where the basic unit of computation may be something like an if condition (give me an if statement and a way to read from and write to memory and I can eventually recreate a useful programming language), the basic unit of testing is probably the ok() function. It's a boolean assertion. It's true or false. Either the test passes or it doesn't.

Everything built on top of that ok() or assertTrue() or whatever your preference is an abstraction, and abstractions await discovery.

When Schwern and I discovered Test::Builder, we extracted that central behavior, ok(), from multiple places into a single abstraction that many other places could share. When I use Test::Class or Test::Routine to share behavior between individual tests, I do that because it affords abstraction.
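To make the idea concrete, here is a minimal sketch of assertions layered on a single boolean primitive. This is not how Test::Builder is implemented; the hand-rolled ok(), is(), and list_is() below are illustrative stand-ins that only show the shape of the abstraction.

```perl
use strict;
use warnings;

my $count = 0;
my @lines;

# The primitive: one boolean assertion with a description.
sub ok {
    my ( $passed, $description ) = @_;
    $count++;
    my $line = ( $passed ? 'ok' : 'not ok' ) . " $count - $description";
    push @lines, $line;
    print "$line\n";
    return $passed ? 1 : 0;
}

# Everything else is an abstraction over ok().
sub is {
    my ( $got, $expected, $description ) = @_;
    return ok( $got eq $expected, $description );
}

# Hypothetical helper: compare two flat lists of strings.
sub list_is {
    my ( $got, $expected, $description ) = @_;
    my $same = @$got == @$expected;
    for my $i ( 0 .. $#$expected ) {
        last unless $same;
        $same = $got->[$i] eq $expected->[$i];
    }
    return ok( $same, $description );
}

is( 2 + 2, 4, 'addition should work' );
list_is( [ sort qw( b a ) ], [qw( a b )], 'sort should order strings' );
print "1..$count\n";
```

Every richer assertion ends in a call to ok(), so the pass/fail bookkeeping lives in exactly one place.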

For the same reason I've invested time recently in making my individual tests read as clearly as possible—as clear as the rest of my code, if not clearer:

#!/usr/bin/env perl

use Modern::Perl;
use Test::More;
use MyApp::App::DedupEntries;
use MyApp::States ':entry';
use lib 't/lib';

use TestDB qw( init_entry get_schema );

exit main( @ARGV );

sub main
{
    test_entry_dedup_all_dupes();
    test_entry_dedup_some_dupes();
    test_entry_dedup_near_dupes();
    test_entry_dedup_far_dupes();

    done_testing;
    return 0;
}

I wrap individual tests in functions (or methods, depending on the test library). Each group of assertions has a name. Every assertion has a description. I usually extract setup and teardown code from test functions and methods into helpers so as to produce a named API for individual tests.
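As a sketch of what one of those named test functions might look like, assume two hypothetical helpers, make_entries() and dedup_entries(), standing in for the real TestDB setup code and the code under test. Neither comes from the actual application.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical setup helper: build entry records from titles.
sub make_entries { return [ map { +{ title => $_ } } @_ ] }

# Hypothetical code under test: drop entries whose titles differ only by case.
sub dedup_entries {
    my ($entries) = @_;
    my %seen;
    return [ grep { !$seen{ lc $_->{title} }++ } @$entries ];
}

sub test_entry_dedup_all_dupes {
    my $unique = dedup_entries( make_entries( ('Perl news') x 3 ) );
    is scalar @$unique, 1, 'identical entries should collapse to one';
}

sub test_entry_dedup_some_dupes {
    my $unique = dedup_entries(
        make_entries( 'Perl news', 'perl news', 'Other news' ) );
    is scalar @$unique, 2, 'only true duplicates should disappear';
    is $unique->[0]{title}, 'Perl news', '... keeping the first one seen';
}

test_entry_dedup_all_dupes();
test_entry_dedup_some_dupes();
done_testing;
```

The function name says which scenario is under test, the helper name says how the data got there, and each description says what should be true.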

This makes tests easier to write and to manage, but it also makes them easier to read, to maintain, and to debug.

(I've given tests to other developers to show them how to use APIs I've developed in the regular code.)

Testing has helped me improve my code measurably over the past decade in terms of quality and efficacy. Treating test code like I'd treat any other code has made a difference in my ability to write and maintain coherent, useful, and usable tests. It's perhaps the most effective way to sharpen your tools.

(You do need to recognize a greater need for simplicity in your tests, however, but that's a subject for another article.)

How Bugs Get Fixed

To write a robust program, you must manage the small details. Many of these details you will only discover when real people use your program with real data. A program becomes robust only when it can handle unanticipated cases sensibly and successfully.

If and when you evaluate a free software project for the first time, you have the opportunity to help that program become more robust, especially if the platform or data or tasks you have in mind are sufficiently different from those the authors have already explored.

In other words, if you find bugs, the authors may not be aware of them. If you do not report them, they may not get fixed.

I hesitate to suggest that you have an ethical obligation to report bugs, especially if you end up not using the software, but you ought to consider that unreported bugs may never get fixed.

(When I tried to revive the Perl SDL project several years ago, I couldn't find anyone willing or able to try to build the software on Windows and report back debugging information, even though I could find a handful of people willing to say "It doesn't work on Windows". As a consequence, the Perl SDL bindings only started to work on Windows thanks to the superheroics of Kartik Thakore and the other contributors.)

Note again that I did not write that you have an obligation to report bugs. Nor did I write that you cannot complain about software unless you have reported bugs. I merely wrote that unreported bugs tend to remain unfixed.

One of the most cogent criticisms of the book Design Patterns is that too many people read it as "here's a list of characteristics your software should exhibit" instead of "here's a catalog of common design elements many projects exhibit". Rather than spreading a common vocabulary, the patterns book became yet another set of buzzwords to use to spice up developer CVs.

The backlash should have been predictable; it's almost a law of physics by this point. For every grand unifying pronouncement about a shiny new way to improve software development, cue a grand group of people ready to disclaim it as unnecessary, overcomplicated, a warmed-over rediscovery of something at least thirty years old, or a silly way to sell books and consulting. In this case, the naysayers had a point. Suddenly, global variables weren't bad. They were instances of the Singleton pattern. Oh frabjous day.

(Fortunately in these enlightened times, we use inversion of control and dependency injection to make our singletons, and we control them with great swaths of barely-typed XML or JSON. We're so modern it hurts.)

... not that singletons are always bad.

Some concerns really are global. Your logging framework is probably a global concern. Your configuration file is probably a global concern. Here's a secret, though: these things probably oughtn't be mutable. The real concern is mutable global data. (A secondary concern is too much coupling to concrete instances, but that's a different article.)

Sometimes you can go a little too far the other direction, though.

I have a test suite that's too slow for my taste—about 2500 assertions which run in 70 seconds. I'd love to get that down to 20 seconds, but under a minute is definitely an improvement.

The project has a configuration file in config/settings.yaml and a corresponding Project::Config module which loads the settings and offers an API to its contents. Other modules within the system access the configuration file by loading the module and calling methods on it.

Because this configuration information is global to a process, the configuration module stores the data structure containing the configuration in a lexical variable global to the module:

package Project::Config;

my $config;

# import() runs once for every `use Project::Config` in every module
sub import
{
    $config = load_config( 'config/settings.yaml' );
}

1;

Everything was well and good until I saw YAML taking up more than 10% of the execution time of one of the test files. I traced it to the configuration module... which loaded the configuration file anew from its import() method.

For every other module in the system which used Project::Config, it dutifully re-read the configuration off of the disk.

The tests run a little faster now.
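I haven't shown the fix itself. One way to get the same effect, sketched here under assumptions, is to load lazily so that repeated import() calls reuse the first result. The _read_settings() function is a hypothetical stand-in for parsing config/settings.yaml; it only counts how often it runs.

```perl
use strict;
use warnings;

package Project::Config;

my $config;
my $reads = 0;

# Hypothetical stand-in for reading and parsing the YAML file.
sub _read_settings {
    $reads++;
    return { database => 'example', log_level => 'debug' };
}

# Called for every `use Project::Config`; only the first call reads.
sub import {
    $config //= _read_settings();
    return;
}

sub get        { my ( $class, $key ) = @_; return $config->{$key} }
sub read_count { return $reads }

package main;

# Simulate three different modules loading the configuration module.
Project::Config->import() for 1 .. 3;

my $reads_seen = Project::Config->read_count();
print "$reads_seen read(s)\n";
```

The defined-or assignment keeps the process-global behavior while making the expensive read happen at most once per process.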

This was a silly little pessimization anyone could have made, but it illustrates two interesting points. First, it shows that whoever wrote this code (I don't know who and I didn't look, because it could have been anyone) clearly had singleton concerns in mind, and rightly so. This is process-global data and it deserves to be available everywhere. Sure, the loading was a pessimization, but that's fixed now and everything still works. Success.

I take more interest in a subtler question: how should other modules within the system access this configuration data? The current access pattern is use Project::Config and call methods that way, but that demonstrates the concrete coupling problem I alluded to earlier, and it certainly exacerbated the multiple-loading problem I fixed. What if, instead, something external to the system could somehow inject an already-instantiated configuration object into the other entities, such that none of them had to couple themselves to the concrete module-name-to-filepath-mapping that eventually called Project::Config's import() and reloaded the configuration file?

Yes, that probably would have hidden the pessimization from my traces, but would that have mattered? It would also have hidden the effects of that pessimization.

That's not the only goal of my development process, but it's a benefit, and that's something to consider.
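A rough sketch of that injection, with a hypothetical Project::Mailer class and a plain hash reference standing in for the configuration object (neither is part of the real project):

```perl
use strict;
use warnings;

package Project::Mailer;

# The consumer receives its configuration; it never loads Project::Config.
sub new {
    my ( $class, %args ) = @_;
    die "config is required\n" unless $args{config};
    return bless { config => $args{config} }, $class;
}

sub sender { my ($self) = @_; return $self->{config}{sender} }

package main;

# Something outside the consumer builds the configuration exactly once...
my $config = { sender => 'noreply@example.com' };

# ...and hands the same object to every entity that needs it.
my $mailer = Project::Mailer->new( config => $config );
my $sender = $mailer->sender;
print "$sender\n";
```

A test can pass in whatever configuration it likes without touching the disk, which is the decoupling the question above is after.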

API as Documentation

In all of the silly kerfuffle about how awesome rock star pirate ninjas all write their own domain-specific languages, people sometimes can't see past all of the bravada falsa for a serious point. (That's excusable: when your reason for doing something is to show off how awesome you want people to think you are, you don't always recognize the useful things that you incidentally happen to create.)

When "writing a DSL" becomes less about "Look how awesome I am too" and more about "I want to simplify this code" or "I can make this easier to use" or "There's an abstraction here that removes a lot of boilerplate" or "It's safer to write code this way", you can sort of edge sideways into realizing that the right API can describe your problem and the solution in a way that the wrong or at least the naïve API cannot.

I wrote some code to migrate data from a SQLite database to a PostgreSQL database. (DBIx::Class::Migration helped, but DBIx::Class::Fixtures turned out not to work for uninteresting technical reasons. (I've filed a couple of bugs on this before and the maintainers have fixed them, but the constraints of this particular project are way outside what that module can reasonably handle.))

The easiest solution that would work in the allotted time and space was to write my own importer from CSV files dumped out of SQLite into PostgreSQL. The only problem was matching foreign keys. (Yes, I know about deferred constraints and bulk loading. Unfortunately, SQLite's laxity made the dataset a little less robust than I wanted, hence the move to PostgreSQL.)

The CSV files contain a primary key for most tables: an articles table might have an article_id column, where a references table might refer to an article by its article_id. By inserting tables in dependency order, foreign key resolution is much easier... unless you let the database remap primary keys.
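Dependency order can be computed rather than maintained by hand. The following is only a sketch: the %depends_on map and its table names are assumptions for illustration, and the depth-first walk does no cycle detection.

```perl
use strict;
use warnings;

# Hypothetical map: table name => tables its foreign keys refer to.
my %depends_on = (
    images       => [],
    entries      => [],
    entry_images => [ 'images', 'entries' ],
);

# Depth-first walk: emit a table only after everything it refers to.
sub insertion_order {
    my ($deps) = @_;
    my ( %seen, @order );

    my $visit;
    $visit = sub {
        my ($table) = @_;
        return if $seen{$table}++;
        $visit->($_) for @{ $deps->{$table} || [] };
        push @order, $table;
    };

    $visit->($_) for sort keys %$deps;
    undef $visit;    # break the closure's reference to itself
    return @order;
}

print join( ', ', insertion_order( \%depends_on ) ), "\n";
```

With this map the script prints "entries, images, entry_images", so both referenced tables are loaded before the table that points at them.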

I used Text::CSV_XS to fetch each row from the CSV files. That gives a $row which contains an anonymous array of values for the specific row in the table. Some of those values contain foreign keys which the code must map to the new ids.

You probably already know what's coming:

while (my $row = $csv->getline( $fh ))
{
    # remove the id
    my $prev_id = shift @$row;
    $sth->execute( @$row )
        or die( $sth->errstr . "(@$row)\n" );

    # remember which new id the database assigned to the old one
    $ids{$table}{$prev_id} =
        $dbh->last_insert_id( undef, undef, $table, 'id' );
}

As you probably guessed, a hash maps existing IDs from rows to their new IDs as inserted into the database. To make this work, the code has to perform a fixup (see Everything is a Compiler!) which fixes the foreign keys for every row in every table. For example:

sub fix_entry_images
{
    my ($ids, $row) = @_;
    swap( $row, $ids, images  => 0 );
    swap( $row, $ids, entries => 1 );
}

This is all really boring code, except for swap(), which is exceedingly boring code:

sub swap
{
    my ($row, $ids, $name, $pos) = @_;
    return unless $row->[$pos];
    $row->[$pos] = $ids->{$name}{ $row->[$pos] };
}

That would be easy to write in line in each of these functions, but look again at its use:

    swap( $row, $ids, images  => 0 );
    swap( $row, $ids, entries => 1 );

Yes, that's shorter than writing it inline, but it's also a lot clearer. I had to debug a couple of bugs (I wrote at least two bugs in this when I first wrote this code) and it was immediately obvious what they were when I saw what I'd written wrong. (I had the pluralization of a table wrong, because I had repeated table names in multiple places.)

I'm not silly enough to claim that a single function definition makes a DSL or pidgin or embedded language. Not at all! But writing a function here and making it at least somewhat obvious what's going on and why means that seeing the bug and fixing it everywhere it's present is very possible.
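Here is the same pair of functions run against made-up data, so you can watch the remapping happen. The ids and the row contents are invented for illustration; only swap() and fix_entry_images() come from the importer.

```perl
use strict;
use warnings;

sub swap
{
    my ($row, $ids, $name, $pos) = @_;
    return unless $row->[$pos];
    $row->[$pos] = $ids->{$name}{ $row->[$pos] };
}

sub fix_entry_images
{
    my ($ids, $row) = @_;
    swap( $row, $ids, images  => 0 );
    swap( $row, $ids, entries => 1 );
}

# Invented mapping: old id => new id, per table.
my %ids = (
    images  => { 7 => 101 },
    entries => { 3 => 55 },
);

# Invented row: image id, entry id, then an ordinary column.
my $row = [ 7, 3, 'caption text' ];

fix_entry_images( \%ids, $row );
print "@$row\n";
```

This prints "101 55 caption text": the two foreign keys now point at the new ids and the ordinary column is untouched.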

I've seen a lot of novice code that packs functions full of as much code as the coder can keep in his or her mind at a time. My functions and methods are as small as possible—sometimes as small as I can imagine them to be while still giving them names. I've learned this the hard way: even if I don't intend to reuse a function from multiple places, the discipline of giving it and its arguments distinctive and sensible names forces me to understand what's really going on.

As in this case, if I'm careful about all of this information, it also can help make what's happening—and why—clear.

When you work with other people on a project, you either end up adopting some of their coding styles or you end up not working with other people on that project. When I've worked with teams of developers instead of on my own, I've noticed that a lot of Perl code tends toward ProjectName::Parent::Class::Hierarchy::Package::Name unless very carefully pruned with something like roles.

I found myself in a situation like that recently, and I noticed it because I found myself typing class names like ProjectName::Order::Family::Genus::Species::Subspecies repeatedly. The obsessive automator in my brain said "You should find a way to shorten that". Occasionally I've written code like this:

package MyClass {

    sub entity_class    { 'ProjectName::Entity' }

    sub container_class { 'ProjectName::Container' }

    sub do_something
    {
        my $self      = shift;
        my $entity    = $self->entity_class->new( ... );
        my $container = $self->container_class->new( ... );

        $container->add_entity( $entity );

        ...
    }
}

This has a couple of benefits:

  • It isolates the names of the dependent classes into a single, overridable (even injectable) place
  • It's shorter
  • It's unambiguous to parse, in that Perl will always understand the method call used to call the constructor as a method call
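To show the "overridable" part of the first benefit, here is a runnable sketch. The bodies of the entity and container classes, along with Mock::Entity and MyClass::ForTesting, are hypothetical fillers rather than code from a real project.

```perl
use strict;
use warnings;

package ProjectName::Entity;
sub new { my ( $class, %args ) = @_; return bless { %args }, $class }

package ProjectName::Container;
sub new { my ($class) = @_; return bless { entities => [] }, $class }

sub add_entity {
    my ( $self, $entity ) = @_;
    push @{ $self->{entities} }, $entity;
    return $self;
}

sub entities { my ($self) = @_; return @{ $self->{entities} } }

package MyClass;
sub new             { my ($class) = @_; return bless {}, $class }
sub entity_class    { 'ProjectName::Entity' }
sub container_class { 'ProjectName::Container' }

sub do_something {
    my $self      = shift;
    my $entity    = $self->entity_class->new( name => 'example' );
    my $container = $self->container_class->new;
    $container->add_entity($entity);
    return $container;
}

# A subclass swaps the dependency by overriding one tiny method.
package Mock::Entity;
our @ISA = ('ProjectName::Entity');

package MyClass::ForTesting;
our @ISA = ('MyClass');
sub entity_class { 'Mock::Entity' }

package main;

my ($entity) = MyClass::ForTesting->new->do_something->entities;
print ref($entity), "\n";
```

Nothing in do_something() changed, yet the subclass produces Mock::Entity objects because the class name lives in exactly one overridable place.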

Recently I found myself thinking about Python instead. Even though Perl 5 and Python have essentially the same object system underneath (everything is a big bag of names and values and methods aren't all that special and there's essentially no distinction between a function and a method and did you know that everything you import into a class becomes part of its public interface?), the syntax differs. For example, Python doesn't have explicit constructor methods. To construct an object, call a function with the same name as the name of the class:

obj = Class()

That has the advantage that it's shorter and unambiguous. (It has the disadvantage that it looks like a function call, unless you hew to the convention that functions all start with lowercase names and class names all start with uppercase names.)

It's still shorter.

I found myself idly wondering if there were some way to make this available in Perl. (I know about aliased, but I haven't used it with something like namespace::autoclean, which is the obvious improvement.) How would that look?

I like the brevity and I like the unambiguity in parsing, but writing:

use aliased 'ProjectName::Order::Family::Genus::Species::Subspecies';
use namespace::autoclean;

my $species = Subspecies( ... );

... just feels wrong to me. It's a class method. Invoking a class method should look like invoking a method.
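For the sake of experiment, a function-style constructor like that is easy to fake by hand. This sketch does not use aliased and does not claim to show how aliased works; the Subspecies() function is a hand-written, hypothetical forwarder, and the class body is filler.

```perl
use strict;
use warnings;

package ProjectName::Order::Family::Genus::Species::Subspecies;

sub new  { my ( $class, %args ) = @_; return bless { %args }, $class }
sub name { my ($self) = @_; return $self->{name} }

package main;

# Hypothetical short alias: an ordinary function that forwards its
# arguments to the long class name's constructor.
sub Subspecies {
    return 'ProjectName::Order::Family::Genus::Species::Subspecies'
        ->new(@_);
}

my $species = Subspecies( name => 'lupus' );
print ref($species), "\n";
```

It works, and it reads exactly like a plain function call, which is the objection: nothing at the call site says a class method is involved.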

Way back in the day, the Perl 6 discussion turned to first-class classes, and we talked about anointing a new sigil. (My favorite was ¢.) I don't know how well this would work in Perl 5, but free yourself from the constraints of the actual for a moment and consider the possibilities. If we could refer to classes as first-class entities literally in source code:

  • We could give them unambiguous aliases without confusing them for functions or methods
  • Constructing an object via the class object would never be ambiguous to the parser
  • Method lookup and concomitant optimizations would be easier
  • We might be able to find more typeful optimizations
  • Similar things (method invocation) would continue to look similar, while different things (object method invocation versus class method invocation) would look different

Of course, you could just as well argue that Perl 5 doesn't need another sigil.

I haven't convinced myself that I like this idea, but it does keep coming up. It has the feel of an intriguing idea, but I can't tell whether it's good, bad, or ugly. The right approach may instead be to flatten the hierarchy of classes in the program so that names are shorter overall... but even so, the idea of first-class classes has its benefits.

I suspect, but cannot yet prove, that one of the reasons for dissatisfaction with modern programming languages, as well as one of the reasons for the calls to break backwards compatibility, is that it's difficult to predict what people will use a language for and, as such, it's nigh unto impossible to get the language's API right the first time. Reality is that which, even if your programming language will not admit it, never goes away.

(I own an otherwise good Haskell book which uses a custom mathematical notation for Haskell operators. You cannot actually type this notation and have Haskell accept your code. Instead you must flip to the back of the book for a translation table between the author's preferred typographic notation and what the Haskell language actually supports. Also the book has typos.)

Because of the mathematical foundations of programming, there's a long-standing trend to reduce any programming language to a simple and consistent and irreducible set of independent axioms. You sometimes hear of this as the theoretical axis of programming and sometimes "the formal core of a language". In theory—the land of programming language research, where the practical use of a language is less important than the degree to which a language explores a new or interesting design principle—that formal core is all important. It allows you to reason about a thing.

In practice, most programmers want to accomplish a task without having to digest sizable chunks of the Principia Mathematica. Then again, they also want to learn a few distinct ideas about the language's underlying philosophy (though only by osmosis, which is a subject of a different article) so that they can reuse those ideas in other parts of their programming.

In other words, most people approach programming from a similar point of view they arrived at from different directions. The theorists want to start from a small set of axioms and reason outward, while the practitioners want to learn only a little bit and gradually absorb the rest as they need it. (Okay, most of them don't want to absorb anything, but they do, and I'm happy to evaluate their wants based on their behaviors and not what they say.)

The result sometimes works. Other times, it leads to hasty generalizations.

Consider Perl's taint mode.

With taint mode enabled, all external data has a taint associated with it. If you use tainted data in an insecure way, Perl will complain. Before you can use this data safely, you must untaint it. (Who said Perl doesn't have a type system?)

So far so good.

How do you untaint data? You extract part of it with a regular expression capture.
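In code, the idiom looks something like the sketch below. The digits_only() function is a hypothetical name. The script runs without the -T switch so that it stays self-contained, which means tainted() from Scalar::Util reports "no" here regardless; under taint mode, with real external input, the captured value is what Perl treats as clean.

```perl
use strict;
use warnings;
use Scalar::Util 'tainted';

# Return the captured digits, or nothing if the input isn't all digits.
sub digits_only {
    my ($input) = @_;
    return unless defined $input;
    return unless $input =~ /\A([0-9]+)\z/;
    return $1;    # under taint mode, a capture is considered untainted
}

my $from_client = '010874';    # imagine this came from a request path
my $id          = digits_only($from_client);

printf "id=%s tainted=%s\n", $id, ( tainted($id) ? 'yes' : 'no' );
```

Notice that nothing in the code says "untaint". The untainting is a side effect of the capture, and that is the point under discussion.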

Back up from the practical implementation for a moment and consider the language theoretical axis:

  • We have data marked as tainted
  • We must untaint that data before using it in secure operations
  • Untainting that data implies validating it somehow
  • We can use regular expressions to assert properties about data
  • Therefore, the right way to untaint data is to apply a regular expression and capture a subset of that data

You can see the reasoning, but you can also see the leap of logic in the fourth point. Yes, you can use regular expressions to validate data, but a regex is neither the only way to validate data as trustworthy, nor is validation the only purpose of a regex capture group.

For example, a web application might provide a URI something like /studies/010874/review where 010874 is the primary key of a study table. Before using a client-supplied value in a database query, you might rightly want to validate that that key is safe.

A very simple untainting might check that that number is composed solely of digits. That might not be sufficient. (If you're using a form processing module, it might have already done this for you if you specified the input parameter as a positive integer.)

A valid id must match an actual database record. It might need to match a database record where another column (active, for example) has a true value.

You cannot (easily or sensibly; I know about executing arbitrary code in a regex, and if you call out to a database from there, you'd better be showing off and not serious) encode this kind of validity checking into a regex. There's simply not enough mechanism there to express the necessary intent.
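A sketch of the fuller check, with the regex doing only the part a regex can do. The %studies hash is an invented stand-in for the study table (a real application would query the database), and valid_study_id() is a hypothetical name.

```perl
use strict;
use warnings;

# Invented stand-in for the study table, keyed by primary key.
my %studies = (
    '010874' => { active => 1 },
    '010875' => { active => 0 },
);

sub valid_study_id {
    my ($input) = @_;

    # Step 1: the shape check, which is all the regex can express.
    return unless defined $input && $input =~ /\A([0-9]+)\z/;
    my $id = $1;

    # Step 2: the record must exist.
    my $record = $studies{$id} or return;

    # Step 3: the record must be active.
    return unless $record->{active};

    return $id;
}

for my $candidate (qw( 010874 010875 999999 abc )) {
    my $id = valid_study_id($candidate);
    printf "%-6s => %s\n", $candidate, defined($id) ? 'valid' : 'rejected';
}
```

Only 010874 comes back as valid, yet under taint mode the capture in step 1 would already have untainted 010875 and 999999 as well.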

... or the aforementioned form processing module might have performed a very simple regex and untainted the value for you already without intending to.

Likewise a date might match a date-handling regex even if that date is in the future and, in your application, invalid.

The PerlMonks thread Taint Mode Limitations makes this point as well. A separate untaint builtin—and no implicit untainting from capture groups—would allow programmers to express their intent more clearly. In the case of writing secure code, the clarity of intent seems much more valuable than the desire to reuse an existing feature and save a new keyword.

Update: I had forgotten about Taint::Util, which does the right thing for the minor cost of installing a CPAN module.

About this Archive

This page is an archive of entries from November 2012 listed from newest to oldest.
