December 2011 Archives

Interested in "The Year in Perl"?

A lot can happen in a year. Think back to 2005 and what we had and didn't have in Perl compared to now.

In previous jobs, I collected "The Year In Perl" a couple of times for Perl.com. This required a significant investment of time over a couple of days for the research and writing.

Perl.com these days is easier to update and to manage (though me carving out editing time is more difficult). What interest exists in putting together a document about the interesting developments in the Perl world in 2011?

In particular, we can concentrate on:

  • Community Events (especially significant developments such as the first or second occurrence of an event)
  • Important releases (5.14 counts, as well as big new improvements of existing projects)
  • Plans and announcements (Jesse Vincent's "Perl 5.16 and Beyond" stake in the ground, for example)
  • Products (development products, books, et cetera)

I have a small list on my own and will refine it if there's further interest. Feel free to reply here as a comment or contact me (chromatic at cpan dot org) as you prefer.

Perl Documentation in Terms of Tasks

The core Perl community—if you care to draw lines around a group of people who use Perl seriously and call that a community—is like many other core F/OSS communities. Real work happens on mailing lists and IRC. I unsubscribed from several mailing lists and deliberately spent as little time on IRC as possible this year, for various uninteresting reasons. (I haven't even made it to the Portland Perl Mongers meetings for several months.)

While that's been good for my productivity, it's also produced an interesting sense of disconnect, and that makes me wonder. Consider a thought experiment. Suppose you have six months to build a new green-field project. Your primary language is Perl. You're the only developer on the project, but you do have coworkers to do some of the non-coding work. You don't have access to IRC or mailing lists, but you do have access to the whole of the CPAN. In other words, your social connections are limited but your technical decisions are not.

In this situation, how do you find the best libraries and techniques to use for your requirements and how do you solve problems and get your questions answered?

Assume you have access to web forums such as PerlMonks and Stack Overflow and, of course, Duck Duck Go.

I can answer this partially for me: thank goodness for the degree of maturity the CPAN and its ecosystem encourages among its best projects. I have a lot of confidence in the stack I've chosen of Moose, Plack, DBIx::Class, and Catalyst, sprinkled liberally with great new tools such as perlbrew, cpanm, and Try::Tiny—but even so, the documentation and community support available without real-time discussion with contributors and developers isn't always sufficient to solve problems quickly.

(How interesting to note that all of these tools hail from a post-Perl 6 world, and how any Perl 6 implementation as it stands now only barely obviates the need for parts of two of the named projects and barely deigns to consider the others.)

For example, what's the best way to manage passwords and authentication in a Perl-based web application? Do you handle it at the Plack level or the Catalyst level? What if your user table doesn't match the example in the Catalyst authentication plugin example? How much better is bcrypt than SHA-1 or SHA-256? What if your business requirements mandate that users verify their accounts before they can log in? How do you modify/subclass/extend/advise the plugin you use to meet this requirement?

Anyone who's done a few projects with this stack should be able to give a good answer to these questions, as should anyone who's spent a few weeks in the relevant IRC channels or a couple of months reading the right mailing lists. They're not difficult questions, but they are detailed questions. You could ask the same questions about the right way to manage DBIC schemas you expect to deploy frequently while allowing for schema updates and changes.

The interesting question isn't how to accomplish these things, it's how someone finds this information without mandating access to IRC or the mailing list.

I make the assumption that it's valuable to have multiple sources of information. We write copious documentation including ::Manual and ::Tutorial PODs in our top-level distribution namespaces, after all. We do an admirable job of producing Perl Advent Calendars (thanks, Andrew Grangaard!), but I'm very glad to see Catalyst retiring its calendar in favor of monthly articles. Publishing on a schedule is difficult, but the need for current information is present the other eleven months of the year.

I wish I could say that Perl and project wikis were more useful, but they seem neither popular nor currently useful to me. Maybe I looked in the wrong places. (I know I promised to give Catalyst a list of questions about things that weren't screechingly obvious; I have a list, but I haven't shown it yet. I have patched a few parts of the Plack documentation.) Yet it seems to me that for all of the energy and output of the core Perl community, the practical non-code results tend to be directed in ephemeral directions. In the past couple of months, people such as Gabor Szabo and Christian Walde have spent a lot of time improving the search results for "Perl tutorial" by creating a central place to list and evaluate Perl tutorials.

Again, maybe I looked in the wrong places—but I'd like to see a 2012 focus on making the knowledge and experience of core project members available further, in many other media. Perl.com always welcomes your submissions of course, but that's not the only persistent and updated medium for project knowledge.

If we want people to use our code and projects for real work, to solve real problems, and to accomplish real tasks, we need to continue to provide practical code and useful documentation at or above the high quality level we currently enjoy. Yet we also have to work to approach this audience from their point of view: in particular, in terms of the tasks they want to accomplish.

That is the resolution I suggest for the Perl community in 2012.

Don't TSA That Data!

A Vanity Fair article asks "Does Airport Security Really Make Us Safer?" Fortunately, the writer of the article used Bruce Schneier as a source. (If you've been to an airport in the US, you know that the answer is "No; why would you even ask?")

The article's penultimate paragraph makes what should be an obvious point. (At least, it's obvious if you want to prevent terrorism as much as possible. If your goal is to spend lots of taxpayer money in a very flashy, showy way without worrying about efficacy, please continue.) In particular:

What the government should be doing is focusing on the terrorists when they are planning their plots. "That's how the British caught the liquid bombers," Schneier says. "They never got anywhere near the plane. That's what you want--not catching them at the last minute as they try to board the flight."

I read this article moments after sending an email commiserating about the silly (lack of) Unicode handling in a programming language which isn't Perl. Then something clicked.

One of my persistent desires for Parrot was to simplify the internals by reducing the amount of complexity and genericity in the core. In terms of Unicode, this means knowing the encoding of incoming data and the desired encoding of outgoing data, then transcoding to and from a single internal encoding. This way the core could operate on a single encoding and push the complexity of transcoding to the edges.
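The same edges-only discipline is easy to sketch in Perl 5 terms with the core Encode module (the sample string here is just an illustration): decode octets to characters at the input edge, work with one internal representation, and encode again only at the output edge.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode encode );

# Decode at the boundary: octets come in, characters circulate inside.
my $octets = "caf\xc3\xa9";                # UTF-8 encoded bytes from outside
my $text   = decode( 'UTF-8', $octets );   # the single internal representation

# ... everything inside the core sees characters, never raw octets ...

# Encode only at the other edge, in whatever encoding the output wants.
my $out = encode( 'UTF-8', $text );

print length($text), " characters, ", length($out), " octets\n";
# prints "4 characters, 5 octets"
```

The core pays the transcoding cost exactly twice, at the edges, instead of checking encodings on every string operation.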

If Parrot hasn't changed this since I looked at it most recently, its string system requires each string to carry information about its encoding (which makes each string structure that much larger, increasing memory pressure) and each string operation to check for the need to transcode strings to mutually compatible encodings (which takes time for the comparison in every case, as well as time and memory for the transcoding in other cases).

Worse yet, string literals encoded in the source code of Parrot itself tend to have a specific encoding (ASCII or at least Latin-1 in the case of literals in the C code) and they ought to be constant, so transcoding in place isn't an option and, if you're working primarily with another encoding, that means always performing transcoding from that incompatible encoding.

It's not free to perform encoding at the edges, and you sometimes notice this when working with large chunks of data (though if you're processing multi-terabyte satellite images, treat them as binary and skip this encoding altogether), but it's the right thing to do.

The same principle applies for trusting incoming data. Secure it at the borders of the application. Don't spread those checks throughout the system. Harden the edges and don't let nonsense through. Fail early for suspicious things.

Otherwise you'll go mad trying to track down all of the possible interactions and malicious acts that people could perpetrate if you lack a sane sanitization policy. In other words, stop doing a lot of busy work to make it look like you know what you're doing. Do it right.

One of the persistent questions which keeps entrepreneurs on the edge is "Are we building the right thing?"

In the first web bubble, the Silly side of Silicon Valley chased vanity metrics such as "the number of eyeballs on the site" and "brand awareness" and "unique visitors". Those numbers are only interesting when you can correlate them to producing value for customers and bringing in real cash in the form of revenue.

I've enjoyed the book The Lean Startup by Eric Ries because he offers a much better mechanism to track the success or failure of any attempt to produce real value to customers. While split testing (or A/B testing) is useful to see how small changes lead to different customer behaviors, Ries recommends cohort analysis, where you can see the behavior of real customers through the sales funnel and correlate the X-axis with individual changes to your business or product.

That means tracking customer behavior. If you're building some sort of software as a service product, and if the mechanism of delivery of that product is primarily a web site, you probably already know the punchline.

Assume I already know how to identify and log events for each salient customer action type. (I've built that kind of system before.) Assume I don't want to collect personally identifiable information (I don't). Assume I'm using Plack and its middleware heavily, and assume I'm happy using Catalyst as a web framework.

How can I identify unique users (with and without accounts) on a daily basis, anonymize them, but group their actions across the site such that my automated daily cohort graphs correspond with reality?

So far I've identified a few points of possible contention. I can rely on browser cookies for unique identification of users if I know that user sessions have unique identifiers within a 24 hour period. (I could generate GUIDs for this, but that may be overdoing things.) I think I also have to track the transition from anonymous visitor to authenticated user, but I might be able to convince myself that either replacing the current session or simple subtraction of successful login events from the total number of unique anonymous visitors would give the right numbers.
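One core-only way to get daily identifiers that are unique but anonymous is to HMAC the session identifier with a server-side secret and the current date. The helper name and inputs here are my assumptions for illustration, not an existing API:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw( hmac_sha256_hex );
use POSIX       qw( strftime );

# Hypothetical helper: turn a session cookie value into a stable,
# anonymous daily token. The same visitor maps to the same token all
# day, the raw cookie value never reaches the logs, and tokens cannot
# be correlated across days (the secret never leaves the server).
sub daily_cohort_id {
    my ($session_id, $secret) = @_;
    my $day = strftime '%Y-%m-%d', gmtime;
    return hmac_sha256_hex( $session_id . $day, $secret );
}

my $token = daily_cohort_id( 'session-cookie-value', 'server-side secret' );
print length $token, "\n";   # 64 hex characters, no PII in sight
```

Counting distinct tokens per day gives unique visitors; grouping log events by token gives per-visitor action streams for the cohort graphs.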

(I also haven't dived much into how Catalyst 5.9 and Plack interact in terms of session and cookie handling. Everything's just worked, so I've ignored the details until now.)

I don't mind building such a system if necessary, but if all of the pieces are out there and available—or if someone's already built this and can give guidance—so much the better.

Have you solved this problem? If so, how did you do it? If not, how would you do it? Would you handle logging at the Plack level or the application level? Would you worry about tracking session changes? Does Catalyst need to know about this?

When Print Debugging Fails


I have a medium sized project which is effectively a state machine. While I keep promising to write a reusable modular system which lets you specify the states and the transitions between them and lets behavior manage itself, I haven't done that yet.

This means that occasionally I have to debug the transition logic.

Suppose I have a series of articles in a publication queue, and suppose each article has a state() method accessor/mutator. Moving an article between states (from SOLICIT to EDIT to PREVIEW to PUBLISHED) means calling state() and passing a token which represents the appropriate state.

Because I haven't yet consolidated all of the transitions into a single place, an article's state may change in any of half a dozen places in the entire codebase. That's not awful, but if state transitions are not occurring as I expect, that's multiple places to watch as I debug.

I rarely use the Perl debugger. (I'm a fan of debuggers for compiled languages such as C, and I've used debuggers in IDEs for languages which require IDEs to great success, but I've never found Perl's debugger productive.) I usually annotate my code with log messages and bisect problems that way.

This seemed easy today; use Moose advice to surround the state() method and display some logging information. (Shouldn't this be a pattern already? Certainly there must be something on the CPAN to accomplish this.)

around state => sub
{
    my ($orig, $self, @values) = @_;

    return $self->$orig() unless @values;
    my $original = $self->$orig();
    my $title    = $self->title;
    my @caller   = caller(2);

    print STDERR "Setting '$title' from $original to $values[0] " .
                 "from $caller[1]:$caller[2]\n";
};

If you already see the bug, you're doing better than I am today. After five minutes of head scratching, and looking elsewhere, I figured out why my logs showed the first transition happening successfully but nothing else happened.

The moral of the story is to be very careful what you measure, lest you change that which you observe... or in my case, fail to allow that change to occur.
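For the record, here's the repaired advice, sketched without Moose so it runs standalone: a plain symbol-table wrapper over state() that does what my original forgot, passing @values through to the wrapped method. The Article class here is a stand-in for the real one.

```perl
#!/usr/bin/perl
package Article;
use strict;
use warnings;

sub new   { bless { state => 'SOLICIT', title => $_[1] }, $_[0] }
sub title { $_[0]{title} }

sub state {
    my ($self, @values) = @_;
    return $self->{state} unless @values;
    $self->{state} = $values[0];
}

package main;
use strict;
use warnings;

# Wrap Article::state the way Moose's around() would.
{
    no warnings 'redefine';
    my $orig = \&Article::state;

    *Article::state = sub {
        my ($self, @values) = @_;
        return $self->$orig() unless @values;

        my $original = $self->$orig();
        my $title    = $self->title;
        my @caller   = caller(0);   # no Moose layers here, so 0, not 2

        print STDERR "Setting '$title' from $original to $values[0] "
                   . "at $caller[1]:$caller[2]\n";

        # the crucial fix: pass @values through to the original method
        return $self->$orig(@values);
    };
}

my $article = Article->new('Frozen Assets');
$article->state('EDIT');
$article->state('PREVIEW');
print $article->state, "\n";   # the state now actually changes: PREVIEW
```

Without that final line, the wrapper logs the intended transition and then silently discards it, which is exactly what my logs showed.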

The Catalyst web framework uses Perl 5 function attributes effectively—I've seen few more effective uses of attributes.

Any modern web framework has to deal with the idea of routes and request routing somehow. Given a request path (such as /stocks/AA/view_analysis), how does your application know what to do?

Catalyst solves this elegantly with a feature known as chained actions. Controller methods can consume zero or more parts of the path but, when explicitly chained, can combine. Consider the example request path. The controller is Stocks.pm. The second component of the path (/AA) is the identifier for a stock (Alcoa, to be specific. I'm neither long nor short on Alcoa itself, though I probably own some shares as part of a fund somewhere.) The final component of the path, /view_analysis, is an action—a verb representing an action the controller should take on the object representing Alcoa in the system.

You can probably start to see the idea of the chain right away.

The Stock controller has a controller method called get_stock which grabs the stock symbol from the request path, looks it up in the database, and stores the object representing that stock for further processing. If no such symbol exists, it throws an exception.

The view_analysis method chains off of the get_stock method such that Catalyst will only dispatch to view_analysis when it's already successfully dispatched to get_stock. Unless you write a custom dispatch system which bypasses the dispatch rules, users will never be able to call view_analysis without a valid stock object available.

(Further, these methods are part of a chain which requires that users have successfully logged into the system; they chain off of a user authentication system.)

In code terms, the relevant attributes look something like:

sub authorized :Chained('/login/required') :PathPart('stocks') :CaptureArgs(0);

sub get_stock :Chained('authorized') :PathPart('') :CaptureArgs(1);

sub view_analysis :Chained('get_stock') :PathPart('view_analysis') :Args(0);

The :Chained attribute is most relevant here. :PathPart governs how Catalyst's dispatcher makes each method visible to user requests (get_stock doesn't consume a part of the path on its own, while authorized consumes the name of the controller and view_analysis consumes its own name). :CaptureArgs and :Args control how many other pieces of the path the methods consume; in the case of get_stock, it's the single path element between /stocks and any subsequent chained actions—in this case, /AA. As view_analysis is the end point of a chain, you use :Args instead of :CaptureArgs.

With that all explained, request method chaining is fantastic. I can reuse get_stock() for other request methods and get all of its benefits, including the fact that only authorized users can even reach this point.

Yet I want to prove these characteristics of my application.

I want to prove these features so definitively that I don't want to write tests for them. I want my program to fail to compile if these characteristics are untrue.

I see chaining from get_stock() as supplying an invariant precondition to view_analysis() such that it proves, to my satisfaction, that I can always rely on a valid stock object being available within the analysis method. Always. Similarly, I can always rely on a valid user being available within both methods. Always always.

The problem comes in that it's easy to make a typo in the name of a chain or a method, or to use :CaptureArgs instead of :Args or vice versa.

Here's the thing: all of this metadata is metadata. All of this information is available at compile time, before Perl has to execute anything.

If I had a really good and extensible type system in Perl 5, I could write a couple of pieces of predicate logic to say that every chained method should be a starting point or have a valid predecessor. These are trivial properties of my program (no matter how large it gets) and they're resolvable with the information available at the point of compilation. Even with complex controller construction through the use of roles and parametric roles, this information is available.

I know how to emulate this behavior by injecting some sort of CHECK block into the code and schlepping through the symbol table and inspecting attributes myself, but that's emulating a useful feature we could exploit in a lot of ways.
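The emulation is smaller than it sounds. This sketch (controller and chain names invented; this is not Catalyst's dispatcher) claims the :Chained attribute with MODIFY_CODE_ATTRIBUTES at compile time, then verifies every chain parent before anything runs. I use UNITCHECK, the per-compilation-unit sibling of CHECK, so it also fires under require and eval:

```perl
#!/usr/bin/perl
package MyController;
use strict;
use warnings;
use B ();

my @chained;     # [ coderef, parent name ] pairs gathered during compilation
our $verified;   # set once the check has run

# Perl calls this for each attributed sub as it compiles; claim the
# Chained attribute and return anything unrecognized.
sub MODIFY_CODE_ATTRIBUTES {
    my ($class, $code, @attrs) = @_;
    my @unhandled;

    for my $attr (@attrs) {
        if ($attr =~ /^Chained \( ' (.+) ' \)$/x) {
            push @chained, [ $code, $1 ];
        }
        else {
            push @unhandled, $attr;
        }
    }

    return @unhandled;
}

sub authorized    :Chained('/')          { }
sub get_stock     :Chained('authorized') { }
sub view_analysis :Chained('get_stock')  { }

# Runs after this unit compiles, before execution: every chain parent
# must name a real method, or the program never starts.
UNITCHECK {
    for my $pair (@chained) {
        my ($code, $parent) = @$pair;
        my $name = B::svref_2object($code)->GV->NAME;

        next if $parent =~ m{^/};   # chain roots resolve elsewhere
        die "Chain parent '$parent' of '$name' is not defined\n"
            unless MyController->can($parent);
    }
    $verified = 1;
}

package main;
print "chains verified before runtime\n" if $MyController::verified;
```

Misspell 'get_stock' in the :Chained attribute and the program dies before a single request dispatches, which is the property I want the type system to give me for free.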

Forget the talk about making Perl into Java or C++ by adding a silly manifest static type system. We could find and fix real errors in logic—trivial errors, trivially discoverable—if we had an extensible type system which let us define our own simple predicates.

(Implementing such is left as an exercise for a small army of readers cloned from a very small army of brilliant p5p hackers with copious spare time and a habit of reading ACM papers before breakfast.)

Track App Progress with Writeable $0


In Perl 5, $0 is the magic superglobal which contains the name of the program being executed. This is the name you see in the output of ps or in the top utility.

Some clever programs provide several symlinks to the main program and examine $0 to enable or disable certain behaviors. This is an easy way to hide the details of execution from users while making those behaviors mnemonic.

I usually don't write those kinds of programs, but this past year I've written several batch processing programs which have several interdependent states. For example, one program runs from cron regularly to run through a pipeline of behaviors. Data moves through that pipeline; it's basically one big state machine.

The core of the program is a pipeline manager which runs the appropriate processing stages in order, such that on every invocation, the program moves data through at least one stage and potentially every stage. It doesn't have to move everything through the pipeline all in one invocation, but it does have to make progress on every invocation.

For various uninteresting optimization and locking reasons, I made this program a single execution unit. (I do use asynchronous IO for things like network access, but that's because the program is largely IO bound.) The program also has copious logging of the stage traversal, split between one log which tracks stage transitions and timings and stage-specific log files which have more details on the progress of those stages.

Until a few minutes ago, the easiest way to see the program's current stage was to tail the top-level log file. While running some live tests on a new feature, I found myself with free time and the desire not to switch back and forth to a tail -f screen again, so I checked the documentation for $0 again.

I knew that on certain platforms (GNU/Linux, which makes my life easier) you can actually write to it. If you do this, you can control what appears in the output of ps and top.

Every stage runs from a closure (shades of Plack):


    my $sub    = sub
    {
        my ($self, $config) = @_;
        my $log             = $self->get_fh_for_step( $config, lc $app );

        # show app stage in ps output
        local $0 = $app;
        my $app  = $module->new(
            logger   => $log,
            map { $_ => $config->{General}{$_} } @keys,
        );

        $app->run;
        $log->log( sprintf $message, $app->count ) if $app->count;
    };

A loop in the pipeline manager creates a new closure over the name of the module which implements each stage; the closure creates the stage object, sets up the logger, provides the appropriate configuration, and runs the stage. The local $0 = $app; line is the change I made.

Right now, my top window shows that the image processing stage has just given way to the report writing stage—and now the program has exited. In a couple of minutes, everything will start again.

Writing this entry took longer than implementing this feature. Five minutes of experimenting has improved the visibility and monitoring of this program immensely. Maybe it'll help you.
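Condensed into a runnable sketch (stage names, counts, and the run_pipeline interface are invented for illustration; the real program's API differs):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run each stage in order, with $0 localized to the stage name so that
# ps and top show the pipeline's current position.
sub run_pipeline {
    my @stages   = @_;
    my $progress = 0;

    for my $stage (@stages) {
        local $0 = "pipeline: $stage->{name}";   # visible in ps output
        my $count = $stage->{run}->();
        print "$stage->{name}: processed $count item(s)\n";
        $progress += $count;
    }

    return $progress;
}

my $total = run_pipeline(
    { name => 'fetch',   run => sub { 2 } },
    { name => 'process', run => sub { 2 } },
    { name => 'report',  run => sub { 0 } },
);
print "total: $total\n";
```

Because $0 is localized, each stage restores the previous program name when it finishes, so a crash mid-stage still leaves a truthful ps entry.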

If you're not using perlbrew to manage your Perl installations, you're missing out. You can leave the moldy old system Perl 5 as is and use a supported and modern version of Perl 5 for your current applications. You can even use separate installations for separate applications.

The only drawback I've ever had with perlbrew is that upgrading my main development version of Perl 5 in place (from 5.14.0 to 5.14.1 to 5.14.2) required the reinstallation of a few hundred CPAN distributions. That shouldn't be an onerous task, but it is for two reasons:

  • I don't have a canonical list of every dependency in every active project in a single place; I've had to make every project reinstall its dependencies, which is busywork
  • It's unnecessary, as minor releases of Perl 5 compiled with the same characteristics (64-bit, non-threaded in my case) are binary compatible with each other, so XS components are compatible

I've noticed that binary distributions of Perl 5 tend to share @INC directories between versions, so why not perlbrew?

As it turns out, this is a compilation option to Perl 5 itself. The -D otherlibdirs option adds a directory to @INC as included within the perl binary itself. perlbrew allows you to pass -D options, so my invocation looked like:

perlbrew install -j 9 perl-5.14.2 \
    -D otherlibdirs="$PERLBREW_ROOT/perls/perl-5.14.1/lib/site_perl/5.14.1"

The -j 9 option performs a parallel build and test.

Sharing the site_perl directories between 5.14.1 and 5.14.2 saved me a lot of time, but I did notice one caveat: any distribution which installs helper programs (Pod::PseudoPod::Book, for example) doesn't automatically make those helper programs available. I had to fix that by installing those distributions by hand. That was the work of seconds, not hours, so it's still an improvement.

Again, this trick only works if the new build has the same binary characteristics as the old build. If you're using the same build options on the same machine for a Perl 5 in the same major version family, you should be fine.

Controlling Test Parallelism with prove


If you're fortunate enough to have a test suite which allows parallel execution, one small prove feature can save you a lot of time.

prove, of course, is a relatively new utility included with Test::Harness and TAP::Harness. It handles many of the little details of running test programs and collecting and reporting the output; it's one of those utilities that looks really silly before you use it, and then becomes indispensable within a week.

prove of course has an option to run parallel tests. The -j# option allows you to specify how many test files to run at once. I've had good success with -j9 on my desktop machine; the right number depends on your tasks, the number of cores available, the amount of memory used by each process, and the runtime characteristics of each process.

prove's -l option adds the relative lib/ directory to Perl's include path, so that you can test pure-Perl code without running it through a build cycle or without having to add use lib '...'; lines to your test files.

The -r option searches a given directory recursively for .t files.

Thus, the command prove -lr -j9 t/ runs all of the .t files found under t/, up to nine at a time, and prefers modules found under lib/. This is useful.

Of course I have a shell alias with one more feature:

alias proveall='prove -j9 --state=slow,save -lr t'

prove's state flag saves information about the tests run. If you save state, subsequent runs can use that information to determine how to run tests again.

I often have several types of tests, especially for code with user interfaces and data models. The data model tests exercise business logic, and the UI tests exercise control flow and error handling. Usually the business tests take the longest to run—and usually only one or two test files take the most time. When prove saves the state of the test run, it can schedule those slow tests first so that the fast tests can run in the spots where the slow test blocks.

Again, this all depends on your workload. Much of my code is more IO bound than CPU bound. I've seen slow tests take 20% or more of total suite execution time after everything else has finished just because they have so many points where they have to wait.

I regularly have test suite times under 30 seconds (often closer to 10 or 12 seconds) on moderately large projects because I can exploit easy opportunities for parallelism. Certainly the right tweaking and scheduling could get me more benefit, but running proveall and making sure that parallelism is possible from the start gets me most of that benefit with almost no additional work.

(This isn't solely an academic obsession; in my measured personal experience, the more often I can run the entire test suite, the easier it is to find and fix bugs. I won't go as far as to say that continuous integration is a crutch, but if you're using CI and can't run the most important tests covering most of your code in 30 seconds, you're shortchanging yourself.)



About this Archive

This page is an archive of entries from December 2011 listed from newest to oldest.

November 2011 is the previous archive.

January 2012 is the next archive.

Find recent content on the main index or look in the archives to find all content.

