May 2010 Archives

How to Parse Perl 5 on the JVM


I have a simple rule for judging the accuracy of a newspaper, a periodical, a television program, and any other mechanism intended to report events accurately. If I read a report about an event I've attended or a subject which I know in detail and I find factual errors that modest research would have corrected (errors of opinion are fine, though why do even the mothers of Paul Krugman and Thomas Friedman read their nonsense?), I assume that said venue is as wrong about subjects about which I know even less.

Yes, it's been three months already, and the PERL IZ IMPARSABLES!!!! subject has come up yet again.

This time, a message from late last year lamenting the fizzling out of a project to port Perl 5 to the JVM spurred an awful Reddit headline (but I repeat myself).

Everything I wrote in On Parsing Perl 5 still applies. If you haven't read and understood it, opining about what Perl 5 does and doesn't do and where this does and doesn't apply contributes only to the endless 4channing of the Internet, and you should stop doing so.

With that said, with everyone now understanding that only undeclared barewords, modified operand contexts, and unparenthesized arity changes have any bearing on the successful static parsing of arbitrary Perl 5 code, the problem isn't as difficult as the email makes it sound. It's not trivial, and I'm not volunteering to do it, but it's not impossible and it doesn't require reimplementing Perl 5's parser in Java.

The secret is indirection.

Consider the code:

say foo 1, 2, 3;

Without any prototype on foo(), Perl 5 considers it a listary function which slurps up the arguments 1, 2, and 3. Its return value, if any, is the single argument to say.

With a prototype on foo(), the proper interpretation might be to consume only one argument, or only two arguments. At this point some people look at the static parsing possibilities of Perl 5 and throw up their hands. Yet the only place where any Perl 5 implementation really needs to know how many arguments the bare-word foo without parentheses needs to consume is at the point of execution...

... at which point foo() has a prototype, whether that's the implicit "No one has given it a prototype, so it's listary!" default or a declared prototype. In other words, all a static parser needs to do when it encounters a syntactic situation like this, where the runtime behavior is uncertain, is emit code that can look up foo's prototype when it comes time to evaluate the containing expression.

(A polymorphic inline cache could even make such code efficient, unless you want to allow code rewriting... but optimization takes place after the proof of concept works.)
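A minimal sketch of that indirection, written in Perl 5 itself rather than as JVM bytecode. The helper name call_deferred is hypothetical; it stands in for the code a static parser might emit. It assumes the parser has already collected the flat operand list that followed the bareword, and it reduces prototypes to a count of `$` slots, which ignores most of what real prototypes can express.

```perl
use strict;
use warnings;

# Two functions whose arity the "parser" pretends not to know in advance.
sub foo    { return scalar @_ }       # no prototype: a list operator
sub bar($) { return "bar($_[0])" }    # prototype: exactly one scalar

# Hypothetical stand-in for emitted code: look up the prototype only when
# the containing expression is evaluated, then split the operands.
sub call_deferred {
    my ($name, @operands) = @_;

    my $code  = do { no strict 'refs'; \&{$name} };
    my $proto = prototype($code);

    # No prototype at all: the function slurps every operand.
    return $code->(@operands) unless defined $proto;

    # Simplifying assumption: arity is the number of '$' slots.
    my $arity = () = $proto =~ /\$/g;
    my @taken = splice @operands, 0, $arity;

    # Leftover operands pass through to the enclosing list operator.
    return ( $code->(@taken), @operands );
}

my @listary = call_deferred( 'main::foo', 1, 2, 3 );
my @unary   = call_deferred( 'main::bar', 1, 2, 3 );
```

Here @listary holds what say would receive from foo 1, 2, 3 with no prototype, and @unary holds what it would receive if the function took a single argument and left 2 and 3 for say.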

Your syntax highlighter still might get this wrong, but executing Perl 5 code correctly in this case is still possible even if you don't evaluate code during compilation. One caveat to this approach is that certain compilation errors are difficult—the strict pragma's distaste for undeclared barewords, for example—though there are ways around that, as well.

Even so, as the mailing list message suggested, eliminating (euphemistically, discouraging) certain Perl 5 syntactic constructs would make parsing Perl 5 much more effective. I'm not sure how to handle source filters, for example.

With all of this said, the goal of not writing a Perl 5 parser for Perl 5 on the JVM is still troublesome, because any reasonably complete Perl 5 implementation still has to support string eval.

(See also A Modern Perl Fakebook.)

I needed to extract all hyperlinks from an HTML document today, and I needed to remove all markup except for the simplest formatting: paragraphs, emphasis, and bold. Any experienced Perl 5 programmer knows that multiple CPAN distributions exist for doing just this. You can choose whether you want an XS wrapper around an existing C library, or a smattering of regular expressions, or a DIY HTML parser, or a simple wrapper around an HTML parser.

You can know that, but that doesn't tell you how to do it.

I did my research and decided on HTML::Scrubber to remove markup. Its documentation suggests a strategy more complex than my current needs, so I eventually produced:

my $scrubber = HTML::Scrubber->new(
    allow => [qw( p br i u strong em hr )],
);
my $scrubbed = $scrubber->scrub( $content );

For extracting hyperlinks, I used HTML::LinkExtor and wrote:

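sub get_links
{
    my ($self, $content) = @_;

    my $p = HTML::LinkExtor->new();

    # feed the document to the parser so that links() has something to return
    $p->parse( $content );
    $p->eof();

    my @links;

    for my $link ($p->links())
    {
        # each entry is [ $tag, attribute => value, ... ]
        my ($tag, %a) = @$link;

        next unless $tag eq 'a' && $a{href} && $a{href} =~ /^http:/;

        push @links, $a{href};
    }

    return \@links;
}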

I don't mind doing the research and customizing snippets like these for my specific needs, but I can imagine countless other people needing examples like this. If I weren't already convinced that the world needs a new resource for copy and paste examples in Modern Perl, I would be.

(If you're still not convinced, consider how much more easily a novice could find these examples than writing a correct and comprehensive regular expression for either case.)

Whenever someone suggests that innovation is the sine qua non of anything in the world of technology, lock your doors, put one hand on your wallet, and send the kids inside. (Me, I also close the tab and delete the email.)

Consider your privacy. Consider a company whose sole business purpose is to sell targeted advertising by extracting, sharing, and mining personal information from millions of people. (Do I mean Google or Facebook?) A fan of free markets and free information might say "Whether any individual finds that objectionable, an informed citizen should have the right to decide whether to participate."

The assumption is that citizens should be able to know what information the business shares, with whom, and why. It's easy to understand why. Even though it's funny to laugh at a petty criminal caught by posting a stupid update which gives the authorities reasonable suspicion of a crime, it's unsettling to imagine an abused and estranged spouse who's dutifully followed the guidelines for not allowing personal information to escape a hand-vetted circle of trustworthy family and friends fall victim to an unannounced, opt-out rule change which allows the abuser to perpetuate the abuse.

The parallel to Perl is subtle, but important. If you want to improve your software, sometimes you have to make incompatible changes (or learn how to predict the future, or write software so trivial that you always get it right the first time), but changing the world out from under users is irresponsible and unethical.

The difference between managing your privacy and safety in a world where mass communication is cheap and easy and having to change your use of a long-deprecated feature is vast, but in both cases there's a simple ethical concern: retroactive and arbitrary and unannounced and forced changes are wrong.

That's why My Contrarian Stance on Facebook and Privacy is very, very wrong:

[Let's] not make privacy a third rail issue, pillorying any company that makes a mistake on the privacy front. If we do that, we'll never get the innovation we need to solve the thorny nest of issues around privacy and data ownership that are intrinsic to the network era.

Don't let the shiny of the Internet hero du jour fool you. The technical world needs less "innovation". (I'll take less of the kind of "innovation" which suggests that if I didn't want to share information publicly last month, I suddenly want to this month without you even asking.) The technical world needs more grownups who don't get distracted by multibillion market caps, the inherent sexiness of enterprise software, and whatever His Jobs announces next week. The technical world needs people who treat other people ethically. If that means pillorying entities which act unethically, then so be it.

Caleb Cushing has suggested multiple times that developers of free software should consider support an obligation and make support a priority.

I can agree with that as a categorical imperative, but I can't agree that releasing free software induces a requirement to do so. For example, any distribution you upload to the CPAN should contain a comprehensive test suite which suffers no false negatives and offers no false positives. Yet not all CPAN distributions do so. CPAN itself requires no test suite, and plenty of useful CPAN distributions lack comprehensive test suites, and few CPAN distributions have never suffered from false negatives.

Certainly the utility of CPAN would increase if our test suites trended toward perfection, but requiring perfection would likely suppress the utility of the CPAN in the long term.

Ever since I gave up the dull periods between crises of system administration for what has become my career (and a mortgage and family obligations and hobbies which do not require me to sit in front of my computer all night long), I have had to prioritize how I spend my time. Sometimes I add new features. Sometimes I apply patches. Sometimes I revise documentation.

Yet the fact that I wrote a long-dead templating system in 1998 or my own test framework in 2000 or even Test::Builder in 2002 in no way obligates me to neglect mowing my lawn in favor of adding a feature anyone requests in 2010.

You can tell. Read the disclaimer of warranty in the license. I hope my code is useful for you, and I intend it to be useful, but I can promise neither its utility nor its suitability for your purposes.

In return for the risk you take on using software written and maintained by someone as capricious and unpredictable in schedule and interest as myself, you get its complete source code, you get (in many cases) read access to the repository where I develop the work, access to the bug tracker and mailing list and forums where I discuss the work, and you get the right to fork and maintain it yourself.

In my mind, that's a fantastic trade to make. The Perl community has survived deaths, job losses, family crises, births, flamewars, resignations, forks, other languages, trolls, test failures, catastrophic installation failures, and even ExtUtils::MakeMaker.

Community-driven software means we don't have to suffer the whims of profitability or market changes or personnel changes or trade secrets or market segmentation or duplication or competition. It means that we, collectively, have the power to make our lives easier with software. Sometimes that means changing how we develop software. Sometimes that means changing how we support software. In this case, I believe it means changing our will: no longer should we act as if software is a resource produced by someone else and we are mere consumers. We should act as if we are equals in producing software, because we are.

Perl and the Least Surprised


Imagine you're the language designer today. What should this code produce?

my $x = 1;
my $y = 2;

print $x + $y;

What should this code produce?

my $x = 'hello';
my $y = 'world';

print $x + $y;

Easy, right? It gets more fun:

my $x = '10';
my $y = '20';

print $x + $y;

How about:

my $x = '99';
my $y = 'bottles of root beer on the wall';

print $x + $y;

If I were cruel, I'd suggest an example such as:

my $x =   10;
my $y = 0.10;

print $x + $y;

After all, consistency is important.

Here's something even stranger:

my $x = (77, 'seventy seven')[rand 2];
my $y = (99, 'ninety nine')[rand 2];

print $x + $y;

If you object to non-deterministic static typing in tuples, consider:

my $x = readline();
my $y = 'is my extension';

print $x + $y;


my $x = readline();
my $y = 99;

print $x + $y;

Or (because you can argue that simple variable interpolation is syntactic sugar for repeated string catenation):

my $x = readline();
my $y = readline();

print "#$x is $y's jersey number";


my $x = readline();
my $y = readline();

print $x + $y;

Without manifest typing, how do you design your language for the least amount of surprise in these cases? More importantly, after you give all of your answers, ask yourself "Why?" and don't stop with "Language X does it this way, which is obviously correct," as the designers of Language X had to ask these questions too.

(For bonus points, defend the thesis "Strings are obviously arrays of characters!" in light of the polymorphic overload of addition in the integer/rational case.)
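For comparison, Perl 5's own answer is to let the operator, not the operand, choose the type. The sketch below only records what Perl 5 does with the literal cases above; it is not an argument that this is the least surprising design. The numeric warnings category is switched off here so that the "isn't numeric" diagnostics don't obscure the results.

```perl
use strict;
use warnings;
no warnings 'numeric';    # Perl would otherwise warn about non-numeric strings

# '+' always treats its operands as numbers...
my @sums = (
    1 + 2,
    'hello' + 'world',
    '10' + '20',
    '99' + 'bottles of root beer on the wall',
    10 + 0.10,
);

# ...while '.' always treats them as strings.
my ( $x, $y ) = ( 10, 20 );
my @joined = (
    $x . $y,
    '99' . ' bottles of root beer on the wall',
);

print "@sums\n";      # 3 0 30 99 10.1
print "$joined[0]\n"; # 1020
```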

A Modern Perl Fakebook


You don't really know a language until you understand its libraries.

You can learn Smalltalk's syntax in an afternoon, but you won't be able to do much with it until you learn its idioms and how it's organized and what's where and how to use it.

You can dabble in a hundred languages—a new language a month for the rest of your professional life—but if you can't write anything more substantive than "Hello, world!" or the first program I ever wrote on my own, you don't really know them:

30 PRINT "  "
50 GOTO 20

(I haven't proven that program bugfree or idiomatic, merely correct.)

Understanding how a language works is necessary to programming well in that language, but it is not sufficient. I read the Camel book cover to cover, but I had to read the Perl Cookbook before I could program Perl practically. (I also needed a couple of years of experience before I could call myself an adept programmer, but the only way to get that experience is through experience.)

As Simon Cozens wrote in the introduction to the second edition of Advanced Perl Programming, "advanced Perl programming has become more a matter of knowing where to find what you need on the CPAN, rather than a matter of knowing what to do." His edition reflected that, and I've used that book on two projects this month. Unfortunately, the book is five years old and could only cover a fraction of the CPAN.

It's no secret I'm looking for projects after Modern Perl: The Book comes out. I'm trying to convince Stevan Little and Chris Prather to write a book about the obvious (and if you'd like to read it, please encourage them to do so). I think it's time for a new book on Perl and testing.

I also keep thinking of a bigger goal. The Perl Cookbook is seven years old and doesn't even cover Perl's testing revolution. Don't expect a third edition soon or ever (sic transit gloria animalia libri).

Yet what if there were a wiki of modern Perl idioms or modern Perl solutions which focused on the use of CPAN distributions and generally stayed up to date with both new versions of Perl and new software as it came out? Think of it as an expanded version of Task::Kensho with explanations and sample code crosslinked and organized by topic. I'd use such a resource, if it existed.

I also admit, I'd love to publish a book drawn from that wiki. I'd happily edit the prose. I'd donate all author royalties to TPF, and I'd even make a raw PDF available. This author model worked well for other books such as the Python Cookbook. Would anyone else like to see it for Perl?

In another life, I edit novels. I annoy friends and family by pausing the DVR or the DVD, pointing at the screen, and saying "Notice that? Here's how the show will end." I cut adverbs. I tighten dialog. I laugh when the robot devil in the final broadcast episode of Futurama complains that Fry's expository dialog makes him feel so angry.

Then I write about Perl, I read about Perl, I write Perl, I maintain Perl, I edit Perl, and I edit writings about Perl, and bless everyone's heart for participating in Perl Iron Man and publishing their code to the CPAN and tweeting and denting and social networking about how we're getting great new features and regular releases and interesting tools and better frameworks and more documentation (and, yes, fresh books again)... but if I never read another "Perl Isn't Dead" or "Perl; Still Alive" post, that's really okay.

The robot devil has a point. It's clumsy to announce how your characters feel:

"I'm angry," said Angry Bob, angrily.

It's more effective to demonstrate how they feel:

Bob's eyes narrowed and Glenna had to strain to hear his tense whisper. "You think you're so smart, but you don't know anything about me."

We can all write "We do cool things with Perl all day, really! Really really!" but it's more effective to show the cool things we do with Perl. It's more effective to demonstrate that Moose and MooseX::Declare are powerful, effective technology. It's easy to point to 28 (almost 29) releases of Rakudo that you can download and install now and say "Perl 6? Yeah, you can use it today." It's fantastic to send someone a link to Strawberry Perl and say "All of CPAN is available for you right now."

Any one of those things is worth ten press releases that we're still around in our little corner of the programming universe, because we're supposed to be pragmatic and practical programmers who spend our time Getting Stuff Done. Buzz is nice and buzz is good, but I'll take an Adam Kennedy or an RJBS or a Tatsuhiko Miyagawa or a Karen Pauley or an Ingy over a thousand press releases any time, because I can point to almost anything any of those people have done as evidence of a brilliant, vibrant, active community.

Emphasizing that makes me happy.

I'm working on a small project today. Part of that project requires fetching syndication feeds and enqueueing further work if those feeds have new items. That means detecting whether those feeds have new items, and it also means polling the sites with those feeds frequently.

These are simple, well-understood problems, with well-understood solutions.

I don't want to poll sites more frequently than they allow, so I'm happy to use LWP::RobotUA to fetch the feeds, as it respects the robots.txt protocol for well-behaved spiders.

I also want to skip processing if the feeds remain the same between fetches, so I want to use LWP::UserAgent::WithCache, which checks HTTP headers such as Last-Modified/If-Modified-Since and ETag/If-None-Match for modifications.
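The header pairs named above work as a question and an answer: the client sends back the validators it saw last time, and a 304 Not Modified status means nothing changed. The following is a rough sketch of that exchange in plain Perl, not the implementation of LWP::UserAgent::WithCache. The %cached record, conditional_headers, and feed_changed are all hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical cache record for one feed: the validators from the last fetch.
my %cached = (
    etag          => '"abc123"',
    last_modified => 'Sat, 01 May 2010 12:00:00 GMT',
);

# Build the request headers that ask "has this changed since I last looked?"
sub conditional_headers {
    my ($record) = @_;
    my %headers;

    $headers{'If-None-Match'} = $record->{etag}
        if defined $record->{etag};
    $headers{'If-Modified-Since'} = $record->{last_modified}
        if defined $record->{last_modified};

    return \%headers;
}

# 304 Not Modified means the cached copy is still current: skip processing.
sub feed_changed {
    my ($status) = @_;
    return $status == 304 ? 0 : 1;
}

my $headers = conditional_headers( \%cached );
```

A feed with no stored validators produces an empty header set, which turns the request back into an unconditional fetch.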

Unfortunately, both are subclasses of LWP::UserAgent, and both expect to be at the same level of a complex inheritance hierarchy which forms all of LWP in Perl.

Here is the object lesson for people designing software. If you intend other people to reuse your software as components, such that you can't predict how other people will use it, remove as many unnecessarily hard-coded dependencies as possible.

If I were to redesign this part of LWP, I'd make the caching behavior and the robots.txt-respecting behavior into separate behaviors, perhaps runtime roles. I'd rewrite the LWP::UserAgent constructor to use a plugin system, where instantiators could provide an optional (and ordered) list of behaviors with which to decorate the $ua object. Obviously the behavior I need is first to check the cached copy and then check the robots.txt rules and then use normal HTTP access, but why hard-code these behaviors?

There are plenty of mechanisms (CLOS method modifiers, the Decorator pattern, dependency injection) to work around this problem, but for now my solution is to subclass LWP::UserAgent::WithCache, override its constructor, and manually inherit from LWP::RobotUA.
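To make the decorator idea concrete, here is a small sketch using closures instead of LWP classes. Everything in it is an assumption for illustration: a "fetcher" is just a code reference that takes a URL, and with_cache, with_robot_rules, and the stub $plain_fetch are invented names, not part of LWP.

```perl
use strict;
use warnings;

# Assumption: a "fetcher" is any code ref that takes a URL and returns a
# body, or undef when it declines to fetch.
my $network_hits = 0;
my $plain_fetch  = sub {
    my ($url) = @_;
    $network_hits++;
    return "body of $url";    # stand-in for real HTTP access
};

# Decorator: consult a robots-style predicate before delegating.
sub with_robot_rules {
    my ( $next, $is_allowed ) = @_;
    return sub {
        my ($url) = @_;
        return undef unless $is_allowed->($url);
        return $next->($url);
    };
}

# Decorator: answer from the cache when possible, otherwise delegate.
sub with_cache {
    my ( $next, $cache ) = @_;
    return sub {
        my ($url) = @_;
        $cache->{$url} = $next->($url) unless defined $cache->{$url};
        return $cache->{$url};
    };
}

# The caller, not a class hierarchy, picks the behaviors and their order:
# cache first, then robot rules, then plain access.
my $fetch = with_cache(
    with_robot_rules( $plain_fetch, sub { $_[0] !~ m{/private/} } ),
    {},
);

my $first  = $fetch->('http://example.com/feed');
my $second = $fetch->('http://example.com/feed');         # served from cache
my $denied = $fetch->('http://example.com/private/x');    # blocked by rules
```

Because each behavior only knows about "the next fetcher", reordering them or adding a proxy or compression layer means changing the composition at the bottom, not the inheritance tree.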

(For all of the faults of Java's IO model, it handles this problem well. Its defaults are awful, and it exposes too much complexity, but the Decorator pattern works effectively. PerlIO works in a similar fashion with much better defaults. This HTTP fetching problem is in the same category; note how a similar model could handle proxying, compressed output, anonymizers, and filtering with ease.)

Warning, philosophy ahead!

I alluded to this question when I asked "Should the Modern Perl book prefer cpanminus?" I've tried to explain my goal a couple of times in private, but I've never done so systematically, and I've never invited wide discussion.

I'm trying to figure out the right audience for the Modern Perl book in preparation for publishing this summer. My initial idea to write the book came from two places.

First, the Camel book is a decade old, and there'll never be a new edition which covers Perl 5. Despite the fact that fourteen (almost fifteen) stable releases of Perl 5 have come out since then (and at least four of them, possibly five, count as major), the canonical printed language reference is out of date and, at this point, all but abandoned.

Second, the best explanation of JavaScript I've ever encountered is JavaScript: The Good Parts. Like almost any other language, JavaScript has some good ideas, some poor implementations of good ideas, and some bizarre ideas that you shouldn't ever use. In 176 pages, Crockford explains how the language works and how its pieces fit together. Without covering everything, you can go from knowing how to dabble with it to writing good code.

Yet you're not an expert. You understand enough of the theoretical underpinnings of the language and the practical issues of using it to be productive, to use it to its advantages, and to avoid or at least work around its misfeatures.

I want to do something similar for Perl 5. I believe that understanding how to use perldoc is essential to programming Perl well, as is understanding the two forms of context and how they influence other code, as is understanding Perl's operator-oriented container-based type system.

I believe it's possible to explain how Perl 5 works in a couple of hundred pages, such that someone who's worked through a tutorial or two on setting up a Perl development environment and written something more than "Hello, world!" can understand Perl and continue to learn and to become productive. If you add in permission to experiment with small snippets of code, there are few limits as to where readers can go.

In short, I want to produce a book you can hand to someone who says "Perl? Oh, I've played around with it a bit." and tell them "Once you've read and understood this, you'll understand Perl."

I think I can do that without also bearing the burden of teaching people how to program in general. I assume that pointing interested novices to tutorials to set up Strawberry Perl and/or Padre is sufficient explanation for the basic material.

What do you think?

Package BLOCK for 5.14


Zefram recently posted a patch to add package BLOCK syntax to Perl 5. If this were available now in 5.12 (or 5.10.1), I'd use it in every class or module where I can use a modern Perl.

This is a (reasonably) small change to the Perl 5 parser, as far as those things go. A few introspection portions of Perl 5 need corresponding changes, such as B::Deparse and the MAD annotations that few people besides Larry have understood. Yet it's beneficial as it is for three good reasons and one great reason.

First, a package with an explicit, brace-delimited scope allows readers to see where a package begins and ends. This is my sole problem with Python's whitespace: the use of vertical whitespace to denote the ends of blocks, of functions, and of classes. Using invisible characters to mark the beginnings and ends means that people (let alone parsers) sometimes have to guess, and people can guess wrong. I use vertical whitespace to separate sub-units within larger sections of code.

Schwern has a code-skimming technique where you scale down your font size and look at the program as a whole. This makes the structure of the code visible without distracting you with details of any unit of code. Languages which allow you to organize code into functions or classes or modules expose details of your design when viewed this way.

The package BLOCK syntax would make the organization of code more obvious in this respect.

Second, the package BLOCK syntax clearly delineates a scope. Any entity defined within that block belongs to that package, and any referred to within that block is visible within that package. This is a subtle point, but it's also the important design reason to use the package keyword rather than requiring an explicit namespace on functions and methods:

sub My::Class::method
{
    ...
}

(Tedium also argues against that approach, but too few programming languages aggressively optimize against tedium. Perl 5 doesn't go far enough in that respect.)

This delineation is more than mere syntax. It's also visual. It suggests an encapsulation. Inside these curly braces is different from outside.

Third (and related), you prevent unintentional sharing of lexicals and other variables. Because package scoping rules and lexical scoping rules are orthogonal, it's easy to close over lexical variables declared at the top of a file even if you've switched packages several times.

Sometimes that's intentional, yet it's often enough a mistake that some people recommend starting your programs with:

#! perl

use Modern::Perl;   # or do the pragma dance yourself

main( @ARGV );

sub main
{
    # everything lives here, so no lexical is file-scoped by accident
    ...
}

They're right, too. Defensive coding sometimes requires you to change your habits to eliminate the possibility of making certain kinds of mistakes. Explicitly delimiting the lexical scope of a package means that lexicals can't leak into other packages by accident.
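A short demonstration of the leak and of the fix. The package names and the counters are made up; the point is only that a package statement changes the namespace for subs and globals, while a my variable follows the enclosing braces (or the file, when there are none).

```perl
use strict;
use warnings;

my $count = 0;    # file-scoped lexical, declared before any package switch

package First;
sub bump { return ++$count }

package Second;
sub bump { return ++$count }    # compiles fine: the same $count as First's

# A bare block around the package confines both the package statement and
# any lexicals declared inside it.
{
    package Third;
    my $private = 0;
    sub bump { return ++$private }
}

package main;

my @seen = ( First::bump(), Second::bump(), Third::bump() );
print "@seen\n";    # 1 2 1
```

First and Second share one counter without either package asking for it; Third's counter is unreachable from outside its braces.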

(Some people might argue that you can avoid these problems by adhering to a "one package per .pm file" policy, and that's true. It's also not always possible. Helper packages or inner classes or other entities encapsulated by their design deserve to stay encapsulated.)

You can get all three of these features now by writing:

{
    package Foo v1.23.45;
    ...
}

{
    package Bar v9.87.65;
    ...
}

{
    package Baz 0.07;
    ...
}

... but aesthetics argues against this form. So does linguistics. The important data for the unit as a whole is its name and its version, and that belongs at either the start or the end of the unit. For the same reason, units of smaller organizational granularity put their metadata before their blocks. Compare:

sub foo :attr
{
    ...
}

... with:

{
    sub foo :attr;
    ...
}

... or even:

while (1)
{
    ...
}

... with:

do
{
    ...
} while (1);

(I suspect that part of the distrust for do/while is linguistic distaste for inverted end weight, though whether that's due to familiarity with Algol-style control structures or a dislike for backreferences in code is for graduate students to debate.)

Without a change to Perl 5's syntax like Zefram's patch (or a grammar-mutating module built around Devel::Declare), it's impossible to attach the package's metadata where it belongs: at the start of the block.
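For illustration, this is roughly what the proposed form looks like in use. The sketch assumes a perl that accepts package NAME VERSION BLOCK (5.14 or later; it will not compile on 5.12), and the Counter package, its version, and next_value are invented for the example.

```perl
use strict;
use warnings;
use 5.014;    # assumption: the package BLOCK syntax is available

# Name and version sit at the start of the unit, ahead of its block.
package Counter v1.2.3 {
    my $count = 0;    # visible only inside these braces
    sub next_value { return ++$count }
}

# Back in the enclosing package; $count is out of scope here.
my @values = ( Counter::next_value(), Counter::next_value() );
```

The braces do double duty: they end the package and they end the lexical scope, so no closing package main; line is needed.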

Consistency suggests it. Linguistics suggests it. Correctness recommends it. I'd use this feature today, if it were available. I'd like to see it in Perl 5.14 next spring.

About this Archive

This page is an archive of entries from May 2010 listed from newest to oldest.
