On Parsing Perl 5

For some reason, a PerlMonks post from last year, Perl Cannot Be Parsed, has resurfaced in online discussions.

As you might expect, most comments are ill informed and blatant in their wrongness.

At least half of the comments are the equivalent of "OH noes i cantn't read PERL LOLLERS now i is nos Y!!!", but that's what you get for reading Twitter, Digg, YouTube, and occasionally Reddit. (If no one's invented the phrase "The Eternal 4channing of the Internet", I claim it.)

Another significant portion of the comments come from people who've read only the headline and feel sufficiently enlightened that they can proclaim their wisdom to the world at large, despite comments and clarifications on that PerlMonks post from people who've actually worked on the Perl 5 internals and understand the implications of that statement and their limitations.

A few comments are from people who know what this means and what it implies and those (few) cases where it applies.

I suspect the rest of the programming world which might care persists in a limited state of confusion. What does this mean? Why does it matter? What happens now? This is especially true for people who use Perl but don't necessarily consider themselves programmers. They don't have computer science backgrounds, but they'd use good tools like Perl::Critic if they knew they were available.

On Static Parsing

The unparsability of Perl 5 is a static unparsability. "Static" is a specific term of art which means, basically, "You can't run any part of the program to determine what it means."

One of the most common points of confusion in the discussion is a misunderstanding of what "static" means. This doesn't mean that it's impossible to parse Perl 5. The files perly.y and perltoke.c in the Perl 5 source tree do just that. However, the existence of subroutine prototypes and bareword declarations combined with the possibility of running arbitrary code during compilation with BEGIN blocks introduces ambiguity to the process.

That's all the proof addresses.

I realize that's not obvious to everyone reading this so far. That's fine. Just remember that the proof addresses important but limited circumstances.

Why limited? It is possible to parse Perl 5 code statically in many cases -- most cases, even. PPI is an example of this. PPI can build a document object model representing Perl 5 programs sufficient to perform static analysis; this is why Perl::Critic works.

PPI isn't perfect, and it can't resolve all possible Perl 5 programs to unambiguous object models. Part of that is the malleability of Perl 5 syntax.

Changing the Parse

I mentioned subroutine prototypes and barewords; these are two cases where predeclaration changes how Perl 5 will parse existing code.

Subroutine prototypes can change the expected arity and parse and runtime behavior of subroutines. In general, you should avoid them unless you know exactly what this means and what it implies.

Changing Expectations

In specific circumstances, they're useful to change how Perl 5 passes arguments to the functions. My favorite prototype lets you pass blocks as anonymous subs:

sub wrap (&) { ... }

my $wrapped = wrap { do_this(); do_that() };

Without the & prototype character, you'd have to call wrap() with more verbosity:

my $wrapped = wrap( sub { ... } );

It's a small prettification, but it's occasionally useful.

There are two important points to note about this example. First, Perl 5 must encounter the sub declaration with its prototype prior to the point at which it parses the call to wrap. The prototyped declaration changes how Perl 5 parses that construct -- otherwise the curly braces would run through the "is it a hash reference or a block?" heuristic and would not get promoted to an anonymous subroutine declaration.

Second, a smart and stateful static parser could identify this type of declaration and parse the code the same way the Perl 5 parser does. This is not by itself sufficient to make parsing Perl 5 statically indeterminate.

(Some people might object that statefulness in a static parser violates some rule of computer theory somewhere, which is a true point but practically useless; you need at least some statef in your parser to find the end of a double-quoted string if you allow escaped embedded double quotes. The question is not if you allow lookahead but how much lookahead you need. Also parser people, remember that I reserve the right to tell small fibs to explain things to an audience that won't ever take a class about automata.)

Changing Arity

Arity can be more interesting; this is a fancy programming term which means "how many arguments a function or operator takes". Perl 5 subroutines are usually greedy; they'll happily slurp up all arguments in a big list. Certain subroutine prototypes allow you to change this. For example, you can define a foo subroutine which only takes two arguments:

sub foo ($$)
{
    return "<@_>";
}

say foo 1, 2;

The prototype changes the parsing of foo calls so that providing more than two arguments is a syntax error:

say 'Parsing done';
say foo 1, 2, 3;
$ perl protos.pl
Too many arguments for main::foo at protos.pl line 13, near "3;"
Execution of protos.pl aborted due to compilation errors.

Again, Perl 5 must have encountered the prototyped declaration before it can apply the parsing rule changes to any code. Again, a clever and stateful static parser could detect these parser changes and adapt to them.

Barewords

The Perl 5 parser has a complex heuristic for figuring out what a bareword actually means. It could be a filehandle, a class name, a function, or more. Several good Perl 5 programmers recommend against the indirect object invocation syntax (though dative is a description with more technical accuracy for linguists) because it can parse differently depending on code Perl 5 has already executed.

my $obj = new Class foo => 'bar'; # do not mimic this example

Perl 5 has heuristics to determine what new and Class mean in this context. It may be obvious to readers that this code really means:

my $obj = Class->new( foo => 'bar' );

... but if the parser has already seen a subroutine declaration for Class in this particular namespace (in certain circumstances, I believe it's possible to fool the parser with a specific way of declaring new), the heuristic may choose the wrong option.

The indirect pragma is an effective way to root out this ambiguity from your code. There are some seven cases in which you can run afoul of this bareword heuristic; you can tell who's a real Perl 5 guru by asking him or her to list at least five. See Matt S. Trout's Indirect but Still Fatal for a fuller discussion of the issues here.

A very clever, very careful static parser could still replicate the complex heuristic in the Perl 5 parser and parse code appropriately.

The Real Problem

What's the source of the ambiguous parsing issue?

If Perl 5 requires predeclaration of all parse modifications, and if they're declarative, why is writing a static parser which can parse all Perl 5 programs in the same way that the Perl 5 parser does impossible?

Perl 5's compilation phase parses Perl 5 code into a parse tree. The execution phase traverses this tree. They're two separate stages...

... except that BEGIN and implicit BEGIN blocks can run arbitrary code during the compilation phase.

That is, you can write code that will not compile successfully:

sub foo ($$)
{
    return "<@_>";
}

say foo 1, 2, 3;

... and insert a BEGIN block to change the way Perl 5 will parse subsequent code:

sub foo ($$)
{
    return "<@_>";
}

BEGIN
{
    no warnings qw( prototype redefine );
    eval 'sub foo { return "<@_>" }'
}

say foo 1, 2, 3;

Even though the eval code wouldn't normally run until after the compilation stage has finished and the execution stage has begin, the BEGIN block causes the Perl 5 parser to execute its contents immediately after its parsing has finished -- before parsing the very next line.

This is a contrived example -- why would anyone write this code? -- but you should start to see the problem.

Even though this syntax is declarative to the Perl 5 parser, a static analysis tool would have to analyze the semantic meaning of code within the BEGIN block to determine if it has any effects. That means peeking into double-quoted strings used as arguments to eval.

Even if a static parser could do that, it'd be easy to confuse it again:

BEGIN
{
    no warnings qw( prototype redefine );
    eval 'sub' . ' foo ' . '{ return "<@_>" }'
}

Even though Perl 5 can coalesce those three concatenations to a single constant string during its optimization phase, the parser must get more complex to detect this case.

You can see where this is going.

Ultimately, completely accurate static parsing of Perl 5 programs is impossible in certain cases because I could write:

BEGIN
{
    no warnings qw( prototype redefine );
    eval 'sub' . ' foo ' . '{ return "<@_>" }'
        if rand( 1 ) > 0.5;
}

The static analysis tool can't even tell if this program would parse successfully. About half the time, Perl 5 will report that it has a syntax error. Half of the time it will succeed.

Thus there exist pathological cases written by curious (and potentially malicious) people such as myself where a static Perl 5 parser cannot predict what the actual Perl 5 parser will do. If you're morbidly curious, consider changing the prototype of a symbol already parsed through raw typeglob access.

The Implications

Don't panic. Static analysis tools can't parse all valid Perl 5 programs, but they can parse many Perl 5 programs. While the Perl 5 parser is complex with many special cases, it's possible to write a clever tool such as PPI which is good enough for almost all code. Besides that, the language-bending manipulations that could trip up a static analysis tool are mostly declarative and mostly decidable, unless you're doing something beyond weird.

In most cases this doesn't matter. That example code is completely unreliable. There are ways to disambiguate potentially ambiguous code -- and good Perl 5 hackers recommend using those approaches anyway. Most languages without a very strict homoiconic representation have parsing ambiguities (and if you allow arbitrary code execution in templates or reader macros, even homoiconicity can't always save you).

Besides that, once you allow loading language extensions from languages with access to raw memory, all bets are off.

With a bit of work, I'm sure I could come up with ways to confuse static Ruby and Python parsers without resorting to C. Implicit variable declarations seems like one perilous approach....

5 Comments

hercynium.myopenid.com | August 14, 2009 2:51 PM

I admit right-up-front, this topic is likely way over my head, however, my understanding is that pretty much *any* language that allows the dynamic manipulation of either the parser or the parse tree will 'suffer' this unparseability problem.

Of course, one could build their parser to detect those manipulations and account for them, but that too seems to be impossible to do safely and properly. A simple example:



if x == foo {

   change_this_parse_rule()

}

else {

   change_that_parse_rule()

}

The parser would have to have to know the value of x to know which modifications would be made and in order to do that, it would have to execute the program up until that point.

I would imagine that if the goal is to only determine how a program would be parsed then executing parts of it would be highly undesirable. Come to think of it, it defeats the 'static' part of static analysis. Never mind the can 'o worms that eval opens up! Seeing as *many* different languages seem to have these capabilities I seriously doubt that static unparseability is unique to Perl 5.

chromatic | August 14, 2009 2:59 PM

@hercynium, your understanding is correct. Any language which allows the execution of arbitrary code (with statically undecidable inputs, likely) will prevent perfect static code analysis.

Yuval Kogman | August 14, 2009 6:11 PM

The second law of internet dynamics states that the 4channing of an isolated website which is not in equillibrium will tend to increase over time. Case in point, LtU.

I seriously doubt that static unparseability is unique to Perl 5.

Lisp and its deriviatives, Forth, and related (but not technically unparseable) ambiguities exist in C and C++. See the LtU link for more discussion if your horse is not dead yet.

Yuval Kogman | August 14, 2009 7:01 PM

The second law of internet dynamics states the 4channing of an isolated website which is not in equilibrium will tend to increase over time. case in point, LtU.

I seriously doubt that static unparseability is unique to Perl 5

Correct. Lisp and derivatives, Forth, to name a few. C and C++ suffer from similar ambiguity in their grammars, though they are still statically parseable. You don't need to know how to run C to parse it, but you need to know which symbol in what context means what.

Aristotle Pagaltzis | August 15, 2009 5:12 AM

Would you mind amending the article, where you write “For example, you can define a foo subroutine which only takes two arguments”, to point out that this isn’t the only thing that the prototype does, and that the other things it does are almost always unwelcome?

Otherwise, someone who doesn’t know what prototypes actually do might get the impression that they’re actually useful. I always worry about that when I talk about prototypes, so I’m very careful to preface all instances where I mention them with an explicit caution to stay away.

On Static Parsing

Changing the Parse

Changing Expectations

Changing Arity

Barewords

The Real Problem

The Implications

Tags:

5 Comments

Modern Perl: The Book

Categories

Monthly Archives

Pages

About this Entry