Defending Against Its Dynamic Scope

| 11 Comments

A lot of the advice given in Modern Perl: the Book is advice learned the hard way, whether through making my own mistakes, debugging code written on my teams, or reviewing code written by other people to help novices become better programmers. After a decade-plus of this experience, I think I've developed a good sense of what people find confusing and what problems rarely occur in practice.

The pervasive use of global variables? It'll eventually catch up to you. Variables popping into existence upon use, not declaration? It'll cause problems far sooner than you ever expect.

Clobbering $_ at a distance inside a map block? It happened to me the other day. Yes, it surprised me too.

I've been attaching the Perl search bindings Lucy to a document processing engine. As part of the processing stage, my code adds documents to the search index. The index schema keeps track of specific fields, and it's denormalized slightly to decouple the document database from the search index. The code to add a document to the index creates a hash from method calls on each document object. That's where things started to go wrong:

sub add_entries
{
    my ($self, $entries) = @_;
    my $index            = $self->index;

    for my $entry (@$entries)
    {
        my $fields =
        {
            map { $_ => scalar $entry->$_() } keys %index_fields
        };

        $index->add_doc( $fields );
    }

    $index->commit;
    $self->clear_index;
}

I noticed things went wrong when my test bailed out with strange errors. Lucy was complaining about getting a hash key of '', the empty string. I was certain that %index_fields was correct.

While most of the methods called are simply accessors for simple object properties, these document objects have a method called short_content():

sub short_content
{
    my $self    = shift;
    my $meth    = length $self->content > $MIN_LENGTH ? 'content' : 'summary';
    my $content = $self->$meth();

    return unless defined $content;

    my $splitter     = Lingua::Sentence->new( 'en' );
    my $total_length = 0;
    my @sentences;

    for my $sentence ($splitter->split_array( $content ))
    {
        push @sentences, $sentence;
        $total_length += length $sentence;

        # must be a big mess, if this is true
        return $self->summary
            if  @sentences    == 1
            and $total_length > $MAX_SANE_LENGTH;
        last if $total_length > 0.65 * $MAX_LENGTH;
    }

    if (@sentences)
    {
        my $text = join ' ', @sentences;
        return $text if length $text > $MAX_SENTENCE_LENGTH && $text =~ /\S/;
    }

    return substr $content, 0, $MAX_SHORT_CONTENT_LENGTH;
}

A document may have a summary. It has content. short_content() returns a slice of the first significant portion of either, depending on which exists. While it's not the most detailed portion of the document, it's the earliest significant portion of the document, and it's demonstrably the best portion to index as a summary. (Thank you, inverted pyramid.)

The rest of this method attempts to break the short content at a sentence boundary, so as not to cut it off in the middle of a word or thought.

Nothing in that method obviously clobbers $_, but something called from it apparently does. (I wonder if Lingua::Sentence or one of its dependencies reads from a file or performs a substitution.) Regardless, I protected my precious hash key with a little defensive programming:

        my $fields =
        {
            map { my $field = $_; $field => scalar $entry->$field() }
                keys %index_fields
        };

... and all was well.

While this has been a very rare occurrence in 13+ years of Perl 5 programming, the trap is particularly insidious. The more work Perl does within that map block, the greater the potential for action at a distance. Furthermore, the better my abstractions—the more behavior hidden behind that simple method call—the greater my exposure to this type of bug.

Throw in things like dependency injection and Moose lazy attributes where you don't have to manage the dependency graph of your data initialization and flow yourself (generally a good thing) and you can't tell what's going to happen where or when.

If my instincts are correct, and something reads from a file somewhere such that only the first call to construct a Lingua::Sentence object clobbers $_, the point is doubly true.

(Sure, changing map to autolexicalize $_ would fix this problem, but it has backwards compatibility concerns for people who rely on this action at a distance—remember, it's only a problem if you write to $_—and it's too late to get such a change into Perl 5.16, even if it were only enabled with use 5.016;. Meanwhile, a variable assignment fixes it for me right now, and that will have to suffice.)

11 Comments

Or a bugfix to the module that's clobbering $_.

I could tell from reading your post that you know exactly where $_ is being globbered :) You're very diplomatic; I like it!

That's pretty weird. Given that => is just a synonym for the , operator, and the comma operator "[i]n list context, [is] just the list argument separator, and inserts both its arguments into the list. These arguments are also evaluated from left to right", I'd expect the $_ => ... to be evaluated before the $entry->$_()

But it's not, and very strangely so:

perl -MModern::Perl -e 'sub f { say "_ is $_"; $_++ }; $_ = 0; say join( ", ", $_, f(), $_, f(), $_, f(), $_, f() );'
_ is 0
_ is 1
_ is 2
_ is 3
4, 0, 4, 1, 4, 2, 4, 3

It works the same whether you use $_ or $x: the variable is evaluated at the end of the expression.

Using tie to see what runs when is illustrative:

perl -MModern::Perl -e 'package Tied; my $x = 0; sub TIESCALAR { bless {}, "Tied" } sub FETCH { say "FETCH $x"; $x } sub STORE { my ($this, $value) = @_; $x = $value; } package main; my $y; tie $y, "Tied"; sub f { say "y is $y"; $y = $y + 1; }; sub n { print "\n"; () } $y = 0; say join( ", ", $y, f(), n(), $y, f(), n(), $y, f(), n(), $y, f(), n() );'
FETCH 0
y is 0
FETCH 0
FETCH 1

FETCH 1
y is 1
FETCH 1
FETCH 2

FETCH 2
y is 2
FETCH 2
FETCH 3

FETCH 3
y is 3
FETCH 3
FETCH 4

FETCH 4
FETCH 4
FETCH 4
FETCH 4
4, 1, 4, 2, 4, 3, 4, 4

As we can see, the value of $y is fetched four times, right before running join, not interleaved with the calls to f().

... Or was that something that everyone but me already knew? :)

In your example I think you'll find that the fetches are happening from within the join function rather than right before it. The function will get the variable not the value of the variable (but it will get the result of those f(), n() functions first).

In a simple map such as... map { $_, f($_) } ... it is less obvious but is the same situation: the actual map block is returning a list containing $_ (rather than the value of $_), and whatever happens after that is fetching the value.

thirty5penguins: Indeed, some experimentation shows you've got it exactly right. Thanks for the explanation, it all makes much more sense now.

(actually in the map example I suppose it is the implicit return that fetches the value)

Want something weirder? Interpolate $_ into its own string as the hash key and the clobbering still happens.

So, has a ticket been filed, or the author notified yet?

...according to the docs, an alternative tracker is preferred to RT, so I suppose on the next update, the meta data could be added to tell search.cpan.org where the preferred tracker is.

Modern Perl: The Book

cover image for Modern Perl: the book

The best Perl Programmers read Modern Perl: The Book.

sponsored by the How to Make a Smoothie guide

Categories

Pages

About this Entry

This page contains a single entry by chromatic published on April 9, 2012 11:37 AM.

Perl and that Dirty Word was the previous entry in this blog.

use_ok() is Broken Because require() is Broken is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.


Powered by the Perl programming language

what is programming?