Extract Your Traversal

One reason I hate writing parsers is that I hate writing traversal code. I know how to make a tree-like data structure. I know how to traverse that data structure depth-first and breadth-first. (I even understand that a tree is a specific case of a graph and I've traversed graphs.)

These problems are math problems, and having solved them once, I've satisfied my curiosity and would rather be off solving new problems.

Last week, Allison offered me a bug. "Here's an HTML stripper," she told me. "It doesn't handle nested tags correctly."

The correct way to process a language of arbitrary complexity with effectively infinite nesting is to parse it, rather than using regular expressions. (There's math behind that as well. I've done it to my satisfaction. If you want to learn more, look into the theory of grammars and natural languages and automata.) Parsing means using a parser.

Parsing often means creating a graph-like data structure. In the case of HTML, if it's anything even close to well-formed, it's a tree with a single root element and an arbitrary amount of children. To find only the textual content—to remove all markup correctly—you have to traverse the tree from the root to the leaves. You can't assume, for example, that you'll only ever have to turn something like:

<p>Hello, world!</p>

... into "Hello, world!", because as soon as you do that, a user will track chaos and entropy all over your clean floor by expecting you to turn:

<p>Hel-<em>LO</em>, world!</p>

... into "Hel-LO, world!", which is the bug Allison asked me to fix. (The fact that the naïve parser we had produced "Hel-, world!", not that users make demands. I don't have the source code to the human race in the preferred form to make and distribute modifications.)

The code had a nice comment in it that said, in short, "Yes, this doesn't handle nested tags. You'll have to use a more complicated recursive solution." At least it used HTML::TreeBuilder and had that comment.

As you can guess from the name, HTML::TreeBuilder turns a flat wad of text in HTML format into a tree. Then you can walk the tree, inspecting each node all the way from the root to the leaves.

That means recursion.

(The problem with the buggy code is that it had a hard-coded limit of looking only at the direct descendants of the root node. I could have made it look two levels down, but what if users nested tags even further? You end up with code indented so far to the right that not even your modern widescreen laptop will save you from all of the whitespace on the left, or you realize that that's why recursion exists, or someone hands you The Little Schemer and all of a sudden you get the functional enlightenment and vow to learn Emacs for real and when you wake up in a cold sweat a week later, you're better off for it.)

I hate writing recursive code to handle parsing and data transformation. It's always an exercise in writing code that's so very similar to the last code I wrote to do almost the same thing that it triggers my "Why don't you just automate this once and for all and be done with it?" guilt center, as irrational as I know that feeling to be. I've used the Visitor Pattern sometimes and I've used roles to compose in visitor methods, but I've never found an approach I don't end up hating a little bit.

In the spirit of the original code, I decided to embrace limitations (after all, our users are great about filing bugs) and handle only two tags explicitly. Paragraph tags are obvious. Text inside paragraph tags should appear pretty much verbatim, and they need blank lines between them. Hyperlinks need a little more formatting; they ought to appear on their own lines, to make them easier to click.

Any other tag needs its textual components extracted, and we can worry about things like tables and weird spans later. (If that ever occurs, our users will file bugs.)

Because every tag may have nested content, the part of the parser that handles the content must be able to descend into any taggy children to look for hyperlinks or paragraphs or text.

Then inspiration struck. (I've written this kind of code far too many times. Perhaps it wasn't insight. Perhaps it was repetition fatigue.) You traverse the children of an HTML::Element node (returned from HTML::TreeBuilder) with something like:

    $traverse = sub {
        my $node = shift;
        my $text = '';
        for my $child ($node->content_list()) {
            if (ref $child) {
                my $tag    = $child->tag();
                my $action = $actions{ $tag } || $traverse;
                $text .= $action->( $child );
           } else {
                $text .= $child;
           }
        }

        return $text;
    };

In plain English, the nodes returned from an Element's content_list() method are either simple scalars of textual content or HTML::Element objects themselves. You gather up the text and return it and recurse into the objects to find their textual contents, and on and on until you've visited the leaves.

... but as you can see in that code, it doesn't actually do anything other than gather text. It just recurses, and it looks in a hash called %action. The complete code is:

sub html2text {
    my ( $html ) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html)->elementify();
    my $top  = $tree->find_by_tag_name( 'body' ) || $tree;

    # declare here to close over in hash
    my $traverse;
    my %actions = (
        p => sub {
            my $text = $traverse->( shift );
            return '' unless $text;
            return $text . "\n\n";
        },
        a => sub {
            my $node = shift;
            my $text = $traverse->( $node );
            my $link = $node->attr( 'href' );

            return $text . ":\n\n" . $link . "\n\n";
        },
    );

    $traverse = sub {
        my $node = shift;
        my $text = '';
        for my $child ($node->content_list()) {
            if (ref $child) {
                my $tag    = $child->tag();
                my $action = $actions{ $tag } || $traverse;
                $text .= $action->( $child );
           } else {
                $text .= $child;
           }
        }

        return $text;
    };

    my $text = $traverse->( $top );

    $text =~ s/^\s+//g;
    $text =~ s/\s+$//g;

    return $text;
}

You can see the two special rules for handling paragraphs and hyperlinks. Anything else gets treated as plain old boring tags which may eventually contain text nodes. All of the recursion is in $traverse. Handling a heading is as easy as adding another entry to the hash. As long as that entry calls $traverse, the contents of that heading get handled correctly, without me having to write any more recursive code.

I like not worrying about me or someone else getting the recursion and its end cases correct. (I wrote it wrong the first time, in fact.) I really like having the default case do the right thing for almost everything. I very much like having the mechanism of the recursion separate from the formatting.

Mark Jason Dominus makes the point in his Higher Order Perl that this sort of code is the bread and butter of computer science, and it's the sort of thing that a great computer science education (did I mention Scheme before and Structure and Interpretation of Computer Programs?) will teach you. He also says that Perl's syntax is uglier than it needs to be.

I hate disagreeing with MJD, because he usually ends up right, but I don't mind the look of this code all that much. I prefer the anonymous functions to the structure of setting up and naming functions; that seems to me to spread out the essential behavior, whereas this code keeps a data structure near a traversal function in a pleasing way.

With that said, I can think of at least two ways to improve this code, or at least write it differently. One of them requires Perl 5.16, but the other requires you to rethink control flow. More on both later.

Tags:

Modern Perl: The Book

Categories

Monthly Archives

Pages

About this Entry