<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Modern Perl Books for modern Perl programming</title>
    <link rel="alternate" type="text/html" href="http://www.modernperlbooks.com/mt/" />
    <link rel="self" type="application/atom+xml" href="http://www.modernperlbooks.com/mt/atom.xml" />
    <id>tag:www.modernperlbooks.com,2009-01-23:/mt//1</id>
    <updated>2012-05-16T23:39:21Z</updated>
    <subtitle>To solve a problem now, reach for Perl. To solve a problem right, reach for Modern Perl.
</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type 4.23-en</generator>

<entry>
    <title>Time Will Tell</title>
    <link rel="alternate" type="text/html" href="http://www.modernperlbooks.com/mt/2012/05/time-will-tell.html" />
    <id>tag:www.modernperlbooks.com,2012:/mt//1.450</id>

    <published>2012-05-16T22:28:41Z</published>
    <updated>2012-05-16T23:39:21Z</updated>

    <summary>The May 2012 Dr. Dobb&apos;s interview with Ward Cunningham has an interesting quote about Ward&apos;s notion of technical debt: I was really devoted to finding great code, especially when objects were new. Objects gave us an extra dimension beyond functional...</summary>
    <author>
        <name>chromatic</name>
        <uri>http://www.wgz.org/~chromatic</uri>
    </author>
    
    <category term="refactoring" label="refactoring" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="softwaredevelopment" label="software development" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="testing" label="testing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.modernperlbooks.com/mt/">
        <![CDATA[<p>The <a href="http://www.drdobbs.com/architecture-and-design/240000393">May 2012 Dr. Dobb's interview with Ward Cunningham</a> has an interesting quote about Ward's notion of technical debt:</p>

<blockquote><em>I was really devoted to finding great code, especially when objects were new. Objects gave us an extra dimension beyond functional decomposition. And the question was, "Are these the right objects or not?" And the answer was, "Time will tell."</em></blockquote>

<p>I work off and on with a handful of great programmers in the Portland area.
Several years ago, <a href="http://jamesshore.com/">James Shore</a> and <a
href="http://woldrich.com/">Dave Woldrich</a> created <a
href="http://cardmeeting.com/">CardMeeting</a>, an agile remote collaboration
tool. Jim and Dave are both very good programmers. For this project, they
decided to forgo their usual test-driven development and just write code so as
to deliver a working prototype on a very strict deadline.</p>

<p>Jim took to calling that experience "leveraged technical debt". My estimate
(not having read the code, but having tested a lot of code written without
testing in mind) is that it takes at least as long to write tests for untested
code as it took to write the code <em>and much longer the more time has passed
between writing the code and writing the tests</em>.</p>

<p>Jim, Dave, and I have all worked on small, software-driven businesses doing
things we've never seen anyone else do before. We've all had to deal with the
risk of building lots of code that may or may not solve the problems of real
customers with real money. When I say <a
href="http://www.modernperlbooks.com/mt/2012/05/write-the-wrong-code-first.html">write
the wrong code first</a>, I don't mean "deliberately do things you know won't
work" or "paint yourself into a corner" or even "use the fact you don't know
everything you're doing as an excuse to play with completely new technologies
you don't know how to use". (Not that the last is a bad thing, but if you
decide to do that, do so only after you've considered the risks and the
rewards.)</p>

<p>Last night, we had a short conversation with <a
href="http://johnwilger.com/">John Wilger</a>, another PDXer. He works with a
successful and relatively young startup with a huge software component. I don't
want to put words in his mouth, but it sounds like their software is,
colloquially, a mess. Their developer team is trying to get to the point of
slapping hands whenever someone needs to make a change and starts by copying
and pasting code.</p>

<p>Four years after founding (and two years after discovering its cash cow
business), the company was worth at least $3 billion.</p>

<p>It's irresponsible to derive meaningful statistics from a single data point,
but we can say this: the technical debt of their codebase didn't entirely
prevent the company from achieving its current measure of success. (You can
also say that the liberal application of candy-flavored magical unicorn
shavings of Ruby and Rails didn't prevent people from making an unholy
mess.)</p>

<p><em>Time will tell</em> if changing the development culture and refactoring
the code and paying down all of the technical debt will help the company adapt
and take advantage of new opportunities.</p>

<p><em>Time will tell</em> if the codebase collapses under its own weight.</p>

<p><em>Time will tell</em> if a competitor (and several exist!) will prove more
agile and nimble because it has much better flexibility thanks, in part, to
better code.</p>

<p>The whole situation reminds me of <a
href="https://www.facebook.com/notes/facebook-engineering/the-hiphop-virtual-machine/10150415177928920">Facebook's
HipHop virtual machine</a>, where it's apparently cheaper and easier and
faster and less risky to hire lots of developers to create and maintain a
compatibility layer for the existing code than to rewrite existing code in a
better language, or in a better fashion, or to improve it meaningfully.</p>

<p>I'm not suggesting that the only way to build a big business from nothing is
to write bad code. I'm not suggesting that scaling to billions in revenue is
the goal of all software-driven businesses. I'm not suggesting that you have to
choose between test-driven development and business success.</p>

<p>In an ideal world, I can write the right software the first time. I can have
sufficient test coverage to have complete confidence in the behavior of the
code. I can deliver a feature which gets me paying customers in an afternoon
without having to rewrite other parts of the code or taking shortcuts I know
that I'll have to clean up when I get a spare weekend afternoon.</p>

<p>For a profession where some of us call ourselves "engineers", we certainly
spend a lot of time discussing practical concerns as if the risks and rewards
and limitations of the real world did not apply. (I wonder if the
academic/practical divide between computer science and software development has
some relationship to this.)</p>

<p>In the real world, I have to remind myself every day when I'm working on
proof of concept code that proving my concept workable is more important than
solidifying my code into well-tested and well-designed software and when I'm
working on code I intend to keep that doing things as right as possible now
will help me modify it to get it more right in the future.</p>

<p>None of this guarantees success. All of this benefits from the hard-won
experiences I have from doing things the wrong way&mdash;and occasionally
getting it very right. (In the real world, I spent part of the day finding and
deploying a shim to turn SVG into VML for Internet Explorer 8 and earlier.)</p>

<p>Maybe Jim and Dave could have thrown out a couple of features and spent more
time writing tests for the most valuable parts of their application. Maybe I'm
wasting my time optimizing SQL queries for a search feature no one will ever
use. Maybe John's company waited too long to untangle the admin and the user
sides of their application.</p>

<p>If we're honest with ourselves, the best answer we can give is that time
will tell. May we pay attention when it does.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Separating Presentation from Content in Templates</title>
    <link rel="alternate" type="text/html" href="http://www.modernperlbooks.com/mt/2012/05/separating-presentation-from-content-in-templates.html" />
    <id>tag:www.modernperlbooks.com,2012:/mt//1.449</id>

    <published>2012-05-14T18:47:11Z</published>
    <updated>2012-05-14T19:10:14Z</updated>

    <summary>A couple of comments on Simple Attribute-Based Template Exporting have asked for an example. I&apos;ll show off more of this code in my YAPC::NA 2012 and Open Source Bridge 2012 talk about how to write the wrong code (along with...</summary>
    <author>
        <name>chromatic</name>
        <uri>http://www.wgz.org/~chromatic</uri>
    </author>
    
    <category term="modernperl" label="modern perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="perl" label="perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="templating" label="templating" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="webprogramming" label="web programming" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.modernperlbooks.com/mt/">
        <![CDATA[<p>A couple of comments on <a
href="http://www.modernperlbooks.com/mt/2012/05/simple-attribute-based-template-exporting.html">Simple
Attribute-Based Template Exporting</a> have asked for an example. I'll show off
more of this code in my <a href="http://act.yapcna.org/2012/talk/50">YAPC::NA
2012</a> and <a href="http://opensourcebridge.org/proposals/796">Open Source
Bridge 2012</a> talk about how to write the wrong code (along with a handful of
other techniques).</p>

<p>(I assume some knowledge of <a
href="http://search.cpan.org/perldoc?Template">Template Toolkit</a> (besides
far too many books about finance, accounting, and investing, the Template
Toolkit book is always within reach these days); I've set up a wrapper template
which provides the standard look and feel of my application and I
include/process other templates liberally. If you understand that much, you'll
be able to follow along.)</p>

<p>One of the interesting templates in the system displays a list of chapters
of a book in progress. A cron job rebuilds a static page from this template
once a day. The template looks something like:</p>

<pre><code>[% USE Bootstrap -%]
[%- canonical_url = 'http://sitename.example.com/book/' _ link -%]

[%- add_og_properties({
    'fb:admins'      =&gt; '436500086365356',
    'og:title'       =&gt; title _ ' | sitename.example.com',
    'og:type'        =&gt; 'article',
    'og:image'       =&gt; 'http://static.sitename.example.com/images/logo.png',
    'og:url'         =&gt; canonical_url,
    'og:description' =&gt; text.chunk(300).0,
    'og:site_name'   =&gt; 'Sitename: site tag line',
   })
-%]
[%- add_meta(
    'pagetitle'     =&gt; title _ ' | sitename.example.com',
    'feed_url'      =&gt; 'http://static.sitename.example.com/book/atom.xml',
    'canonical_url' =&gt; canonical_url
) -%]

[% article_text = BLOCK -%]
&lt;article&gt;
&lt;h2&gt;[% title | html %]&lt;/h2&gt;
&lt;p&gt;Published: &lt;time datetime="[% date %]"&gt;[% nice_date %]&lt;/time&gt;&lt;/p&gt;
[% text %]
&lt;/article&gt;

&lt;ul class="pager"&gt;
[%- IF prev -%]
    &lt;li&gt;&lt;a href="[% prev.link %].html"&gt;&larr; [% prev.title | html %]&lt;/a&gt;&lt;/li&gt;
[%- END -%]
    &lt;li&gt;&lt;a href="/onehourinvestor"&gt;index&lt;/a&gt;&lt;/li&gt;
[%- IF next -%]
    &lt;li&gt;&lt;a href="[% next.link %].html"&gt;[% next.title | html %] &rarr;&lt;/a&gt;&lt;/li&gt;
[%- END -%]
&lt;/ul&gt;

[% INCLUDE 'components/social_links.tt', title =&gt; title %]
[%- END -%]

<strong>[%- row(
    maincontent( article_text ),
    sidebar(
        sideblock( process( 'components/cached/book_latest_chapters.tt' ) ),
        sideblock( process( 'components/cached/book_drafts.tt'          ) )
    )
) -%]</strong></code></pre>

<p>The emboldened lines are most important; they put all of the
<em>content</em> produced or assembled by this template in the HTML structure
the site needs. That is to say, everything on the site needs to fit into
something I call a <code>row</code>. A <code>row</code> can contain multiple
elements, such as <code>maincontent</code> and a <code>sidebar</code>, or
<code>fullcontent</code> by itself with no <code>sidebar</code>. A
<code>sidebar</code> can contain multiple <code>sideblock</code>s.</p>

<p>(You can ignore the other functions; they put metadata in the right places
to pass to wrapper templates.)</p>

<p>Within my template plugin (called <code>Bootstrap</code>), each of these
elements is a simple Perl function which takes one or more arguments and
interpolates them into some HTML:</p>

<pre><code>sub row :Export
{
    return &lt;&lt;END_HTML;
&lt;div class="row"&gt;
    @_
&lt;/div&gt;
END_HTML
}

sub sidebar :Export
{
    return &lt;&lt;END_HTML;
&lt;div class="span4"&gt;
    @_
&lt;/div&gt;
END_HTML
}</code></pre>
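
<p>Called directly from Perl, outside any template, the helpers nest the same
way they do in the template above. This is only a sketch: <code>row()</code>
and <code>sidebar()</code> follow the code shown here, while the bodies of
<code>maincontent()</code> and <code>sideblock()</code> and their class names
(<code>span8</code>, <code>well</code>) are assumptions made for
illustration.</p>

<pre><code>use strict;
use warnings;

sub row
{
    return &lt;&lt;END_HTML;
&lt;div class="row"&gt;
    @_
&lt;/div&gt;
END_HTML
}

sub sidebar
{
    return &lt;&lt;END_HTML;
&lt;div class="span4"&gt;
    @_
&lt;/div&gt;
END_HTML
}

# Assumed body and class name; the post does not show this function.
sub maincontent
{
    return &lt;&lt;END_HTML;
&lt;div class="span8"&gt;
    @_
&lt;/div&gt;
END_HTML
}

# Assumed body and class name; the post does not show this function.
sub sideblock
{
    return &lt;&lt;END_HTML;
&lt;div class="well"&gt;
    @_
&lt;/div&gt;
END_HTML
}

# Each helper returns a string, so the calls compose like ordinary functions.
my $html = row(
    maincontent( '&lt;p&gt;Chapter text&lt;/p&gt;' ),
    sidebar(
        sideblock( '&lt;p&gt;Latest chapters&lt;/p&gt;' ),
        sideblock( '&lt;p&gt;Drafts&lt;/p&gt;' ),
    ),
);

print $html;</code></pre>

<p>Because every helper only wraps its arguments in markup, changing the layout
means editing one function rather than every template that uses it.</p>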

<p>(I initially tried to write these functions as templates within Template
Toolkit itself, but there comes a point at which you want a real language. That
point came very early for me.)</p>

<p>I have no love for the <code>varname = BLOCK</code> pattern necessary to
populate variables to pass to these plugin functions, but it works for now. In
some of my templates&mdash;usually those with lots of text I might end up
changing later&mdash;I extract that text into a separate template under
<em>components/content/</em> to make it easy to edit. (This idea came up during
a client project where the client wanted to edit the legal clickthrough
arrangement after users create accounts. I didn't want lawyers or anyone to
have the ability to mess up the templating language, so I said "Edit this
single file as plain HTML and you'll be fine." It worked great.)</p>

<p>While my programmer brain says "This is ugly, and you're a horrible person
for committing this hack upon the world&mdash;you're calling Perl from your
template system to generate HTML you're stuffing into a template and that puts
your presentation elements in Perl code, you awful human being!", it keeps the
presentation code in a single place where I can update it infrequently (being
that I don't change the layout of the site dramatically) without having to
change the divs and classes of multiple templates.</p>

<p>I'm not arguing that this technique as expressed here is <em>right</em>.
It's probably not optimal; there may be easier approaches to achieve the same
effects.</p>

<p>I am saying that this currently works very well for me. I'm not typing the
same HTML over and over and over again, and I can tweak it much more easily
than I did before when I was refining the look and feel. In fact, I've even
<em>forgotten</em> the exact details of the layout, from the HTML/CSS point of
view, and now think only in terms of rows, maincontent, and sidebars.</p>

<p>Working abstractions are very nice.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Simple Attribute-Based Template Exporting</title>
    <link rel="alternate" type="text/html" href="http://www.modernperlbooks.com/mt/2012/05/simple-attribute-based-template-exporting.html" />
    <id>tag:www.modernperlbooks.com,2012:/mt//1.448</id>

    <published>2012-05-11T20:29:01Z</published>
    <updated>2012-05-11T21:33:18Z</updated>

    <summary>If you&apos;re like me and your design skills are sufficient to modify something decent to look nice but insufficient to create something from first principles, you can do a lot worse than to play with Twitter Bootstrap for your next...</summary>
    <author>
        <name>chromatic</name>
        <uri>http://www.wgz.org/~chromatic</uri>
    </author>
    
    <category term="cpan" label="cpan" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="modernperl" label="modern perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="perl" label="perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="webprogramming" label="web programming" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.modernperlbooks.com/mt/">
        <![CDATA[<p>If you're like me and your design skills are sufficient to modify something
decent to look nice but insufficient to create something from first principles,
you can do a lot worse than to play with <a
href="http://twitter.github.com/bootstrap/">Twitter Bootstrap</a> for your next
web site.</p>

<p>I've used it successfully for a few projects and it's been great.</p>

<p>It's a lot better now that I've written my own silly little <a
href="http://template-toolkit.org/">Template Toolkit</a> plugin to reduce the
need for writing lots of repetitive HTML in my templates. (It's like <a
href="http://haml-lang.com/">Haml</a> but less ugly and more Perlish and easier
to extend.)</p>

<p>Writing a TT2 plugin is relatively easy. Of course I do it the wrong way;
when you initialize your plugin, you have the ability to manipulate TT2's
stash. This is the data structure representing the variables in scope in your
templates. Where a well-behaved template should use object methods to perform
its operations, my code stuffs function references in the stash. Here's the
relevant code:</p>

<pre><code>sub new
{
    my ($class, $context, @params) = @_;

    $class-&gt;add_functions( $context );

    return $class-&gt;SUPER::new( $context, @params );
}

sub add_functions
{
    my ($class, $context) = @_;
    my $stash             = $context-&gt;stash;

    while (my ($name, $ref) = each %exports)
    {
        $stash-&gt;set( $name, $ref );
    }

    $stash-&gt;set( process =&gt; sub { $context-&gt;process( @_ ) } );
}</code></pre>

<p>I'll fix this eventually, but the process of making this work was
interesting.</p>

<p>In my first attempt (see <a
href="http://www.modernperlbooks.com/mt/2012/05/write-the-wrong-code-first.html">Write
the Wrong Code First</a> for the justification), I'd write the function I
needed, like <code>row()</code>, which creates a new Bootstrap row or
<code>maincontent()</code> which creates the main content area of the page.
Then I'd add that function to the <code>%exports</code> hash and everything
would work.</p>
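
<p>A minimal sketch of that hand-maintained approach, assuming two one-line
helpers and using a plain hash as a stand-in for the TT2 stash (the helper
bodies and the <code>span8</code> class name are illustrative, not the real
plugin code):</p>

<pre><code>use strict;
use warnings;

# Hypothetical stand-ins for two of the plugin's helper functions.
sub row         { return qq{&lt;div class="row"&gt;@_&lt;/div&gt;} }
sub maincontent { return qq{&lt;div class="span8"&gt;@_&lt;/div&gt;} }

# The hand-maintained list: every new helper needs a matching entry here.
my %exports = (
    row         =&gt; \&amp;row,
    maincontent =&gt; \&amp;maincontent,
);

# A plain hash stands in for the TT2 stash in this sketch.
my %stash;
while (my ($name, $ref) = each %exports)
{
    $stash{$name} = $ref;
}

print $stash{row}-&gt;( $stash{maincontent}-&gt;( 'Hello' ) ), "\n";</code></pre>

<p>Adding a third helper means writing the function <em>and</em> remembering to
add a line to <code>%exports</code>.</p>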

<p>After the sixth function, keeping that list up to date was tedious. Then I
kept forgetting it. After all, any time you have to update the same data in two
places, you're doing something wrong.</p>

<p>Now the code looks more like:</p>

<pre><code>sub row <strong>:Export</strong>
{
    return &lt;&lt;END_HTML;
&lt;div class="row"&gt;
    @_
&lt;/div&gt;
END_HTML
}</code></pre>

<p>... with a single code attribute marking those functions which I want to
stuff into the template stash. I've used <a
href="http://search.cpan.org/perldoc?Attribute::Handlers">Attribute::Handlers</a>
before, but I always end up reading the manual and playing with things to get
them to work correctly. (Something about the way you have to write another
package and inherit from it to get your attributes to work correctly always
confuses me.)</p>

<p>My second attempt lasted no longer than ten minutes. I switched to <a href="http://search.cpan.org/perldoc?Attribute::Lexical">Attribute::Lexical</a>. This is almost as trivial to use as to explain:</p>

<pre><code>use Attribute::Lexical 'CODE:Export' =&gt; \&amp;export_code;</code></pre>

<p>Whenever any function has the <code>:Export</code> attribute, Perl will call
my <code>export_code()</code> function:</p>

<pre><code>use Sub::Identify ();

my %exports;

sub export_code
{
    # the first argument is a reference to the function with the attribute
    my $referent = shift;
    my $name     = Sub::Identify::sub_name( $referent );

    return unless $name;
    $exports{$name} = $referent;
}</code></pre>

<p>The first argument to this function is a reference to the exported function.
I use <a href="http://search.cpan.org/perldoc?Sub::Identify">Sub::Identify</a>
to get the name of the function reference. (That wouldn't work for anonymous
functions, but I can control that here.) Then I store the name of the function
and the function reference in a hash.</p>

<p>It took as long to write as it does to explain.</p>

<p>A lot of people dislike the use of attributes. Used poorly, they create
weird couplings and plenty of action at a distance.
<code>Attribute::Handlers</code> can be confusing.</p>

<p>I like to think that I'm using attributes well here (even if I'm abusing TT2
more than a little), and that they've simplified my code so that I can avoid
repeating myself and performing manual busywork that I'm likely to forget. Even
better, the code to use them isn't magical at all: it's all hidden behind the
pleasant interfaces of <code>Attribute::Lexical</code> and
<code>Sub::Identify</code>.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Write the Wrong Code First</title>
    <link rel="alternate" type="text/html" href="http://www.modernperlbooks.com/mt/2012/05/write-the-wrong-code-first.html" />
    <id>tag:www.modernperlbooks.com,2012:/mt//1.447</id>

    <published>2012-05-09T18:37:54Z</published>
    <updated>2012-05-09T20:07:24Z</updated>

    <summary><![CDATA[I rewrite code often. If I were a better programmer, designer, or businessman, I would rewrite my code much less frequently&mdash;but I get things wrong about as often as I get them right. Even with years of practical experience, software's...]]></summary>
    <author>
        <name>chromatic</name>
        <uri>http://www.wgz.org/~chromatic</uri>
    </author>
    
    <category term="softwaredevelopment" label="software development" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.modernperlbooks.com/mt/">
        <![CDATA[<p>I rewrite code often.</p>

<p>If I were a better programmer, designer, or businessman, I would rewrite my
code much less frequently&mdash;but I get things wrong about as often as I get
them right. Even with years of practical experience, software's still too
difficult to predict with any degree of accuracy.</p>

<p>As a case in point, I've been revising some financial software in the past
week. In reviewing the calculations, I found a way to simplify them
dramatically.  Even better, these simplifications allow me to simplify the
interface and user experience.</p>

<p>That means rewriting a lot of code. That means throwing out code and
revising the storage model and making a lot of changes.</p>

<p>I'm fortunate to have a good test suite that runs in 15 to 20 seconds and
lets me know that everything I most need to work continues to work. That's a
lot of confidence. People who like to talk about test-driven development and
refactoring tout this as one of the benefits of well-tested software: you can
refactor with confidence.</p>

<p>I'm not refactoring. I'm throwing away parts of this application and adding
others. I'm changing how it behaves. Even though my test suite helps, that's
not refactoring.</p>

<p>As part of this project, I've added an SVG graph to a class of web pages. I
started by creating the SVG in Inkscape. Then I exported it as plain SVG. Then
I made a template for that SVG to include from the page template.</p>

<p>That was still the example SVG with sample data, still the proof of
concept.</p>

<p>I then extracted one piece of hard-coded data and made it a templated value.
One. Everything still worked. Then I extracted the second piece of data and so
on.</p>

<p>It's one step at a time. It's one change at a time. I'm using Git, so I
could even commit after every single change, no matter that it's a few
characters or even merely changing the color of a bar in the graph. I can work
in steps as small and discrete as possible, and then squash them into one big
commit or rewrite them into functional units, or do whatever I want with
them.</p>

<p>That's the same principle behind test-driven development (or test-driven
design or even behavior-driven development, if you need to hang a new name on
the same idea). Do one thing at a time. Make your code do a little more of what
it needs to do. Prove that it all hangs together, that it all works, that it
does what you intended.</p>

<p>Then clean up a little bit. That's refactoring, in your code and in your
tests. That's rebasing in Git.</p>

<p>Sure, I wish I could know exactly what I needed to write from the start. I
wish sometimes that programming were mere transcription of the voice of an
ephemeral muse (though I find it difficult to imagine a muse dictating Perl or
JavaScript or Haskell or J aloud). I wish I were the Beethoven of programming
(without the mercurial temperament and the hearing loss).</p>

<p>Usually I don't get things right from the start. Fortunately, with a little
discipline and the willingness to work in small steps, to erect and replace the
scaffolding as I go, I usually get a lot closer to the right code than if I
guessed.</p>

<p>Maybe that means I've thrown out more code than I've written. (It's satisfying to delete unused code, after all.) Maybe any project which starts as a proof of concept, then has to pivot in other directions to do what it's always needed to do always becomes a <a href="http://faculty.washington.edu/smcohen/320/theseus.html">Ship of Theseus</a>.</p>

<p>I'm okay with that. It's more important to me to create something useful and
then make it right than to wait on getting it right before other people can
find value in it. I may never write the right code from the start, but I
believe I can make almost-right code much, much more right, with discipline and
care and feedback.</p>]]>
        
    </content>
</entry>

<entry>
    <title>NYTProf, File IO, and an Optimization Gone Awry</title>
    <link rel="alternate" type="text/html" href="http://www.modernperlbooks.com/mt/2012/05/nytprof-file-io-and-an-optimization-gone-awry.html" />
    <id>tag:www.modernperlbooks.com,2012:/mt//1.446</id>

    <published>2012-05-07T21:56:41Z</published>
    <updated>2012-05-07T23:03:01Z</updated>

    <summary>One of my projects performs a lot of web scraping. Once every n units of time (where n can be days or weeks), a batch process fetches several web pages and extracts information from them. It&apos;s a problem solved very...</summary>
    <author>
        <name>chromatic</name>
        <uri>http://www.wgz.org/~chromatic</uri>
    </author>
    
    <category term="cpan" label="cpan" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="modernperl" label="modern perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="perl" label="perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="profiling" label="profiling" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="softwaredevelopment" label="software development" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.modernperlbooks.com/mt/">
        <![CDATA[<p>One of my projects performs a lot of web scraping. Once every <em>n</em>
units of time (where <em>n</em> can be days or weeks), a batch process fetches
several web pages and extracts information from them. It's a problem solved
very well.</p>

<p>I designed this system around the idea of a pipeline of related processes,
where each component is as independent and idempotent as possible. This has
positives and negatives; it's an abstraction like any other.</p>

<p>I initially wrote the "fetch remote web page" and "analyze data from that
page" as a single step, because I thought "analyze" was the main goal and
"fetch" was a dependent task. I separated them a couple of weeks ago to
simplify the system: analysis now expects data to be there, while fetching can
be parallel on a single machine or across several. (Testing the analysis step
is also much easier because feeding in dummy data is now trivial.)</p>

<p>I use the filesystem as a cache for these fetched files. That's easy to
manage. I modified the role I use to grab data for the analysis stage to look
in the cache first, then fall back to a network request. That was easy too. The
<code>get_formatted_data_for_analysis()</code> method looked something like:</p>

<pre><code>sub get_formatted_data_for_analysis
{
    my ($self, $type, $key) = @_;

    my $cached_path         = $self-&gt;get_cached_path( $type, $key );
    if (-e $cached_path)
    {
        my $text = read_file( $cached_path );
        return $self-&gt;formatter-&gt;format_string( $text ) if $text;
    }

    return $self-&gt;formatter-&gt;format_string( $self-&gt;fetch_by_url( $type, $key ) );
}</code></pre>

<p>I thought I was done. This trivial caching layer took five minutes to write and gave my project a lot of flexibility.</p>

<p>I thought this would speed up the processing stage, because I was able to
make the fetching stage embarrassingly parallel so that more than one fetch
could block on network IO simultaneously. My rough benchmark didn't show any
speed improvement, but it was fast enough, so I moved on.</p>

<p>On Friday I decided to profile the slowest stage of the application with <a
href="http://search.cpan.org/perldoc?Devel::NYTProf">Devel::NYTProf</a>. The
slowest stage was the processing stage. I isolated it so that it performed no
network fetching. It was still slow.</p>

<p>One of the formatter modules used to extract data from web pages is <a
href="http://search.cpan.org/perldoc?HTML::FormatText::Lynx">HTML::FormatText::Lynx</a>.
It allows me to run <code>lynx --dump</code> to strip out all of the HTML and
other formatting of a document. The formatter allows you to pass in the name of
a file or the contents of a file as a string.</p>

<p>For some reason, most of the time in the processing stage in the profile was
spent in file IO. That wasn't too surprising; these aren't all small files and
there may be thousands of them. I dug deeper.</p>

<p>Most of the time in the processing stage in the profile was spent in reading
the files in my method and reading files in the formatter&mdash;reading files,
even though I was passing the contents of those files to the formatter as
strings.</p>

<p>I poked around at a few other things, but came back to the source code of
the formatter. A comment in <a
href="http://search.cpan.org/perldoc?HTML::FormatExternal">HTML::FormatExternal</a>
says:</p>

<blockquote><code>format_string()</code> takes the easy approach of putting the
string in a temp file and letting <code>format_file()</code> do the real work.
The formatter programs can generally read stdin and write stdout, so could do
that with <code>select()</code> to simultaneously write and read
back.</blockquote>
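
<p>A rough sketch of the pattern that comment describes, not the module's
actual code: <code>Sketch::Formatter</code> is a hypothetical class, and its
<code>format_file()</code> strips tags with a crude regex where the real
formatter runs an external program.</p>

<pre><code>use strict;
use warnings;
use File::Temp ();

package Sketch::Formatter;

# Stand-in for the real work, which would hand the file to a program
# such as lynx. Here the file is read and its tags crudely stripped.
sub format_file
{
    my ($class, $path) = @_;

    open my $fh, '&lt;', $path or die "Cannot read '$path': $!";
    my $html = do { local $/; &lt;$fh&gt; };
    close $fh;

    $html =~ s/&lt;[^&gt;]+&gt;//g;
    return $html;
}

# The "easy approach": write the string to a temp file, then delegate.
sub format_string
{
    my ($class, $html) = @_;

    my $tmp = File::Temp-&gt;new( SUFFIX =&gt; '.html' );
    print {$tmp} $html;
    close $tmp or die "Cannot write temp file: $!";

    return $class-&gt;format_file( $tmp-&gt;filename );
}

package main;

print Sketch::Formatter-&gt;format_string( '&lt;p&gt;Hello, &lt;em&gt;world&lt;/em&gt;&lt;/p&gt;' ), "\n";</code></pre>

<p>A caller that reads a cached file into a string and passes it to a
<code>format_string()</code> built this way causes the same data to be read,
written back to disk, and read again.</p>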

<p>In other words, all of the work I was doing to read in files was busy work,
duplicating what the formatter was about to do anyway. (Okay, I stared at the
code for a couple of minutes, thinking about various approaches of rewriting it
and submitting a patch or monkey patching it. Then I turned lazier and wiser.)
I rewrote my code:</p>

<pre><code>sub get_formatted_data_for_analysis
{
    my ($self, $type, $key) = @_;

    my $cached_path         = $self-&gt;get_cached_path( $type, $key );
    return $self-&gt;formatter-&gt;format_file( $cached_path ) if -e $cached_path;

    return $self-&gt;formatter-&gt;format_string( $self-&gt;fetch_by_url( $type, $key ) );
}</code></pre>

<p>The result was a 25% performance improvement.</p>

<p>Three things jumped out at me in this process. First, how nice is it to have
a working tool like NYTProf and a community that distributes source code, so
that I could examine the whole stack of my application to isolate performance
problems? Second, how interesting that an assumption and an admitted shortcut
in a dependency could have such an effect on my own code. Third, how much more
I like my new code with all of the file handling gone; pushing that
responsibility elsewhere is a nice simplification even without the performance
improvement.</p>

<p>Perhaps the two tools I miss most from my C programming days are
Valgrind/Callgrind and KCachegrind, but NYTProf goes a long way toward filling
that gap. Besides, I'm at least 20 times more productive with a language like
Perl.</p>
]]>
        
    </content>
</entry>

</feed>

