A Collection of Silly Little Snippets


(See also A Modern Perl Fakebook.)

I needed to extract all hyperlinks from an HTML document today, and I needed to remove all markup except for the simplest formatting: paragraphs, emphasis, and bold. Any experienced Perl 5 programmer knows that multiple CPAN distributions exist for doing just this. You can choose whether you want an XS wrapper around an existing C library, or a smattering of regular expressions, or a DIY HTML parser, or a simple wrapper around an HTML parser.

You can know that, but that doesn't tell you how to do it.

I did my research and decided on HTML::Scrubber to remove markup. Its documentation suggests a strategy more complex than my current needs, so I eventually produced:

use HTML::Scrubber;

# Keep only simple formatting tags; everything else is stripped.
my $scrubber = HTML::Scrubber->new(
    allow => [qw( p br i u strong em hr )],
);
my $scrubbed = $scrubber->scrub( $content );
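To see the scrubber at work, here is a minimal standalone sketch. The sample markup in $content is made up for illustration, and the script assumes HTML::Scrubber is installed from CPAN. Tags outside the allow list should disappear while their text content stays.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Scrubber;

# Same allow list as the snippet above.
my $scrubber = HTML::Scrubber->new(
    allow => [qw( p br i u strong em hr )],
);

# Hypothetical input: a div and a hyperlink wrapped around allowed tags.
my $content = '<div><p>Hello <a href="http://example.com/">world</a>, '
            . '<em>friends</em>!</p></div>';

my $scrubbed = $scrubber->scrub($content);

# The div and a tags are removed; p and em survive.
print "$scrubbed\n";
```

If you need to keep specific attributes on the tags that remain, the module's documentation describes per-tag rules; a plain allow list was enough for this case.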

For extracting hyperlinks, I used HTML::LinkExtor and wrote:

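use HTML::LinkExtor;

sub get_links
{
    my ($self, $content) = @_;

    my $p = HTML::LinkExtor->new();
    $p->parse($content);
    $p->eof();    # signal end of input so the parser flushes its buffer

    my @links;

    # each entry is [ $tag, attribute_name => url, ... ]
    for my $link ($p->links())
    {
        my ($tag, %a) = @$link;

        # keep only anchors with an absolute http: URL
        next unless $tag eq 'a' && $a{href} && $a{href} =~ /^http:/;

        push @links, $a{href};
    }

    return \@links;
}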
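A short way to try this out is sketched below. It is a standalone variant (a plain function rather than a method, so there is no $self), and the sample document is invented for the example. It assumes HTML::LinkExtor is available, which ships with the HTML-Parser distribution on CPAN.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::LinkExtor;

# Standalone variant of get_links(), for illustration only.
sub get_links {
    my ($content) = @_;

    my $p = HTML::LinkExtor->new;
    $p->parse($content);
    $p->eof;    # signal end of input

    my @links;
    for my $link ( $p->links ) {
        # each entry is [ $tag, attribute_name => url, ... ]
        my ( $tag, %attr ) = @$link;

        # same filter as above: anchors with an absolute http: URL
        next unless $tag eq 'a' && $attr{href} && $attr{href} =~ /^http:/;
        push @links, $attr{href};
    }

    return \@links;
}

# Hypothetical sample document with a relative link and an image mixed in.
my $html = <<'HTML';
<p>See <a href="http://example.com/one">one</a>,
<a href="/relative">a relative link</a>,
<img src="http://example.com/pic.png">, and
<a href="http://example.com/two">two</a>.</p>
HTML

my $links = get_links($html);
print "$_\n" for @$links;
```

Only the two absolute anchor URLs should be printed; the relative link and the image source are skipped. Note that the /^http:/ filter also skips https: links, so widen the pattern if you want those too.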

I don't mind doing the research and customizing snippets like these for my specific needs, but I can imagine countless other people needing examples like this. If I weren't already convinced that the world needs a new resource for copy-and-paste examples in Modern Perl, I would be now.

(If you're still not convinced, consider how much more easily a novice could find these examples than write a correct and comprehensive regular expression for either case.)

6 Comments

If you ever create a repository of examples like that, it would be useful if the examples were coupled with tests and a way to specify dependencies, ensuring that they work correctly on the user's system. In my opinion many CPAN packages are really just glorified examples with the additional feature of having tests. I even blogged about this idea once.

There are plenty of snippet sites, and all of them categorize or tag by language. Why not submit these there?

Hm, didn't post this on the first try, trying again:

I agree that easily accessible snippets of such things, molded to accomplish common tasks, would be worthwhile.

I do not think, however, that another website is necessary for that. What you're asking for basically has a precedent on CPAN already. Take a look at LWP::UserAgent versus LWP::Simple. That's exactly what you're thinking about, isn't it? A simplified interface to the "jack-of-all-trades" module that comes pre-configured with a bunch of sane defaults for the most common cases.

If anything, I think it would be best to find the modules that would benefit most from having ::Simple versions and get to writing and releasing those. (In fact, that's what I'm doing right now with CGI::CRUD, even though it's turning more into a full-blown rewrite.)

A-ha! I think your preview function is broken. If I write a post, click preview, and then try to submit from the preview, it just returns me to the blog entry without doing anything. (Using Win32 Opera.)

@mithaldu - I agree that this could be done with CPAN, but an example is not the same thing as a simplified API. ::Simple modules are a great thing, and I have nothing against them, but we also need ::Example modules that would pack examples of how to use the API together with tests and would not change the API at all.

Hello,
I was just looking for modules to do this.

When I read the perldoc for HTML::Scrubber it said:

---
I wasn't satisfied with HTML::Sanitizer because it is based on HTML::TreeBuilder, so I thought I'd write something similar that works directly with HTML::Parser.
---
Could someone please elaborate a bit more on the problems with HTML::TreeBuilder or the advantages of HTML::Parser?

Does anyone have any experience to share about using HTML::Sanitizer vs. HTML::Scrubber?

--Pablo Marin-Garcia

About this Entry

This page contains a single entry by chromatic published on May 26, 2010 4:53 PM.
