Perl 5's Unicode Flag Day

Managing Unicode properly isn't exactly easy even in 2011.

Perl 5.14 makes Unicode somewhat easier with the optional unicode_strings feature, but you have to enable it explicitly, and you can only handle external data correctly if you know the intent of that external data.
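As a minimal sketch of what the feature changes (assuming Perl 5.12 or newer, with variable names invented for illustration), the same one-byte string uppercases differently depending on whether unicode_strings is enabled in the enclosing scope:

```perl
use strict;
use warnings;

# "\xE9" is e-acute in Latin-1; Perl stores this literal as a single byte.
my $e_acute = "\xE9";

my ($without, $with);
{
    no feature 'unicode_strings';     # legacy byte semantics in this scope
    $without = uc $e_acute;           # unchanged: only ASCII case rules apply
}
{
    use feature 'unicode_strings';    # Unicode semantics in this scope
    $with = uc $e_acute;              # becomes "\xC9", E-acute
}

printf "without unicode_strings: %02X\n", ord($without);   # E9
printf "with unicode_strings:    %02X\n", ord($with);      # C9
```

The feature is lexically scoped, which is why you have to ask for it in every file or block that wants the new behavior.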

(One of the small details I like in the book Gravitas, published by my company, is the documentation of the main character's struggles in one chapter with a Unicode bug exacerbated by one too many assumptions about characters versus bytes in his project's ORM. Art imitates life as satire.)

Tom Christiansen's Why Does Modern Perl avoid UTF-8 By Default missive is classic tchrist—clever, articulate, detailed, and a wave of text which crashes over the unsuspecting like a sneaker wave with a sinister undertow. If you're not careful, it'll lead you in a direction you never suspected.

You can see this when a smart person such as Nelson Minar claims that "Perl 5 can't handle Unicode properly". Aristotle caught his attention and Nelson offered a respectful retraction...

... but be careful not to miss the main point.

Handling Unicode appropriately is difficult, even in 2011. Many of Tom's very valid points are repeated reinforcements of the notion that your software, my software, everyone's software makes several assumptions about what incoming and outgoing data means. When those assumptions are wrong, you get bugs.
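To make the cost of a wrong assumption concrete, here is a small sketch using the core Encode module. The byte string and variable names are invented for illustration; the point is only that the same incoming bytes mean different things depending on which encoding you assume:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Five bytes as they might arrive from a file or socket:
# "cafe" with an acute accent on the e, encoded as UTF-8.
my $bytes = "caf\xC3\xA9";

# Right assumption: the bytes are UTF-8, so decoding yields four characters.
my $chars = decode('UTF-8', $bytes);

# Wrong assumption: reading the same bytes as Latin-1 yields five characters,
# with the accented letter split into two unrelated ones.
my $wrong = decode('ISO-8859-1', $bytes);

# On the way out, encode explicitly so the receiver gets the bytes it expects.
my $out = encode('UTF-8', $chars);

printf "bytes=%d chars=%d wrong=%d\n",
    length($bytes), length($chars), length($wrong);   # bytes=5 chars=4 wrong=5
```

Nothing in the data itself says which decoding is right. That knowledge has to come from outside the program, which is exactly the kind of assumption at issue.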

If 14 May 2011 (the release date of Perl 5.14) had been Perl 5's Unicode flag day, such that perl assumed that all incoming data and all outgoing data were Unicode unless explicitly marked otherwise, Perl 5 programmers and users alike would discover exactly how many assumptions we've made. Some of them we can fix easily. Some of them we can't. Some of them require further fixes to the Perl 5 core itself, and some of them require operating system vendors and distributors to fix their own software.

This job isn't easy and it won't be quick.

I'm all for making progress and for making painful changes to improve the present and future for present and future programmers, but the benefits have to outweigh the costs. Right now, they don't. Hopefully that day will come soon.

If you would like to enable UTF-8 everywhere in your Perl 5 programs, see Mike Doherty's utf8::all.
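If you'd rather not install a CPAN module, the following is a rough approximation of that kind of setup using only core pragmas and modules. It is a sketch, not utf8::all's actual implementation, and it assumes your source file, your standard streams, and your command-line arguments are all UTF-8:

```perl
use strict;
use warnings;
use utf8;                              # this source file is UTF-8
use open qw(:std :encoding(UTF-8));    # STDIN/STDOUT/STDERR and newly opened files default to UTF-8
use Encode qw(decode);

# Command-line arguments arrive as bytes; decode them into characters.
my @args = map { decode('UTF-8', $_) } @ARGV;

# Under "use utf8", this literal is four characters, not five bytes.
my $word = "café";

print "$word has ", length($word), " characters\n";
```

Each line encodes one assumption about the outside world. If any of them is wrong for your environment, you get the bugs described above.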

About this Entry

This page contains a single entry by chromatic published on June 8, 2011 1:25 PM.
