January 2013 Archives

Why Unicode Normalization Matters


Summary: if you haven't read the Perl Unicode Cookbook yet, you're not ready to handle text in the 21st century.

Because I have some experience with writing automated tests for software, I have seen plenty of ways in which software can fail. (If you want to develop a healthy paranoia, start writing tests for the bugs you find. If you want to develop an unhealthy paranoia, keep a list of the categories of bugs you find, and look for those when you're writing or editing tests.)

One of my projects has a multinational component with lots of international use. I've spent a lot of time working on its search features, because that's where the project provides most of its value. A couple of months ago I read through the code and thought about all the ways things could go wrong, and realized that we had a severe bug that no one would want to debug when reported.

Part of the search feature allows you to search for entities by name. I worked on a wildcard search, where users provide part of the name and the database works out the rest. That's all well and good, until you start thinking about "Wait, does capitalization matter?" Then I forced everything to lowercase.

Then I became really paranoid.

We already have entity names in our database with non-ASCII characters. (We're fortunate enough to be able to stick with UTF-8, but it took a couple of days to work through all of the details to handle UTF-8 correctly.)

One of the problems with the naïve "let's just lowercase everything" approach is that some Unicode characters don't lowercase the way you expect. Tom's case-insensitive comparisons recipe goes into more detail.
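A minimal sketch of the difference, assuming Perl 5.16 or newer (where `fc` first appeared). The stored name and the query are made-up examples, not data from the project:

```perl
use v5.16;        # enables strict, the fc() builtin, and unicode_strings
use warnings;

my $stored = "Stra\x{df}e";    # hypothetical stored name, "Strasse" spelled with U+00DF (sharp s)
my $query  = "STRASSE";        # what a user might type in all caps

# lc() has no single-character lowercase mapping that turns sharp s into "ss"
my $lc_match = lc($stored) eq lc($query) ? 1 : 0;

# fc() applies full Unicode case folding, which maps sharp s to "ss"
my $fc_match = fc($stored) eq fc($query) ? 1 : 0;

print "lc match: $lc_match\n";    # prints "lc match: 0"
print "fc match: $fc_match\n";    # prints "fc match: 1"
```

Whether the database applies the same folding on its side is a separate question that this sketch doesn't answer.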

I said "Wait, wait. We need to be sure we're using Perl 5.16 as soon as possible so that we can use the correct case in our searches." (Then I started doing research about whether PostgreSQL handles casefolding properly and felt sad for a while, because I couldn't prove that it did the right thing.)

Then I felt even more paranoid.

Suppose you work for a consulting shop called "Naïve about Umlauts" and you want users to be able to search for you by typing "naïve" in our search box. If our software is as naïve about Unicode as your company is about diacritics, some users might get results while others won't. It all depends on how they type the query and what their software sends to our server.

Here's a fun fact about Unicode: you can represent the same character (i with an umlaut) with multiple codepoints. It can be a single codepoint (lowercase Latin I with a diacritic, or \x{ef} in Perl terms) or two codepoints (the lowercase Latin I and a combining diacritical mark, or \x{69}\x{308} in Perl terms).
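To see the mismatch concretely, here is a small sketch using the two spellings of "naïve" described above. The variable names are only for illustration:

```perl
use strict;
use warnings;

my $precomposed = "na\x{ef}ve";      # U+00EF: one codepoint for the accented i
my $decomposed  = "nai\x{308}ve";    # U+0069 followed by combining U+0308

my $same    = $precomposed eq $decomposed ? 1 : 0;
my @lengths = (length $precomposed, length $decomposed);

print "equal: $same\n";              # prints "equal: 0"
print "lengths: @lengths\n";         # prints "lengths: 5 6"
```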

Because Unicode is just a series of numbers when the computer really gets down to looking at strings, these two strings look different to the computer even though anything aware of UTF-8 ought to render them the same way. (Imagine responding to bug reports with "Well, how did you type it? Don't do it that way next time." Good luck.)

If only Unicode had some way of representing text in a canonical form you could use to sort and search and compare—fortunately, it does. Unicode normalization offers several standard representations which you can use to solve just this problem. Throw a little Unicode::Normalize into place (always decompose and recompose Unicode data at the boundaries of your application) and you won't lose years of your life chasing down weird bugs (at least for those of your projects where you're fortunate enough to use a modern Perl or another language with working Unicode support).

In my experience, the NFC Unicode normalization form is most effective.
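As a sketch of what normalizing at the boundary buys you, the two spellings of "naïve" from earlier compare equal once both pass through the same normalization form. Unicode::Normalize ships with Perl; the variable names here are illustrative:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "na\x{ef}ve";      # single codepoint U+00EF
my $decomposed  = "nai\x{308}ve";    # "i" plus combining U+0308

# Either form works for comparison, as long as both sides use the same one
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;

# NFC recomposes, so the decomposed input shrinks back to five characters
my $nfc_length = length NFC($decomposed);

print "NFC equal: $nfc_equal, NFD equal: $nfd_equal, NFC length: $nfc_length\n";
```

In practice you would normalize once when data enters the system (before storing a name, before running a query) so that everything inside compares consistently.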

When you work with a relational database, sometimes writing safe and correct programs requires you to bundle up several database changes into a single transaction which you can apply or reject atomically.

In DBIx::Class, that's through the txn_do method, which looks something like:

    $schema->txn_do(sub {
        # make changes here
        # make other changes here
    });

Pass an anonymous function to this method. The method will start a transaction, invoke the anonymous function, then commit or roll back the transaction depending on the success or failure of the anonymous function. It's a good pattern, because the anonymous function can do arbitrary things; the transactional wrapper imposes very few requirements on it.
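The shape of that wrapper can be sketched in a few lines. This is not DBIx::Class's implementation; `My::FakeDB` and `in_txn` are hypothetical names, and the fake handle only records which calls it received:

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for a database handle; it only records calls
    package My::FakeDB;

    sub new      { return bless { history => [] }, shift }
    sub begin    { push @{ $_[0]{history} }, 'begin' }
    sub commit   { push @{ $_[0]{history} }, 'commit' }
    sub rollback { push @{ $_[0]{history} }, 'rollback' }
    sub history  { return @{ $_[0]{history} } }

    # Begin, run the callback, then commit on success or roll back on failure
    sub in_txn {
        my ($self, $code) = @_;

        $self->begin;
        my @result;
        my $ok = eval { @result = $code->(); 1 };

        unless ($ok) {
            my $error = $@;
            $self->rollback;
            die $error;
        }

        $self->commit;
        return @result;
    }
}

my $db = My::FakeDB->new;

$db->in_txn(sub { return 'view replaced' });                 # succeeds: commit
eval { $db->in_txn(sub { die "view creation failed\n" }) };  # fails: rollback

print join(' ', $db->history), "\n";   # prints "begin commit begin rollback"
```

The wrapper knows nothing about what the callback does, which is why the pattern imposes so few requirements.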

(PSGI follows a similar pattern.)

In a recent commit on a client project, I wanted to add transactional semantics to a DDL change. In effect, if creating a custom database view failed, I wanted to roll back the change without replacing an existing view. If replacing an existing view succeeded, I wanted to commit it.

I could have used the anonymous function pattern, but I chose a different approach:

    sub run_in_transaction {
        my ($self, $method_name, $args) = @_;

        local $@;
        my $status = eval { $self->$method_name( @$args ); $self->commit; 1 };

        return $status if $status;
        die $@;
    }

Instead of wrapping up the other code in a transactional block, to use this method you pass the name of another method in the same class to call along with an array reference of arguments to pass to that method.
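A usage sketch, with loud assumptions: the class name `My::Schema::Updater`, its `replace_view` method, and its counting `commit` are all invented here so the example runs on its own. Only `run_in_transaction` comes from the code above:

```perl
use strict;
use warnings;

{
    package My::Schema::Updater;    # hypothetical class name

    sub new    { return bless { committed => 0, views => [] }, shift }
    sub commit { $_[0]{committed}++ }    # stand-in: just counts commits

    # Hypothetical DDL method; refuses to run without a view definition
    sub replace_view {
        my ($self, $name, $sql) = @_;
        die "no definition for view '$name'\n" unless $sql;
        push @{ $self->{views} }, $name;
    }

    # The method from the text above
    sub run_in_transaction {
        my ($self, $method_name, $args) = @_;

        local $@;
        my $status = eval { $self->$method_name( @$args ); $self->commit; 1 };

        return $status if $status;
        die $@;
    }
}

my $updater = My::Schema::Updater->new;

# Method name first, then an array reference of its arguments
my $ok = $updater->run_in_transaction(
    replace_view => [ 'active_users', 'SELECT 1' ]
);

# A failing call rethrows, and commit is never reached
my $failed = eval { $updater->run_in_transaction( replace_view => ['broken'] ); 1 } ? 0 : 1;

print "ok: $ok, failed: $failed, commits: $updater->{committed}\n";
```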

Is this easier to use? In some circumstances, but it also has its limitations. It only works for methods on the same object. It only works for single statements. When those conditions are true, I think this code is clearer than the anonymous function approach.

Will it last in our code after a few more rounds of refactorings and revisions? I don't know—but it solved the DDL update problem elegantly enough for now.

DBIC can't get away with anything this simple, because it needs to provide a consistent and coherent solution without imposing artificial limitations. This code has no need for that generality yet. I could just as well have used the anonymous function, and I may in the future, but for now this satisfies my needs. It's just another technique which you may find useful in specific circumstances.

How Forking Perl 5 Could Work


Stevan Little (the man behind Moose) gave a talk at the Orlando Perl Workshop called Perl is not Dead, it is a Dead End. The talk culminated with an announcement of an experimental reimplementation of the useful parts of Perl 5 in Scala, a project called Moe.

This is not the first fork or pseudo-fork of Perl 5. The Topaz project eventually begat Perl 6, which begat Parrot (the way I understand Parrot today is that it's what you get if you try to write a virtual machine in C to run Perl 5.006 effectively, and then check in a lot of suboptimal code hoping that magical elves will somehow coalesce out of the luminescent aether to fix your mess). Later on came Kurila which threw out some of the awkward parts of Perl 5 (and some of the useful parts) in an attempt to gain more speed. Lately, something called Perl 11 looks like an attempt funded by cPanel to add optional optimization levels to pre-5.10 Perls. (I suspect the people behind Perl 11 will object to this characterization, but I find the lack of specificity frustrating.)

Now comes Moe.

Will it succeed? I don't know. Before anyone can address that question responsibly, you must understand what a fork can or should fix.

What's Wrong with Perl 5?

Perl 5 has two main problems: its implementation and its implementation. Any time you think you've found another problem, look deeper and you'll discover it's one of those two problems (and very likely "its implementation" is the culprit).

The Perl 5 VM is a big wad of accreted code that, in some places, traces its lineage back to Perl 1. It's exactly what you'd expect from code written in the era where "make it fast" meant "Write it in C, come hell or high water" and it shows. Where a Smalltalk might be implemented in something like Slang or PyPy in RPython, Perl 5 doesn't do that.

That choice almost certainly made sense in 1993. By the time of Topaz in 1999, C made less sense. By the time of Parrot in 2001, C made little sense. In 2013, C makes almost no sense.

Bear in mind that my argument is "Writing the whole thing in C makes little sense for a language larger than Lua".

Why moe Might Fail

"Almost as good as Perl 5" isn't that compelling. If I wanted to use a faster and better language with worse deployment and fewer libraries, I'd write more Haskell.

Perl 5 has no language designer with singular vision and the time and taste and experience to shape the language in a cohesive gestalt. (That's probably the worst part about Perl 6—it took Larry's attention away from a working product to something which so far has not delivered anything usable in and of itself.)

Technical reasons, like "Wow, the JVM isn't really a good platform for a Perl!" or "The subset of Perl 5 that's practical to implement is basically Groovy and that already exists." I'm not predicting these specific cases, mind you. I offer them as examples of technical reasons which may exist. (Even though I suspect the JVM really isn't a good platform for a Perl.)

Social reasons, like "This is more work than we thought, and it depends entirely on volunteer labor." (That excuse worked for Perl 6 for a while, between the time TPF stopped giving $50k grants and the time the Ian Hague "Get this in a usable state in the next couple of years to attract more grant money!" grants didn't achieve their goals.)

Can moe Deliver?

It's possible moe can work. It has to avoid two traps:

  • Falling into a local maximum because of the limitations of the underlying technology. (It would be mean of me to call this the Rakudo-on-other-VMs Trap, so I won't.) For example, it is so exceedingly difficult to implement a language with decent performance when the semantics of how you use memory and where you get that memory and how you release it and when you release it are different from the assumptions that the VM and its optimizer and any JIT and tooling and extensions expect that you would have to be the combination of Michael Abrash, Mike Pall, John Carmack, Cliff Click, and quite possibly Pablo Picasso to make it work well across multiple VMs.
  • Ossifying into something that can't change before it produces sufficient utility. (It would be mean of me to call this the Parrot Support Policy trap, but as someone who argued both sides of that support policy at various times, it's one of the most important reasons why Parrot and Rakudo locked into their whirlpool of mutually irrelevant destruction.) The best general use projects I've seen have found themselves extracted from specific situations only at the point where the specific project can support the necessary generalization and where there's enough external knowledge about the needs of the extracted process that such generalization is possible.

In other words, it's a mistake to commit to supporting internal details until you're certain that those internal details will remain in place without reducing or removing your ability to make necessary changes for future improvements. Both Perl 5 and Parrot fell into this trap.

What moe Could Produce

If I were to implement a language now, I'd write a very minimal core suitable for bootstrapping. (Yes, I know that in theory this is what NQP or whatever it's called these days in Rakudo is supposed to provide, but unless NQP has changed dramatically, that's not what it is.) Think of a handful of ops. Think very low level. (Think something a little higher than the universal Turing machine and the lambda calculus and maybe a little bit more VMmy than a good Forth implementation, and you have it.)

If you've come up with something that can replace XS, stop. You're there. Do not continue. That's what you need.

Then I'd figure out some sort of intermediate tree- or graph-based structure suitable to represent the language as I'd like to implement it. (Rakudo has a decent implementation here. If it had gone into Parrot five or six years ago, the world would be a different place.)

Then I'd produce a metaprogramming system, something of a cross between Moose's MOP and the Perl 6 metamodel. (Rakudo gets this very right. Credit where it's due. If Parrot had managed to adopt this in 2011... well, that's a long rant for another time.)

With those three things, I believe it's possible to build all of the language you need. If you're clever—if you're careful—you can even produce a system where it's possible to create and modify a sublanguage through metaprogramming but limit the scope of those sublanguages to specific lexical scopes in your system. (That idea is probably the idea that Perl 6 the language gets most correct. It's also very Lispy, in the sense that it's tractable in Lisp, but fortunately for the rest of us programmers, Perl actually has syntax.)

Figuring out a bytecode system is mostly irrelevant. (Freezing bytecode too early cost Parrot a lot of time and energy.) Figuring out a replacement for XS is essential (everyone says that Perl 5's parser is the biggest barrier to alternate implementations, but the utter dependence of the CPAN on XS and the haunted Jenga horrors of XS and the Perl 5 internals is the biggest barrier to adoption of alternate implementations).

Breaking the dependence of the CPAN on the Perl 5 internals—even in a small way—while allowing the evolution of the language and the annealing of the implementation over the specification toward completeness (annealing in the AI sense, not necessarily metallurgy) may be a viable path to producing a Perl 5.20 which allows optional backwards compatibility if you want it and usable new features if you need them.

Notice that this plan ties the implementation of an alternate Perl 5 to no one specific backend. I suspect that RPython will demonstrate that it's workable, while I'm tempted to suggest that LuaJIT has possibilities. (Again, I think it's 70% likely that the JVM and the CLR will prove themselves workable for the first half of implementation and completely bizarro-land useless for the second 80%, but Rakudo will demonstrate that soon enough.) I don't know about JavaScript VMs.

Is this worth doing?

Hey, I've paid my dues trying to implement a modern VM and modern Perl implementation.

The real question is whether an alternate implementation of Perl 5 can demonstrate its value for real programs before Booking.com's money runs out keeping Perl 5 on life support. It's clear that the current Perl 5 implementation will never get the kind of attention it needs to introduce it to the 21st century, and it's pretty clear that no Perl 6 implementation right now is anything other than years away from being useful, let alone interoperating with a Perl 5 implementation.

I don't think Perl 5 is flirting with irrelevance. I do think that every year that goes by with Perl 5 ossifying further in implementation makes it less likely that the necessary changes will happen. (The code doesn't get much cleaner, the likely implementers get busier and less interested, and the fashion-driven Silicon Valley marketing machine keeps vomiting out new trends you absolutely must adopt right now, you creeping dinosaur, which is a distraction of sorts.)

If the CPAN has proven anything, it's that one size doesn't always fit every program. Maybe the p5p model of trying to please everyone (and generally only pleasing sysadmins with Perl 4 code they reluctantly last updated in 1993 only because they started it in 1987) doesn't fit all either, and maybe an alternate implementation of Perl 5 will produce a viable model to reinvent Perl 5's internals.

Another year (another month), another set of vapid news posts that proclaim language $x or platform $y has won, whatever that means, for the latest astrological milestone.

Sure, it's fun to treat programming and technology as a horse race, where someone must win and someone must lose, but if you're in the business of solving real problems to help real people do important things (I'm in the business of just that), there comes a time at which you have to decide between inflating the stats to edge ahead in that horse race or getting things done.

(Someone more cynical than me might suggest that you consider who promotes certain technologies and the horse race mentality and what they want you to spend your money on—consulting fees, books, conferences, collaboration services they just so happen to sell, their venture capital contests—but I'm not that cynical, so forget I just typed that.)

For example, as much as you hear that HTML 5 is the future, that all applications will be mobile, running with JavaScript or some sort of compatibility shim at the worst, consider how many microprocessors shipped last year, microprocessors which get programmed with one of assembly, C, or C++. These run your car, your microwave, your phone, your electric toothbrush, your appliances, your watch, almost everything.

These billions of devices don't show up in a Google Trends search because they're ubiquitous. That is popularity, not what shows up most in Hacker News threads.

By all means talk about the wonderful things you're building to solve real problems for paying customers. (I hear there's an office full of people in Taiwan whose lives I make a little bit easier several times a week. Sometimes it's Perl. Sometimes it's a little bit of JavaScript, yes. Sometimes it's SQL. Sometimes it's a tiny shell script. Today I almost even wrote some C.)

Yet don't confuse the incessant sound and fury of the horse race and the propaganda it represents with actually doing things.

(Alternate title: While you were writing your own web server at the bare metal layer with Node.js, I saved one of my clients a million dollars.)


About this Archive

This page is an archive of entries from January 2013 listed from newest to oldest.
