Why Unicode Normalization Matters


Summary: if you haven't read the Perl Unicode Cookbook yet, you're not ready to handle text in the 21st century.

Because I have some experience with writing automated tests for software, I have seen plenty of ways in which software can fail. (If you want to develop a healthy paranoia, start writing tests for the bugs you find. If you want to develop an unhealthy paranoia, keep a list of the categories of bugs you find, and look for those when you're writing or editing tests.)

One of my projects has a multinational component with lots of international use. I've spent a lot of time working on its search features, because that's where the project provides most of its value. A couple of months ago I read through the code and thought about all the ways things could go wrong, and realized that we had a severe bug that no one would want to debug when reported.

Part of the search feature allows you to search for entities by name. I worked on a wildcard search, where users provide part of the name and the database works out the rest. That's all well and good, until you start thinking "Wait, does capitalization matter?" So I forced everything to lowercase.

Then I became really paranoid.

We already have entity names in our database with non-ASCII characters. (We're fortunate enough to be able to stick with UTF-8, but it took a couple of days to work through all of the details to handle UTF-8 correctly.)

One of the problems with the naïve "let's just lowercase everything" approach is that some Unicode characters don't lowercase the way you expect. Tom's case-insensitive comparisons recipe goes into more detail.
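Here's a minimal sketch of the difference, using the fc (casefold) builtin that arrived in Perl 5.16; the German sharp s is the classic example:

use v5.16;   # fc() arrived in 5.16
use utf8;

# lc() leaves the sharp s alone, so a lowercased comparison misses:
say lc("STRASSE") eq lc("Straße") ? "match" : "no match";   # no match

# fc() casefolds ß to "ss", which is what comparison actually needs:
say fc("STRASSE") eq fc("Straße") ? "match" : "no match";   # match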

I said "Wait, wait. We need to be sure we're using Perl 5.16 as soon as possible so that we can use correct casefolding in our searches." (Then I started doing research about whether PostgreSQL handles casefolding properly and felt sad for a while, because I couldn't prove that it did the right thing.)

Then I felt even more paranoid.

Suppose you work for a consulting shop called "Naïve about Umlauts" and you want users to be able to search for you by typing "naïve" in our search box. If our software is as naïve about Unicode as your company is about diacritics, some users might get results while others won't. It all depends on how they type the query and what their software sends to our server.

Here's a fun fact about Unicode: you can represent the same character (i with an umlaut) with multiple codepoint sequences. It can be a single codepoint (LATIN SMALL LETTER I WITH DIAERESIS, or \x{ef} in Perl terms) or two codepoints (LATIN SMALL LETTER I followed by COMBINING DIAERESIS, or \x{69}\x{308} in Perl terms).

Because Unicode is just a series of numbers when the computer really gets down to looking at strings, these two strings look different to the computer even though anything aware of UTF-8 ought to render them the same way. (Imagine responding to bug reports with "Well, how did you type it? Don't do it that way next time." Good luck.)

If only Unicode had some way of representing text in a canonical form you could use to sort and search and compare. Fortunately, it does: Unicode normalization offers several standard representations you can use to solve exactly this problem. Throw a little Unicode::Normalize into place (normalize Unicode data at the boundaries of your application, where it enters and leaves) and you won't lose years of your life chasing down weird bugs, at least on those projects fortunate enough to use a modern Perl or another language with working Unicode support.

In my experience, the NFC Unicode normalization form is most effective.
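To make that concrete, here's a small sketch with Unicode::Normalize (a core module since Perl 5.8): the two spellings of ï compare unequal as raw codepoints, but identical after NFC.

use v5.14;
use utf8;
use Unicode::Normalize qw(NFC);

my $composed   = "na\x{ef}ve";      # LATIN SMALL LETTER I WITH DIAERESIS
my $decomposed = "nai\x{308}ve";    # i followed by COMBINING DIAERESIS

say $composed eq $decomposed           ? "same" : "different";   # different
say NFC($composed) eq NFC($decomposed) ? "same" : "different";   # same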

8 Comments

Why not use Unicode::Collate or a similar tool to perform Unicode-aware search and comparison?

This way you can, with an appropriate comparison level, find "naïve" by searching for "naive". Isn't that a win?
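For reference, a sketch of that approach; at comparison level 1, Unicode::Collate considers only base letters, ignoring both diacritics and case:

use v5.14;
use utf8;
use Unicode::Collate;

my $collator = Unicode::Collate->new(level => 1);

say $collator->eq("naive", "Naïve") ? "match" : "no match";   # match

# index() performs a collation-aware substring search:
my @found = $collator->index("Naïve about Umlauts", "naive");
say @found ? "found" : "not found";                           # found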

I usually prefer the decomposed form for searches, because you can strip all the \pM marks from the text, so "naive" would match "naïve" and vice versa...
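A sketch of that trick, using a hypothetical search_key() helper: decompose with NFD, then delete everything that matches \pM:

use v5.14;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then strip combining marks.
sub search_key {
    my $key = NFD(shift);
    $key =~ s/\pM//g;    # remove the marks exposed by decomposition
    return $key;
}

say search_key("naïve") eq search_key("naive") ? "match" : "no match";   # match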

I'm having the same problem with Mac OS X HFS+ filenames: they're stored in NFD form, or rather in some Apple-specific variant of it, and in theory different Mac OS X versions use different variants. (Plus, HFS+ is case-insensitive by default, though it can be made case-sensitive.)

There is a module for "Mac" normalization http://search.cpan.org/~tomita/Encode-UTF8Mac-0.03/lib/Unicode/Normalize/Mac.pm

BTW, maybe for search it's better to use the NFKC/NFKD forms?

use Unicode::Normalize;
use open qw/:std :utf8/;
use utf8;
print NFKC("₉") eq "9";   # SUBSCRIPT NINE compatibility-decomposes to DIGIT NINE
print NFKC("⁹") eq "9";   # as does SUPERSCRIPT NINE

(prints two true values).

1. Search should normally use the NFK[CD] forms, as @vsespb points out. You want 'fish' to match "\x{fb01}sh" ("\N{LATIN SMALL LIGATURE FI}sh"); see the sketch after this comment.

1b. Maybe you should use both. (Both 's' and "\x{1e69}" should probably match "\x{1e9b}\x{0323}", to use the Figure 9 example from the Unicode note.)

2. If somebody did a version of the NFK?[CD] routines that did not normalize singletons, I would be happy; I've never understood that bit. To my mind, \x{ef} and \x{69}\x{308} unambiguously carry the same _meaning_, just as surely as \N{ANGSTROM SIGN} and \N{KELVIN SIGN} have a different (more precise) meaning than \N{LATIN CAPITAL LETTER A WITH RING ABOVE} (Å) and \N{LATIN CAPITAL LETTER K} (K). For a scientific search capability I worked on, we had to jump through hoops like you wouldn't believe….
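Both the ligature from point 1 and the Figure 9 example from point 1b check out under NFKC; a quick sketch:

use v5.14;
use utf8;
use Unicode::Normalize qw(NFKC);

# The fi ligature compatibility-decomposes to plain "fi":
say NFKC("\x{fb01}sh") eq "fish" ? "match" : "no match";             # match

# Figure 9: long s with dot above plus combining dot below recomposes
# under NFKC to s with dot below and dot above (\x{1e69}):
say NFKC("\x{1e9b}\x{0323}") eq "\x{1e69}" ? "match" : "no match";   # match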

@vsespb: I believe you are correct depending on the data you are searching. The Unicode Consortium deems NFKC suitable for "loose matching ... These two latter normalization forms [NFKC/NFKD], however, do lose information and are thus most appropriate for a restricted domain such as identifiers." [0]

[0] http://www.unicode.org/faq/normalization.html

I'd also recommend NFKC for full text search. Another nice trick is to strip all combining marks after decomposition, so "naive" and "naïve" are equivalent. Anyone who's interested in a search engine that can do this out of the box might want to have a look at Apache Lucy:

https://metacpan.org/module/LOGIE/Lucy-0.3.2/lib/Lucy/Analysis/Normalizer.pod
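From the linked documentation (parameter names as I recall them from the 0.3.x POD, so treat this as a sketch rather than gospel), configuring the analyzer looks roughly like:

use Lucy::Analysis::Normalizer;

my $normalizer = Lucy::Analysis::Normalizer->new(
    normalization_form => 'NFKC',
    case_fold          => 1,   # Unicode case folding
    strip_accents      => 1,   # drop accents after decomposition
);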

The search takes place in PostgreSQL, so I'd have to write a PL/Perl extension. That's possible, but I'd rather let the database do as much of the work as possible.
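A rough sketch of what that extension might look like; nfc_normalize is an invented name, and the function needs the untrusted plperlu language because it loads a module:

-- Hypothetical PL/Perl wrapper around Unicode::Normalize.
CREATE OR REPLACE FUNCTION nfc_normalize(text) RETURNS text AS $$
    use Unicode::Normalize;
    return NFC($_[0]);
$$ LANGUAGE plperlu IMMUTABLE;

-- Index and query through the same function so both sides normalize identically:
-- CREATE INDEX entities_name_idx ON entities (nfc_normalize(lower(name)));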

We'd end up storing more data (searchable columns versus displayable columns), but that's an interesting approach. I can see its advantages.
