"Hmm," I found myself thinking the other day. "I've found and fixed quite a few potential bugs in this client application related to Unicode. Allison and I just went through and normalized user input so as to avoid casefolding errors. I wonder what happens if I try to register with a UTF-8 password."
Like all applications with a decent security policy, this application immediately hashes user passwords (it uses SHA-1 hashing instead of Bcrypt, but one thing at a time). When it creates a new user record, it uses Perl's Digest::SHA to hash the password before storing it in the database. When a user attempts to log in, the application performs a database query to look up the provided email address and the password, with SQL something like:
SELECT person_id FROM person WHERE primary_email = ? AND passphrase = sha1(?);
The assumption seemed reasonable; because SHA-1 is an algorithm with its details widely published and implemented, both PostgreSQL and Perl should provide the same hash, given the same input.
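As a rough sketch of the Perl side of that assumption (the hash_passphrase helper is hypothetical, not the application's actual code), hashing a plain ASCII passphrase with Digest::SHA might look like this:

```perl
use strict;
use warnings;
use Digest::SHA 'sha1_hex';

# Hypothetical helper: hash a passphrase before storing it.
# The real application's code is not shown in this post.
sub hash_passphrase {
    my ($passphrase) = @_;
    return sha1_hex($passphrase);
}

# An ASCII-only passphrase contains no wide characters,
# so every character is already a single octet.
my $digest = hash_passphrase('password');
print "$digest\n";
```

For ASCII input like this there is only one plausible byte sequence to hash, which is why the problem stays hidden until someone uses a passphrase outside that range.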
I took Tom's example from the Perl Unicode Cookbook's casefolding recipe (because I felt like this work was the data equivalent of rolling a boulder up a hill) and added a case to our registration tests with a password of Σίσυφος.
Digest::SHA croaked, complaining about wide characters.
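A minimal way to reproduce that failure, assuming Digest::SHA and writing the test password with explicit code point escapes so that the result does not depend on how the source file is saved:

```perl
use strict;
use warnings;
use Digest::SHA 'sha1_hex';

# The passphrase from the test case, as a character string.
# Every character here has a code point above 255.
my $passphrase = "\x{03A3}\x{03AF}\x{03C3}\x{03C5}\x{03C6}\x{03BF}\x{03C2}";

# Trap the exception instead of letting the program die.
my $digest = eval { sha1_hex($passphrase) };
my $error  = $@;

if (defined $digest) {
    print "hashed: $digest\n";
}
else {
    print "croaked: $error";
}
```

The digest functions work on bytes, and a character with a code point above 255 does not fit in one, so the call throws an exception rather than guessing at an encoding.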
I looked over the code again. I'd enabled UTF-8 literals. I'd saved the file with the proper encoding. We'd fixed the encoding of input and we were normalizing all input to the NFC form. Everything looked right.
Then, buried in the documentation of Digest::MD5, I found a reference that suggested that that module explicitly does not handle wide characters—that it only works on strings of 8-bit characters. Anything outside of Latin-1 is just out.
The documentation suggested explicitly encoding the character string into UTF-8 octets, then performing the digest...
... but when I did that, Perl and PostgreSQL disagreed about the resulting hash.
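The encode-then-digest approach from the documentation can be sketched as follows. This only shows the Perl side; it assumes UTF-8 is the encoding in question, and it says nothing about which bytes PostgreSQL ends up hashing for the same input:

```perl
use strict;
use warnings;
use Encode 'encode';
use Digest::SHA 'sha1_hex';

# The test passphrase as a character string (seven Greek characters).
my $passphrase = "\x{03A3}\x{03AF}\x{03C3}\x{03C5}\x{03C6}\x{03BF}\x{03C2}";

# Turn characters into UTF-8 octets. Each of these characters
# takes two bytes in UTF-8, so the result is 14 octets long.
my $octets = encode('UTF-8', $passphrase);

# The digest function now sees only byte values and no longer croaks.
my $digest = sha1_hex($octets);

printf "%d characters, %d octets\n", length $passphrase, length $octets;
print "$digest\n";
```

Whether this digest matches the one computed in the database depends on the bytes each side actually receives, which is the assumption the SHA-1 specification leaves to its callers.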
The super nice thing about standards is what they don't mention about the assumptions they make, and how they leave those assumptions up to implementations, and how when people try to do the right thing and run right up against those assumptions, sometimes they find out the difficult way that competing implementations have chosen very different approaches.
I spent the rest of the afternoon chasing down every place in the source code which hashed passwords in the Perl layer and changed them all to hash passwords in the database layer. All tests passed.
This bothers me for two reasons. First, I don't know which of Digest::SHA or PostgreSQL is doing the right thing, because I don't know what the right thing is. I can make a case for both behaviors, depending on whether I care more about doing what the user intends or being strict about the data at the interface. I've argued it both ways explaining it
Second, I went to all of this work to prevent bugs from occurring and to do the right thing for people who'll probably never notice that our code does the right thing—and I'm sure almost every website I've ever used in my life gets this wrong, including (especially?) banks.
That's only slightly horrifying.