April 2010 Archives

Should Novices Prefer cpanminus?


I'm finishing the first draft of Modern Perl: the book. Part of that process is clarifying the intended audience:

I assume readers have some familiarity with Perl. They should have it installed and should know how to write, edit, save, and run Perl programs. They don't necessarily have to have finished reading a tutorial such as Learning Perl or Beginning Perl, but they should be sufficiently familiar with programming to be able to follow along with examples.

I try not to assume complete knowledge of even basic constructs; I try to explain them in detail, as understanding subtleties of design and implementation is important to mastering the subject of Perl.

Part of that process is deciding what's important to cover and why. For example, any book which discusses modern Perl has to discuss the CPAN and CPAN clients and installation of distributions from the CPAN. (You can't get away with writing about modern Perl without recommending Moose or Try::Tiny, for example.)

I don't want to assume that readers have configured their CPAN client correctly, nor that they have installed distributions from the CPAN before. Yet I also don't want to write pages of tutorials on configuring CPAN.pm versus CPANPLUS. I'd rather link to a tutorial somewhere like on the Perl 5 Wiki and get on with the work of explaining how to understand Perl than how to perform system administration.

Then a couple of people said "Why don't you tell them to use cpanminus instead?"

That offers advantages, especially in its speed, its lack of output when things just work, and the lack of necessary configuration.

That also offers disadvantages, in that there's less documentation in the wild about how to use cpanminus. As well, it's a young project and may not prove as long-lived as the other clients. My final concern is about debugging failed builds, tests, or installations. Though that's not always easy or obvious for users to debug with the other CPAN clients, I wonder if the additional step of skimming the build.log created on a failed installation is one level of difficulty too much for users.

Then again, when things go right, cpanminus is so much easier to use, it's almost no contest.

Would you recommend that novices skip over CPAN.pm straight to cpanminus now?

Test-Driven Learning

I learned Perl in the late '90s, after I took a job as a system administrator. I'd programmed a little bit in my previous job: I'd written a small application for the customer service help desk, and I'd written a proof of concept "Notify me when this website has updated!" system that somehow never had its rewrite from Bourne shell to Java 1.1.

After spending two months fixing everything my predecessor had left unfinished, I had a lot of spare time. Good system administrators do. That's why they're so prickly; they hone their arguing skills by bickering on Usenet all day.

Instead, I taught myself Perl. I read a couple of books. I wrote a few programs. Yet the most important technique I've ever learned was to read and write and modify the code of other people.

Back then, I had to read comp.lang.perl.misc for Perl questions. I'd look at the example code, when provided, then run it on my own machine and see if I could figure out the answer. This is much easier now. Now you can go to PerlMonks (for one example) to find tens of thousands of questions and answers and comments.

Short examples were always the best. I had to learn syntax and idioms. Syntax is easy, if you know how to use perldoc. Idioms are more work, but if you pay attention, you can often get invaluable explanations.

Even still, the best way for me to learn was through experimentation. I wish back then I had what I have now, and that's a test framework for learning:

use Test::More 'no_plan';

sub some_example_function { ... }

is( some_example_function( 'a string' ), 'some result', '...' );
is( some_example_function( 1234567 ),    7654321,       '...' );


Writing hundreds, perhaps thousands, of tiny programs to test pieces of Perl syntax or idioms or techniques and to find out what breaks, what changes, and how I could modify or chain them taught me more than reading a dozen books ever could. In effect I performed what every halfway-decent programmer already did when building programs: ad hoc testing in small pieces.

Test-driven learning offers the same advantage to learning how to program, how to program in a given language, and how to program well in a given language that test-driven development offers to writing and designing and maintaining an application: a formalized system for immediate, unambiguous feedback. Each of the assertions I never wrote would have forced me to express and to formalize my understanding of how Perl works. Each assertion is "yes" or "no". Assertions tend to be small and self-contained. Well-written assertions can read like English descriptions of the expected behavior.
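As a minimal sketch of what such a learning session might look like, here is a small test file that pins down a few pieces of core Perl behavior. The particular questions asked are hypothetical examples chosen for illustration, not exercises from the original post; only Test::More and Perl builtins are used.

```perl
use strict;
use warnings;
use Test::More tests => 4;

# Question: what does reverse do in list context?
my @list = reverse 'a', 'b', 'c';
is_deeply( \@list, [ 'c', 'b', 'a' ], 'reverse in list context reverses the list' );

# Question: what does reverse do in scalar context?
my $string = reverse 'hello';
is( $string, 'olleh', 'reverse in scalar context reverses the string' );

# Question: what is an array in scalar context?
my $count = @list;
is( $count, 3, 'an array in scalar context is its element count' );

# Question: what does the x operator do to a string?
is( 'ab' x 3, 'ababab', 'x repeats a string' );
```

Each assertion records one guess about the language; a failure means the guess was wrong, which is the feedback the exercise exists to provide.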

Novices who will become good programmers do this already. Perhaps novices who adopt a test-driven learning system will become better programmers.

From Novice to Adept: Perldoc


Some people say that Perl (at least versions 1 through 5) is a cleaned up dialect of the language called Unix. Certainly that's how I develop. Unix is my IDE, and I use Unix tools as much as possible.

You can see the schism between Unix developers and everyone else in the Perl world. Cygwin doesn't get as much attention and testing and bugfixing as it deserves. Dealing with shared libraries and installation on Windows and Mac OS X often requires special skills and knowledge and dedication that aren't always available or obvious or interesting to those of us for whom Unix and the free Unix-alikes just work.

For maximum fun and frustration, install Strawberry Perl on a friend's computer, then tell them they're going to learn how to program. Try to explain that your preferred approach mixes several terminal windows with GNU Screen and copious command-line utilities, then try to get them past the "Hello, world!" stage. (The real point of that exercise is to teach you how to write, save, compile, link, and run programs. Skip any step not necessary in your language.)

Certainly the bundling of Padre helps, but the mindset of the Unix hacker runs deeply through Perl culture. This is not a bad thing; it's the source of much goodness in the language and its ecosystem. Yet you can't be a productive Perl hacker unless you know that it's there.

Consider perldoc. A very naïve count of words in pod/ in bleadperl today suggests that the core Perl 5.13.0 distribution contains 740,000 words of documentation. Take out the deltas between releases and you still have 576,000 words of documentation. That's almost six novels' worth of words, unless you're a prolific fantasy author, in which case that's the filler in your bookshelf-destroying series. That's only the core documentation. That doesn't count the documentation of the core libraries.

If you want to be a good Perl developer, you have to know that it exists and how to use it.

perldoc perltoc lists and describes all of the documents in the core documentation. Type the name of any of the files listed as the argument to perldoc to learn more.

perldoc perlfunc describes Perl's built-in functions, like push and chomp. If you don't remember the name of a function, you can skim through this file (especially its listings of functions by category) to find it. Most adept Perl programmers use perldoc -f funcname, however. I can't remember the order of return values from caller, so I type perldoc -f caller and skim the example code.

Get used to referring to the documentation. That's how good programmers work.

If you're not sure what you need to look up, but think you know how to describe it, perldoc -q keyword searches the Perl FAQ for the appropriate question. I use this less than perldoc -f, but I don't ask many of those questions about Perl.

Any module worth using has documentation. Type perldoc Module::Name to read its documentation.

There's plenty more documentation to read, such as perldoc perlsyn, which explains the language's syntax, or perldoc perlop, which describes operators. Even so, if you only know the -f and -q flags and the existence of perldoc perltoc, you're well on your way to understanding Perl.

More and more I realize that good software design minimizes the amount of things you have to care about at any one time. Well-designed programs take advantage of abstraction possibilities of languages and libraries to model the problem and its solution in the most effective way. Well-designed languages minimize the syntactic concerns necessary to produce those abstractions.

In unsurprising news, the default Perl 5 object system shows its limits in that you have to think about Perl 5 reference syntax and objects and encapsulation, genericity, abstraction, and polymorphism all at once. Moose encourages people to do the right thing by providing abstractions that encapsulate the concerns of other levels of abstraction. Inside-out objects did something similar.

I realized this yesterday when writing about the state feature introduced in Perl 5.10. If you're a fan of minimalist languages which provide one and only one obvious way to do things, you'll hate this explanation, but at least you'll know why you're wrong.

state declares a lexical variable which maintains its state even after control flow leaves its lexical scope. In other words, these two snippets of code are almost entirely equivalent:

# the closure approach
{
    # visible only to subs defined inside this block
    my $count = 0;

    sub add_user
    {
        my ($user, %data) = @_;
        $data{user_id}    = $count++;
    }
}

# the state approach
use feature 'state';

sub add_user
{
    # initialized once, on the first call
    state $count      = 0;

    my ($user, %data) = @_;
    $data{user_id}    = $count++;
}

The one potential difference is that the initialization of $count in the first example must take place before the first call to add_user().
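To make that difference concrete, here is a small sketch of the trap. The function names next_id_closure and next_id_state and the starting value of 10 are hypothetical, chosen so that an uninitialized counter is easy to tell apart from an initialized one.

```perl
use strict;
use warnings;
use feature 'state';

# Called before the block below has executed, so its $count is still undef.
my $too_early = next_id_closure();    # 0, not 10

{
    my $count = 10;                   # assigned only when execution reaches this line
    sub next_id_closure { return $count++ }
}

my $after_init = next_id_closure();   # 10

sub next_id_state
{
    state $count = 10;                # initialized on the first call, whenever that is
    return $count++;
}

my $state_first  = next_id_state();   # 10
my $state_second = next_id_state();   # 11

print "closure: $too_early, $after_init\n";
print "state:   $state_first, $state_second\n";
```

The closure version silently hands out 0 when called too early, while the state version cannot be called before its own initialization.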

If you're careful to avoid that tiny potential trap, you can achieve the same effect with the closure code. Scheme and Python and even Java fans rejoice for a moment. Okay, that's long enough.

The problem is that—just as with arguing that you don't need fold because you have a for loop with iteration—that line of thinking ignores the fact that the syntactic overhead necessary to make the former example work is too high. Adding a single keyword to achieve the same semantics and avoid that tiny little trap also makes the resulting code more expressive. It's more declarative.

There's nothing wrong with the goal of a language with a minimal feature set. That's a fine goal, but it can't be the most important goal, and it can't be a goal in isolation. That's because sometimes adding a feature lets you remove unnecessary scaffolding.

I believe that it's better to pursue concision than artificial simplicity in program and language design.

The thing about volunteers is that they don't have to do what they're doing. If you're getting paid to hang out in an IRC channel and answer questions all day, that's one thing. If you're hanging out on an IRC channel all day because you want to, that's another.

The thing about volunteers writing software is that they don't have to do it. The same goes for volunteers writing documentation or reporting bugs or asking questions about how to use or install or configure that software.

The thing about the Perl community is that almost no one gets paid solely for participating in the Perl community. Sure, you can volunteer for a while to earn the cachet and the right to apply for a TPF grant at a fraction of the going consulting rate to justify continuing to work on the unpleasant parts of a project, but you're still effectively a volunteer.

The thing about volunteers is if it's not worth their time or energy or health or sanity or happiness to keep volunteering, they can walk away whenever they want. They have no obligation to continue to do what they do. Not even their sense of devotion or duty or guilt or community camaraderie should compel them to continue on projects that aren't worth their investment of time, and that's more than okay.

The thing about volunteers is that you can't force them to do anything. You can't force them to have your priorities. You can't force them to work to your schedule. You can't force them to work on your project and you can't force them to care about what you care about. They'll do what they want to do when they want to do it and you either deal with it or you don't.

The thing about volunteers is that it's rare to have too many and it's far too common to have far too few. Thus healthy projects spend time and effort recruiting volunteers and keeping volunteers around and guiding the interests and energy and time of volunteers in productive ways, not only by making their projects pleasant and useful but by removing distractions and unpleasantness from their communities.

The thing about volunteers is that for every one willing to take the abuse and hostility from a few people, you can't tell how many orders of magnitude more potential volunteers find that hostility and abuse so distasteful that they refuse to consider the possibility that it's worth their time to contribute.

The thing about volunteers is that if you allow certain parts of the community to fester and to grow toxic, you're well on your way to having fewer and fewer volunteers who grow more bitter and eventually become a tiny little cluster of angry, angry people who can't do anything productive.

The thing about volunteers is that it doesn't have to be this way.

Certain Perl IRC channels don't have to be seething cauldrons of rage from burned out system administrators who castigate anyone who doesn't know the secret wordings of the arcane rituals by which insiders identify themselves.

Certain Perl forums don't have to devolve into arguments over whose web framework stole which idea from some other place, or whether it's clear that anyone who does or does not use one CPAN dependency or another has parents with specific unpleasant characteristics.

Certain Perl mailing lists don't have to debate whether people who work on one version of Perl or another are hateful fools whose only goal in life is to destroy everything good and sunshiney and organic.

Certain Perl blogs don't have to have comments accusing other volunteers of being liars or thieves or people of negotiable affection because said volunteers disagree on project management styles.

I suppose it's easier to destroy than to create, and it's easier to prove that you're right by demonstrating your scathing verbal wit with a keyboard, and it's easier to believe that you've won an argument if you reduce the other person to a cardboard cutout of simplistic, ridiculous beliefs. It's also easy to justify your decision to spread hostility if you can overlook the fact that the person you're castigating is a human being with complex motivations, goals, dreams, aspirations, beliefs, and emotions.

The thing about volunteers is that they don't owe you a thing.

If you want a Perl community full of hostile people who jump to hasty conclusions, who would rather nitpick and debate the specific meaning of words than understand what other people mean, and who are willing to throw wild accusations of crazy, hateful motives around, then you have an easy task. Just say nothing. Let it fester.

Me, I don't think that's the way to encourage a healthy community. After all, how silly is it to argue over how some other volunteer spends his or her time? Yet isn't that what we're doing?

If more of us speak up when we see this abuse and hostility, maybe we can discourage it. Maybe we can encourage people to try to understand and listen more, or at least to disagree politely if they must disagree. Maybe we can help people unwilling to be civil to find better hobbies than abusing other volunteers. Maybe we can make the Perl community and our IRC channels and our mailing lists and our forums and our comment sections places where potential volunteers want to participate because they know that we appreciate novices and we appreciate volunteers and we don't all have to do the same things or want the same things or agree on the same things to treat each other with respect.

After all, we're all trying to build great software to solve problems. Why should we borrow trouble?

I can forgive novices for writing clunky Perl code because they're following the example of far too many books and tutorials. If you date the Perl Renaissance to the year 2000 (as I do), then you can identify code written before that point and code written after that point.

If modern Perl is safer or easier or clearer or simpler or cleaner to write than legacy Perl, then it should be possible to explain how and why to use modern features in lieu of older features.

For example....

Three-argument open()

There are two forms of the open() function in Perl 5. The modern version takes three arguments: the filehandle to open or vivify, the mode of the filehandle, and the name of the file.

The legacy version has two arguments, only the filehandle and the name of the file. The mode of the file comes from the filename; if the filename starts (or ends) with any of several special characters, open() parses them off and uses them.

If you accidentally use a filename with those special characters with the two-arg form of open(), your code will not behave as you expect. This is especially a problem if you're not careful about sanitizing user input, and if any user input ever becomes part of a filename. Consider:

open my $fh, ">$filename" # INSECURE CODE; do not use
    or die "Can't write to '$filename': $!\n";

While this code appears to open $filename for writing, an insecure $filename could start with > to force appending mode, or - to open STDOUT (though I suspect you have to work really hard to force this). Likewise, code without any explicit mode in the second and final parameter is susceptible to any special mode characters.
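A minimal sketch of the appending case follows. The temporary directory, the file name example.txt, and the hostile "filename" are all assumptions made up for the demonstration; only core Perl and File::Temp are used.

```perl
use strict;
use warnings;
use File::Temp 'tempdir';

my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/example.txt";

# Start with a file that already has content.
open my $seed, '>', $path or die "Can't write to '$path': $!\n";
print {$seed} "original\n";
close $seed;

# Hypothetical hostile input: the "filename" begins with a mode character.
my $filename = ">$path";

# Two-argument open() sees ">>$path" and appends instead of truncating.
open my $fh, ">$filename" or die "Can't write to '$filename': $!\n";
print {$fh} "appended\n";
close $fh;

# Read the file back to see what happened to the original content.
open my $in, '<', $path or die "Can't read '$path': $!\n";
my @lines = <$in>;
close $in;

print @lines;    # both lines are present: the file was appended to, not overwritten
```

The code asked to overwrite the file, yet the original line survives, because the mode came from the data rather than from the programmer.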

Extracting file modes into a separate parameter to this function prevents Perl from parsing the filename at all and removes the possibility for this unintentional behavior. As Damian Conway has mentioned, using a separate file mode parameter also makes the intention of the code clearer:

open my $fh, '>', $filename # safer and clearer
    or die "Can't write to '$filename': $!\n";

The modern version of this code is safer and clearer, and it's been available since Perl 5.6.0, released on 22 March 2000. There's no reason not to use the modern version. (If you need your code to run on Perl 5.005, try a core module such as IO::Handle. If you need your code to run on older versions of Perl 5, you have my sympathy.)

Removing Friction

I migrated one of my CPAN distributions to Dist::Zilla yesterday. This seems like a little thing, but all of the Dzil plugins I use in the distribution remove one small step from managing the distribution. I don't have to update version numbers. I don't have to update the README file. I don't have to worry about copyright information, or specifying dependencies, or keeping my metadata files up to date.

I don't even have to use the PAUSE website to upload a new distribution.

In similar fashion, I migrated that distribution to Github. Moritz Lenz had found and fixed a couple of bugs. I'd already set up the distribution such that test cases were reasonably easy to add, and the code was simple enough that the fixes were fairly obvious. Moritz forked my distribution from Gitpan.

I had my own Git clone of that repo, where merging Moritz's changes took a couple of commands. Releasing a new version with those changes was almost immediate. After that, I converted my personal SVK over SVN repository to Git, made my own Git repository on Github, and cherry-picked the commits from my original fork to my new repository.

I never lost history. I never had to edit a conflict. I never had to ask Moritz to resubmit a patch. The only distribution editing I did was to remove unnecessary files from my repository and to ask Dzil to generate them.

I knew all of the existing behavior of my distribution continued to work, because all of the tests passed. I knew my distribution would work properly with PAUSE and the CPAN because all of Dist::Zilla's tests worked. I also know that if any problems arise, the difficulty of fixing them is solely the difficulty of finding and fixing the bugs, not of managing the process of fixing the bug and managing repositories or verifying previous behavior or wrangling uploads or editing metadata in files manually.

The distance between fixing a bug and distributing a new version to users has shortened, and that path is now smoother.

You don't have to use Git or Github or Dzil or Perl. Plenty of other good tools exist to manage complexity or to make complexity go away. Yet isn't that what we should do as programmers? We find barriers and difficulties and obstacles and we eliminate or minimize them, not solely because the new versions have a novelty factor or have greater elegance and aesthetic appeals, but because the relentless process of simplifying removes artificial complexity and structural scaffolding that all too often distracts from the real problems we need to solve.

After all, Moritz and I want to publish an attractive, informative, readable, accurate book about Rakudo Perl 6. Software is just the means by which we do so.

Correct or Compatible, Pick One

Try::Tiny shouldn't become a core module in Perl 5, because it works around a series of infelicities of implementation.

If Perl 5 were to address this problem, someone would have to rethink several well-established, fundamental design and implementation and language decisions. Someone would have to test them, not just with all of the core tests, but with all of the CPAN. Someone would have to document the changes and hope that the updated documentation would eventually make its way into new books on Perl 5 as well as all of the example code on the Internet. Someone would have to update existing code, especially on the CPAN, to take advantage of the new features.

This is all possible. It's happened before. Code people care about gets maintained, and if new features are useful in that code, people will update it to take advantage of new features. Similarly, features removed from a language or library get removed from programs when they update to new versions of that language or library. None of this is new and none of this is surprising.

The real problem in the case of exceptions and scope exit semantics in Perl 5 is that the initial design didn't anticipate the possible edge cases which make the feature (occasionally) unreliable. You can design all of the formal semantics of a language as much as you want. You can spend ten years writing a specification. You can prove the initial implementation with formal methods. Yet as soon as users start doing things you didn't expect, you'll run into cases the specification and design and intent don't cover, and then you have a difficult question.

If someone changed the way scope handling and exit and call stack unwinding worked in Perl 5, such that exceptions could never get lost and lexical destruction always occurred in a predictable order and return values propagated to the appropriate places, Try::Tiny would be unnecessary...

... and someone (probably that same someone) would have to provide a compatibility mechanism for all of the existing code which relies on specific details—documented and otherwise—of scope handling and leave semantics and destruction ordering as it exists in Perl 5.12 right now, to give people time to migrate to the new system and to notify them that the old system is deprecated and to find design and implementation infelicities in the new system.

If reading that paragraph wearies you, imagine how much more wearying it would be to do all of that.

Granted, fixing scope handling and destruction ordering and leave semantics is as big a task in Perl 5 (or any language) as anything else. It's doable, from the technical side. Yet how much sixteen-year-old code needs to change to achieve it, and how many assumptions in the billions of lines of code written in the past sixteen years need to change to make it work? Worse yet, this isn't a Moose situation, where an obvious improvement is available to anyone with a CPAN client and a few minutes to read a tutorial.

I don't mean to dishearten anyone. The edge cases for which Try::Tiny exists are rare, and you can write Perl 5 code for years without encountering them. It's an easy module to use and it shields you from most of the damage.

Even so, fixing bugs the right way—making them impossible to encounter—is not always easy. You don't often get a fresh start in software, but sometimes that's exactly what you need.

Don't Core Your Workarounds


Exception handling in Perl 5 seems easy, until you realize all of the things that could possibly go wrong between the time your eval BLOCK exits and you check the global variable $@. Fortunately, Try::Tiny hides most of the difficult details from you, so you can concentrate on writing good exception handlers without worrying about all of the special cases that may eventually confuse and concern you.
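For readers who have not seen the problem, here is a sketch of a commonly used defensive idiom in plain Perl 5. It is not Try::Tiny's implementation, and the risky function is a hypothetical stand-in for any code that may die. The idea is to judge success by the return value of the eval block rather than by the contents of $@.

```perl
use strict;
use warnings;

# Hypothetical function that always fails, for illustration.
sub risky { die "something broke\n" }

# The block returns 1 only if nothing died, so success does not depend on $@.
my $ok    = eval { risky(); 1 };

# $@ may have been cleared in the meantime, so fall back to a generic message.
my $error = $ok ? undef : ( $@ || 'unknown error' );

print "caught: $error" unless $ok;
```

This covers only the "was $@ still set when I checked it?" question; it does nothing about the other edge cases discussed here.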

It was inevitable to see a suggestion to put Try::Tiny in the core, and the resulting discussion of conflicting goals and motivations and reasons to remove things from the core and suggestions of other things to put in the core was even more inevitable. (If you've read one thread like this before, you've read one too many.)

Yuval Kogman, the author of Try::Tiny, responded the other day, saying that Try::Tiny is a band-aid, not a solution.

Here's a design principle.

Exception handling in Perl 5 is difficult to use with complete safety and correctness. You have to beware of a few strange edge conditions that, in most software, never occur. When they do occur, they're strange and difficult because of the semantics of how scopes and exceptions and destruction and call-graph unwinding occur in Perl 5.

Making Try::Tiny a core module—and recommending it as the core-approved way of handling exceptions in Perl 5—enshrines that workaround as well as the flaws of implementation around which it works. The module exists as an alternative to a proper fix at the language and implementation levels. It's a patch. It's a workaround. It's not a controversial extension to the language that some people may want and others don't. Instead, it's a makeshift that offers more safety and correctness and abstractions around those relatively unknown idioms to help people write better programs.

In the same way, signatures is a workaround for the lack of a feature in Perl 5 as much as MooseX::Declare is a workaround for the lack of succinct boilerplate-reducing features in Perl 5.

Making extensions possible doesn't relieve language designers and implementors from the responsibility of providing necessary features and abstractions.

Devel::Declare is, in general, a good thing because it allows experimentation with language features and ideas that may be useful in the core eventually, or may be useful in specific domains, or may be unsuccessful, but at least provide that data. Safety and ease of experimentation help develop communities of invention and evolution.

... but you have to recognize workarounds for what they are. Next time, I'll explain the practical consequences of this tension.

Recent performance improvements in Parrot to avoid aggressive buffer copying and to avoid unnecessary buffer reallocations demonstrate how bugs, mistakes, and design infelicities at the lowest levels of your program stack can have dramatic negative effects on the whole program.

In the case of Parrot, they also demonstrate a worthwhile experiment currently underway.

What if we forbade modifying strings in place?

It sounds crazy, but consider the evidence. Most of the strings in Parrot and Rakudo get read many more times than written. Yet because strings can change in place, many parts of the system which return strings must return copy-on-write string headers, to avoid those modifications.

For example, the Class PMC in Parrot contains a string which, if present, represents the name of that class. Given a class object, you can ask for that string. The Class must make a copy-on-write header for this operation. If it didn't, any modification performed on the string it returned would change the name of the class in place, even unintentionally.

Almost none of the uses of this introspection interface on classes throughout Parrot and Rakudo ever perform any modifications on that string. In other words, the interface prevents something rare and catastrophic from happening while penalizing the common behavior.

Certainly creating a new copy-on-write string should be cheap, and most of these strings become collectable garbage very quickly, but there's no garbage collection mechanism cheaper than not creating garbage at all.

The important change is to forbid all string modification functions from operating on buffers in place. Instead, they create new string headers and return them directly. The caller has the responsibility of storing the modified string as appropriate. This pushes the allocation to the point of modification.

(Some might wonder "Why not modify the string in place and copy the old string?" That requires you to root around in memory to find all references to the existing string, and you're not going to do that quickly or safely without changing the way memory works throughout Parrot, or at least building a huge data structure to keep up to date about what's present where.)

There are two caveats to this system. First, there are legitimate performance reasons to allow in-place modification. Within a loop, appending to a single string can be much cheaper if you do allow modification: the rule about not creating and throwing away immediate garbage applies there. The right solution is some kind of StringBuilder container which allows in-place modification with immutable strings. (The JVM did, eventually, get this right.)

The second caveat is that high-level languages may support mutable string semantics. Yet a similar approach works for this as well; a high-level language should use a PMC as a container for primitive strings. The contents of that PMC can change, but if everything refers to that PMC, it effectively appears to change in place.
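The container idea can be modeled loosely in Perl. This is only an analogy under stated assumptions: StringBox is a hypothetical class invented for the sketch, it says nothing about Parrot's actual PMC interface, and Perl's own strings are mutable, so the immutability here is a convention of the class rather than something the language enforces.

```perl
use strict;
use warnings;

package StringBox;

# A container holding one string value.
sub new   { my ($class, $value) = @_; return bless { value => $value }, $class }
sub value { return $_[0]{value} }

# Build a new string and store it in the container; the old string is never edited.
sub append
{
    my ($self, $more) = @_;
    $self->{value} = $self->{value} . $more;
    return $self;
}

package main;

my $name  = StringBox->new('Hello');
my $alias = $name;              # both variables refer to the same container

$alias->append(', world');

print $name->value, "\n";       # Hello, world
```

Everything that holds the container sees the new contents, so the string appears to change in place even though each operation produced a fresh string.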

I expect measurable performance improvements from this experiment, somewhere between 5% and 10% on Rakudo benchmarks. Not only does this create far fewer garbage headers to collect (which decreases the amount of time spent in garbage collection), but it allows more pervasive sharing of constant strings. We have another experiment in progress to coalesce all identical strings into one representation, and that has great memory savings as well.

This experiment won't make it into Parrot 2.3, due on 20 April 2010, but we should have performance numbers by then. If it's suitable for merging, it should be available for Parrot 2.4 as well as Rakudo Star.

The Tyranny of Reifying COWs explained how fixing an overzealous memory copy in Parrot made Rakudo Perl 6 use much less memory. Unfortunately, it also slowed Rakudo substantially.

My favorite tool for profiling is the combination of Callgrind and KCachegrind. Callgrind emulates a CPU running the actual binary and gives reports about the number of instructions executed, branches taken, and call paths through a program. KCachegrind provides multiple ways to visualize and display this information. The visualization is perhaps the most important part; I often compare two or more runs through a program to see where optimizations have the most effect. (I'll never return to wallclock-based benchmarks—they're laughably irrelevant in their inaccuracy.)

Some focused profiling with this combination revealed that the big difference before and after the memory fix came from the code path when concatenating two strings. This is in the Parrot function Parrot_str_concat(). If both strings exist and have contents (if neither string is empty), the code copied the first string into a result string, then appended the second string to the result. That copy operation performed a reified copy on write.

I was familiar with that code, as that's where Vasily and I found and fixed the first bug. After digging into the mess that is Parrot_str_append(), the slowdown was obvious.

Before our change, reifying a COW string copied the entire buffer. If a substring pointed to five characters in the 90kb buffer representing the entire source code of a program, the copied string would get a new 90kb buffer. Because that buffer had room to spare, appending another few characters to that string required only copying a few bytes into the new buffer and updating the member of the string header which tracks how much of the buffer is in use.

After our change, the copied string's buffer was large enough only for the contents of the string itself. If that were five characters from the 90kb file, the copied string's buffer would be five characters in length. Here's the problem: appending anything else to that string meant reallocating that buffer immediately.

That immediate reallocation slowed Rakudo measurably, perhaps by a factor of four.

Vasily and I had the same reaction when we realized what was happening. Now instead of creating and immediately reifying a COW string, we create a new, empty string with the proper buffer size, then append both strings to it. Avoiding that reallocation—a completely unnecessary operation, as we have all of the information necessary to allocate a buffer of the proper size—sped up Rakudo once again, this time even faster than before all of these memory shenanigans.
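
A rough sketch of the difference between the two strategies follows. `GrowableBuffer` is a hypothetical model that counts allocations; it is not the code of Parrot_str_concat(), and growing the capacity merely stands in for a real reallocation.

```python
class GrowableBuffer:
    """Hypothetical buffer that counts how often it has to allocate."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()
        self.allocations = 1          # the initial allocation

    def append(self, more):
        needed = len(self.data) + len(more)
        if needed > self.capacity:
            # Not enough room: grow, which stands in for a realloc.
            self.capacity = needed
            self.allocations += 1
        self.data += more


def concat_copy_then_append(a, b):
    # Copy the first string into a buffer sized for it alone, then append.
    result = GrowableBuffer(len(a))
    result.append(a)
    result.append(b)     # overflows whenever b is non-empty
    return result


def concat_presized(a, b):
    # Allocate once with room for both strings, then append each.
    result = GrowableBuffer(len(a) + len(b))
    result.append(a)
    result.append(b)
    return result


slow = concat_copy_then_append(b"hello", b", world")
fast = concat_presized(b"hello", b", world")
```

Both produce the same bytes, but `slow` needs two allocations where `fast` needs one, because both lengths are known before any memory is requested.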

These two commits reinforce two lessons. First, optimization is the art of avoiding expensive recalculations of data you already know. Second, go faster by moving less memory around. People using Parrot and Rakudo shouldn't have to know the internals of how Parrot manages memory, but a healthy understanding of the philosophy of its mechanisms can help you write faster, leaner programs.

These two commits also argue for an internals change I've long considered useful in Parrot, and Vasily has already begun to experiment with it. It should be invisible to users of Parrot, except that your programs should run faster and use less memory. I'll explain that next time, in the conclusion of this series.

If your language has a split header/buffer system to represent strings, and you support mutable strings, you probably have a copy-on-write system. Copy-on-write (or COW) strings help you avoid making copies of buffers until necessary.

Given a 90kb file containing the entire source code of a program, it's likely the compiler, parser, runtime, and everything else has many, many strings pointing to various parts of the program. If nothing ever writes to any of these strings, they can all share the same buffer. You need a separate string header for each substring, but you can get away with a single buffer.

Parrot (and by extension, Rakudo Perl 6) does this.

When you make a copy of a string, whether as a substring operation or for some other reason, you allocate a new string header, but you copy the buffer pointer directly. Then you set a flag in the new string header indicating that any modifications to that string need to make their own copies of the buffer, rather than modifying it in place. This prevents you from modifying a buffer to which other string headers point.
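
As a sketch of that scheme, assuming single-byte ASCII data: `CowString` and its methods are hypothetical names, not Parrot's API, and the private copy here takes only the bytes the string covers.

```python
class CowString:
    """Hypothetical string header over a possibly shared buffer."""

    def __init__(self, buffer, start, length, cow=False):
        self.buffer = buffer      # bytearray, possibly shared
        self.start = start
        self.length = length
        self.cow = cow            # True: copy before writing

    def substr(self, offset, length):
        # New header, same buffer pointer, COW flag set on the copy.
        return CowString(self.buffer, self.start + offset, length, cow=True)

    def upcase_in_place(self):
        if self.cow:
            # Take a private copy of just this string's bytes first.
            end = self.start + self.length
            self.buffer = bytearray(self.buffer[self.start:end])
            self.start = 0
            self.cow = False
        end = self.start + self.length
        self.buffer[self.start:end] = self.buffer[self.start:end].upper()

    def text(self):
        end = self.start + self.length
        return self.buffer[self.start:end].decode("ascii")


source = CowString(bytearray(b"say 'hello'"), 0, 11)
token = source.substr(0, 3)          # shares the buffer with source
shared_before = token.buffer is source.buffer
token.upcase_in_place()              # forces a private copy
```

After the write, `token` reads "SAY" from its own three-byte buffer and `source` is unchanged. A complete implementation would also have to protect writes made through the original header; this sketch leaves that out.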

This is all well and good. Unfortunately, there was a bug in Parrot—not just a typo, but a deliberate bug.

The code which performs the actual copy portion of COW in Parrot checked for the COW flag, looked at the contents of the string header, and then copied the entire buffer into a new buffer. If you have a 90kb buffer representing the entire source code of your program and you have several dozen strings each representing a token in the parser sense, and if you want to modify those tokens, Parrot would allocate another 90kb buffer for each string.
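
To put rough numbers on that, here is a small sketch comparing the two copying strategies. The token count and token length are assumptions chosen for illustration ("several dozen" read as three dozen, five bytes each), the function names are hypothetical, and the data is treated as one byte per character.

```python
SOURCE_SIZE = 90 * 1024   # the 90kb source file from the text
TOKEN_COUNT = 36          # assumption: "several dozen" read as three dozen
TOKEN_LENGTH = 5          # assumption: a short token


def reify_whole_buffer(buffer, start, length):
    # Old behavior: copy the entire shared buffer for one small string.
    return bytearray(buffer), start, length


def reify_substring(buffer, start, length):
    # Intended behavior: copy only the bytes this string covers.
    return bytearray(buffer[start:start + length]), 0, length


source = bytearray(b"x" * SOURCE_SIZE)
starts = [i * 10 for i in range(TOKEN_COUNT)]

old_bytes = sum(len(reify_whole_buffer(source, s, TOKEN_LENGTH)[0]) for s in starts)
new_bytes = sum(len(reify_substring(source, s, TOKEN_LENGTH)[0]) for s in starts)

print(old_bytes, new_bytes)   # prints "3317760 180"
```

Under these assumptions, modifying three dozen tokens costs over 3 MB of copies with the whole-buffer approach and 180 bytes when copying only the substrings.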

Worse, a comment in the code said "Let's copy the entire buffer."

That's obviously wrong behavior, but the right behavior isn't as simple as it seems. Clearly it's important to copy only the relevant substring of the buffer before making modifications. Yet when the encoding of the buffer isn't the simple one-byte-per-character layout you might expect if you've never worked with anything more complex than Latin-1, you have to be careful about blindly copying memory around. Sometimes bugs, even deliberate ones like this, paper over other problems elsewhere.

When Vasily and I fixed the encoding problem, memory use when bootstrapping Rakudo dropped by two thirds. Unfortunately, performance suffered dramatically—but now that it was possible to build Rakudo again on machines with less than 2GB of memory, we decided it was better to build slowly than not at all, at least until we found the performance culprit.

That's a story for next time. In the meantime, very clever readers will have deciphered the subtext in these entries and the title and have probably already figured out what went wrong and why.

At last month's Portland Perl Mongers, the performance discussion came up. "Isn't it faster to write the bottlenecks of your application in C/XS?" someone asked.

Therein lies a pervasive myth of dynamic languages. It's not always faster to write in C. In my experience contributing to Parrot, the more data you pass back and forth between your high-level language and C, the slower things get. That is to say, reducing memory usage is as important to performance as anything else. (This assumes you've chosen the right data structures and intelligent algorithms.)

This came up recently, when Vasily Chekalkin and I committed two large performance improvements for Parrot visible in Rakudo Perl 6. Compiling the bootstrapped portion of Rakudo used steadily more and more memory. On my laptop, it topped out at 1.5 GB. Clearly this was too much.

When we fixed that problem, it compiled in 250 MB, but it took four or five times longer to compile. We fixed that problem too, and in so doing demonstrated that the effective use of memory is as important to performance as almost anything else.

First, you have to understand how strings work in Parrot.

Shared Buffers

A string in Parrot is two data structures. One of them is the string header, which contains information about the string's character set, its encoding, its length, and some flags which track constantness and copy on write information. The other data structure is a buffer, which represents a contiguous chunk of data. A string header points to a buffer—actually a location within a buffer, as the header points to the starting point of the string within the buffer and contains string length information.
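
A simplified model of those two structures might look like this. The class and field names are hypothetical and chosen for readability; they are not the members of Parrot's actual structs, and the character-set field is omitted.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """A contiguous chunk of bytes that several headers may share."""
    data: bytearray


@dataclass
class StringHeader:
    """Hypothetical header: where the string lives, plus metadata."""
    buffer: Buffer
    start: int              # offset of the string within the buffer
    length: int             # length of the string in bytes
    encoding: str = "ascii"
    is_constant: bool = False
    is_cow: bool = False

    def text(self):
        end = self.start + self.length
        return self.buffer.data[self.start:end].decode(self.encoding)


source = Buffer(bytearray(b"my $answer = 42;"))
whole = StringHeader(source, 0, len(source.data))
variable = StringHeader(source, 3, 7)   # "$answer", no bytes copied
```

Both headers point into the same `Buffer`; only the offset and length differ.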

You've probably already figured out that multiple string headers can share the same buffer. Buffers have reference counts so that garbage collection works properly. (You don't have to use a reference counting scheme, but it's much easier to manage this appropriately for a small system like this, where only a few places need to update reference counts and where precise destruction is useful.)
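
Here's a minimal sketch of reference-counted buffers, with hypothetical `SharedBuffer` and `Header` classes. It shows only the bookkeeping: each header takes one reference, and the storage goes away as soon as the last header does.

```python
class SharedBuffer:
    """Hypothetical buffer that tracks how many headers point at it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refcount = 0
        self.freed = False


class Header:
    """Hypothetical string header; holds one reference to its buffer."""

    def __init__(self, buffer, start, length):
        self.buffer = buffer
        self.start = start
        self.length = length
        buffer.refcount += 1

    def destroy(self):
        buffer, self.buffer = self.buffer, None
        buffer.refcount -= 1
        if buffer.refcount == 0:
            # Last reference gone: release the storage immediately.
            buffer.data = bytearray()
            buffer.freed = True


buf = SharedBuffer(b"say 'hello'")
whole = Header(buf, 0, 11)
token = Header(buf, 0, 3)
count_while_shared = buf.refcount

whole.destroy()
freed_too_early = buf.freed
token.destroy()
```

The buffer survives the destruction of `whole` because `token` still refers to it, and it is released only when `token` goes away too.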

Sharing buffers tends to mean using less memory overall. It makes taking substrings cheap, which is very useful when parsing large documents, such as the Perl 6 bootstrapping source code.

If your system also supports mutable strings, you can also perform copy on write (COW), where multiple string headers can point to the same buffer and the appropriate contents of the buffer get copied to a new buffer only when you modify a string in place.

I wrote that paragraph correctly. That's not what Parrot did, which is why building Rakudo used so much memory. That was the source of the first bug that Vasily and I fixed, and it inadvertently hid the second bug that fixing the first bug exposed.

Several Parrot hackers besides myself have come to the conclusion that Parrot should consider using immutable strings instead of mutable strings. That solves other problems.

I'll write more about the two bugs we fixed next week, as well as what we hope to gain with immutable strings. In the meantime, very careful readers can amuse themselves by speculating about what Parrot did, why it was wrong, and why the second bug was so annoying.

About this Archive

This page is an archive of entries from April 2010 listed from newest to oldest.
