My favorite tool for profiling is the combination of Callgrind and KCachegrind. Callgrind emulates a CPU running the actual binary and gives reports about the number of instructions executed, branches taken, and call paths through a program. KCachegrind provides multiple ways to visualize and display this information. The visualization is perhaps the most important part; I often compare two or more runs through a program to see where optimizations have the most effect. (I'll never return to wallclock-based benchmarks—they're laughably irrelevant in their inaccuracy.)
Some focused profiling with this combination revealed that the big
difference before and after the memory fix came from the code path when
concatenating two strings. This is in the Parrot function
Parrot_str_concat(). If both strings exist and have contents (if neither string is empty), the code copied the first string into a result string, then appended the second string to the result. That copy operation performed a reified copy on write.
I was familiar with that code, as that's where Vasily and I found and fixed
the first bug. After digging into the mess that is
Parrot_str_append(), the slowdown was obvious.
Before our change, reifying a COW string copied the entire buffer. If a substring pointed to five characters in the 90kb buffer representing the entire source code of a program, the copied string would get a new 90kb buffer. Appending another few characters to that string requires only copying a few bytes into the new buffer and changing the buffer length used member of the string header.
After our change, the copied string's buffer was large enough only for the contents of the string itself. If that were five characters from the 90kb file, the copied string's buffer would be five characters in length. Here's the problem: appending anything else to that string meant reallocating that buffer immediately.
That immediate reallocation slowed Rakudo measurably, perhaps by a factor or four.
Vasily and I had the same reaction when we realized what was happening. Now instead of creating and immediately reifying a COW string, we create a new, empty string with the proper buffer size, then append both strings to it. Avoiding that reallocation—a completely unnecessary operation, as we have all of the information necessary to allocate a buffer of the proper size—sped up Rakudo once again, this time even faster than before all of these memory shenanigans.
These two commits reinforce two lessons. First, optimization is the art of avoiding expensive recalculations of data you already know. Second, go faster by moving less memory around. People using Parrot and Rakudo shouldn't have to know the internals of how Parrot manages memory, but a healthy understanding of the philosophy of its mechanisms can help you write faster, leaner programs.
These two commits also argue for an internals change I've long considered useful in Parrot, and Vasily has already begun to experiment with it. It should be invisible to users of Parrot, except that your programs should run faster and use less memory. I'll explain that next time, in the conclusion of this series.