I spend a lot of time profiling and optimizing Parrot and Rakudo. Even though we have added several features to Parrot since the 1.0 release in March and the 1.4 release in July, we've improved speed and memory use by measurable amounts. We haven't yet reached the point of powerful, orders-of-magnitude optimizations, but we're preparing for them.
I've written before that I believe the single optimization principle for writing a compiler and runtime for a dynamic language is to replace as many aspects of genericity and flexibility with straight-line code as possible. One optimization from yesterday demonstrates this in a dramatic fashion.
The oofib.pir benchmark exercises method calls and argument passing. It also creates garbage collectable objects. It's a little bit heavy on math operations for a general purpose benchmark, but I like that it performs frequent invocations. Optimizing those helps almost every real program, as does improving the garbage collector.
I normally perform profiling work with Callgrind. I've not found a tool more useful to count raw instruction counts in the processor. Yesterday I tried Cachegrind instead. Both tools are usable with the KCachegrind visualizer, making them more useful. Where Callgrind measures the instructions actually executed, Cachegrind measures the processor cache behavior of the program.
As modern processors are so much faster than memory (even fast processor caches), sometimes it's worth trading a few cycles for fewer cache misses. Similarly, as modern processors can have deep pipelines and perform speculative execution, rewriting code to avoid branch mispredictions can have a positive benefit on performance as well.
Branch prediction is a processor feature which analyzes a branch condition
if condition, for example -- and predicts where execution
will go. Speculative execution performs the operations of that expected branch
even before the processor knows that its prediction is correct. If it's
guessed correctly, there's no penalty and a big improvement over having to
wait. If it's guessed incorrectly, it has to throw away work. It's a risk,
but it usually guesses correctly.
Sometimes it needs help.
Parrot's default garbage collector is an optimized-but-clunky stop-the-world mark and sweep collector. We wanted to replace it with a concurrent collector with better throughput before Parrot 1.0, but the time and expertise necessary to refactor the internals to make it possible to change the GC without breaking the code for everyone else was a larger investment than we could produce in time for the 1.0 release. (We're in much better shape now.)
Our garbage collector tracks two similar-but-different types of objects: STRINGs and PMCs. Think of a STRING as a chunk of binary data with an encoding and a length and a buffer. Think of a PMC as an object. A STRING is a single, self-contained unit. A PMC may contain attributes which refer to other PMCs. There are finer distinctions between them, but that's all you need to understand now.
To simplify our GC, both STRING and PMC have a shared header called PObj. This structure contains the garbage collector flags: what type of PObj is it (STRING or PMC), is it live or free, does it have a custom mark behavior, does it have custom destruction behavior. Note that both STRINGs and PMCs share the first two flags, while only PMCs have the latter two.
A stop-the-world mark and sweep collector starts from a root set of objects that the runtime knows are still alive. It traverses that root set, marking everything as a live and recursing into any PObj reachable from that root set, recursing into any PObj reachable from that set, and so on, until it's marked the entire graph of reachable objects as alive. Then it sweeps through the pools containing all allocated objects, freeing anything not marked as live and clearing all flags.
There was a single "mark this PObj as live" function. Whether we wanted to mark a STRING or a PMC, we called the same function, casting the object into a PObj.
The astute can pause here to say "Yes, of course, you're throwing away information there!"
Parrot r41447 added separate functions to mark PMCs and STRINGs as live to let the C compiler tell us if we made any mistakes about what we marked as live. (We've had a couple of bugs from expecting the wrong thing.)
My Cachegrind profile showed a huge amount of branch prediction misses in the PObj marking function. Specifically, the processor could never predict whether it was marking a STRING or a PMC. As you might expect, marking a STRING as live means only setting a flag, while marking a PMC requires checking if it has custom mark behavior and potentially recursing. There's no way to predict which path through the live object graph the mark phase will take, and there's no way to predict whether the branch predictor will see a run of PMC, PMC, PMC and get on a good train of prediction or whether it'll flip back and forth between PMC, STRING, PMC and continually be wrong. (The mark phase is deterministic for any code without random allocations, but there's too much data to predict any pattern.)
As we now had separate functions to mark each, I pulled the guts of the single mark function into the two helper functions... and reduced the number of branch mispredictions by over 70% in that benchmark.
For a further optimization, I realized that there's no need even to call the marking function for STRINGs, at least for code in the core of Parrot which can flip the flag directly.
The end result is a little bit more code -- not much, maybe a dozen lines -- but a huge increase in clarity, an improvement in simplicity, and better optimization and performance characteristics. The compiler now helps give us warning messages where it matters most (correctness), and we get better performance at a level normal profiling can't even see.
I even have the temptation to call this pattern "Replace Conditional with API."