With multiple new Perl books in progress as well as some fiction and other books, I've spent a lot of time lately working on our publishing process. Part of that is building better tools to build better books, but part of that process is improving the formatting based on what those books want and need to do.
It's a good thing I know how compilers work.
(If you're reading this and you're a self-taught programmer, you're in good company. I took a BASIC programming class in 1983 and a typing class in high school, and that's all the formal education I have. Everything else I've learned through stubborn experimentation over the course of several years. I can't promise that you'll have the same opportunities I've had, but I can promise that the learning material exists.)
I've spent a lot of time customizing Pod::PseudoPod to enable additional features, and much of that time I've spent working around the limitations of its processing model.
POD is a line-oriented format. It's reasonably simple. A heading is =head1 in the left-most column. A paragraph is a chunk of text separated by one or more newlines from another chunk of text.
Then you get to things like lists:
=over 4

=item * A list item

=item * Another list item

=item * A third list item

=back
The lack of ending tags means that a parser has to have heuristics to decide when something ends. Sometimes that works, and sometimes it doesn't.
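To make the heuristic problem concrete, here's a toy line-oriented scan (plain Perl, deliberately not Pod::PseudoPod's actual code): with no end tag, the parser only learns that a list item has ended when it sees whatever starts next.

```perl
use strict;
use warnings;

my @lines = split /\n/, <<'POD';
=over 4

=item * A list item

=item * Another list item

=back
POD

my (@items, $current);
for my $line (@lines) {
    if ($line =~ /^=item \* (.*)/) {
        # the previous item ends only because a new one begins
        push @items, $current if defined $current;
        $current = $1;
    }
    elsif ($line =~ /^=back/) {
        # ... or because the whole list ends
        push @items, $current if defined $current;
        undef $current;
    }
}

print scalar @items, " items\n";
```

The "end" of an item exists nowhere in the text itself; it's inferred from what follows, which is exactly where the heuristics creep in.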
P::PP and its emitters tie together the acts of recognizing the start and end of semantic elements (headings, text, lists, list items) and emitting a different representation of those elements (HTML, XHTML, LaTeX). That is to say, the method called when the parser determines that a list item has ended may well emit a chunk of XHTML.
When extending and modifying this emitter, you must be careful not to rule out possibilities you hadn't considered (can you nest lists in table cells?) and equally careful not to break existing code. Eventually the rules for handling literal sections, where whitespace is significant, set flags read by the rules for emitting code so as to delay emission until the literal section has really, truly ended. The system, which was always a state machine, has become an ad hoc state machine with its rules spread all over.
In other words, it's like just about every other ad hoc parsing system that's grown too big for its initial design. The mechanism of emitting a transformed representation dictates too much policy of how that representation should work.
P::PP has an advantage in that emitters are objects, so you can override specific methods and customize behavior, but as the state machine transitions spread like non-native kudzu across the American Southeast, overriding becomes as much about not doing something as about doing something. That's a really bad sign.
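To illustrate the shape of the problem, here's a toy emitter class (invented names, not Pod::PseudoPod's real API) whose event methods emit HTML directly. A subclass that wants LaTeX list items ends up overriding end_item just to *suppress* the parent's output:

```perl
use strict;
use warnings;

package ToyEmitter;

sub new { bless { out => '' }, shift }

# recognition and emission are fused: these event methods emit HTML directly
sub start_item  { my $self = shift; $self->{out} .= '<li>'  }
sub end_item    { my $self = shift; $self->{out} .= '</li>' }
sub handle_text { my ($self, $text) = @_; $self->{out} .= $text }

package LaTeXItems;

our @ISA = 'ToyEmitter';

sub start_item { my $self = shift; $self->{out} .= '\item ' }

# overriding here is about *not* doing what the parent does
sub end_item { }

package main;

my $e = LaTeXItems->new;
$e->start_item;
$e->handle_text('A list item');
$e->end_item;

print $e->{out}, "\n";
```

Half of the subclass exists only to undo the base class's behavior, which is the bad sign the text describes.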
It's a good thing I know how compilers work.
A compiler works by parsing source code. Every character in the source code belongs to a lexical component: an identifier, an operator, a literal, a keyword. A grammar groups those lexical items into rules which determine which programs are valid and which contain syntactic errors.
The result of running source code through a (modern/efficient/effective/non-toy) grammar is some sort of tree structure. A tree has the interesting property that you can traverse it from its single root node and evaluate it to a result. That is to say:
say 'Hello' . ' world!';
... could become a tree where the root is the say operation, its only child is the concatenation operation, and the concatenation's children are the two constant strings. To evaluate the program, you descend depth first until you can descend no further, then evaluate the leaves of that tree in a defined order and walk back up the tree.
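Here's a minimal sketch of that evaluation in Perl, with hand-built hashref nodes standing in for a real parse tree:

```perl
use strict;
use warnings;

# the tree for: say 'Hello' . ' world!';
my $tree = {
    op       => 'say',
    children => [
        {
            op       => 'concat',
            children => [
                { op => 'const', value => 'Hello'   },
                { op => 'const', value => ' world!' },
            ],
        },
    ],
};

sub evaluate {
    my $node = shift;

    # leaves evaluate to themselves
    return $node->{value} if $node->{op} eq 'const';

    # descend depth first, evaluating children left to right
    my @results = map { evaluate($_) } @{ $node->{children} };

    # ... then walk back up, combining the results
    return join '', @results if $node->{op} eq 'concat';
    do { print "$_\n" for @results; return } if $node->{op} eq 'say';
}

evaluate($tree);
```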
You can represent a document in the PseudoPod sense in the same way. A book contains chapters which contain sections. A table contains rows which contain cells. A cell may or may not (depending on the formatting rules for your document) contain headers or lists or other formatting.
If Pod::PseudoPod built a tree of objects in a document object model, and emitters visited those nodes to produce their transformations, getting those transformations right would be much easier.
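A toy version of that tree-plus-visitor arrangement (invented node types, not Pod::PseudoPod's actual object model) might look like this: the walk over the tree is one piece of code, and what each node becomes is a separate table of handlers you can swap out per output format.

```perl
use strict;
use warnings;

# a hypothetical document tree: a section containing a heading and a paragraph
my $doc = {
    type     => 'section',
    children => [
        { type => 'heading',   text => 'Lists'                 },
        { type => 'paragraph', text => 'POD is line-oriented.' },
    ],
};

# an emitter is just a table of per-node-type handlers;
# an XHTML or LaTeX emitter would be another such table
my %html_emitter = (
    section   => sub { my ($inner)        = @_; "<div>$inner</div>"     },
    heading   => sub { my ($inner, $node) = @_; "<h1>$node->{text}</h1>" },
    paragraph => sub { my ($inner, $node) = @_; "<p>$node->{text}</p>"   },
);

# the traversal knows nothing about any output format
sub visit {
    my ($node, $emitter) = @_;
    my $inner = join '',
        map { visit($_, $emitter) } @{ $node->{children} || [] };
    return $emitter->{ $node->{type} }->($inner, $node);
}

print visit($doc, \%html_emitter), "\n";
```

Because recognizing structure and emitting output are decoupled, adding a new target format means writing a new handler table, not threading flags through a state machine.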
You can still get the correct results with the existing approach (or a whole suite of regular expressions or XSLT or... whatever), but one of the nice parts about learning a little bit more about computer science is finding that the solution to one problem helps solve many, many other problems, even in domains you might have thought were unrelated. If you don't believe me, read Steve Yegge's Rich Programmer Food.
The only problem for me is that I hate writing parsers.