Programming Breaks Things

Computer scientist Edsger Dijkstra famously said "It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration."

I disagree, in principle and in practice. (I disagree so strongly that I work on a project to teach programming to children.)

I believe it's almost impossible to teach programming to someone who hasn't experienced what we USians call "Geometry". That's mathematics: not the specific behavior of triangles and angles and their relationships, but the hard work and creativity and even beauty of following a set of logical rules to a desirable conclusion. People who can do that can program effectively. People who can't do that will struggle.

Before you can solve a big problem, you have to break it.

One of my work projects is a document categorization system. I've written before that it uses a pipeline processing model, where a document moves through the pipeline in various named stages. One stage might be "NEW", while another might be "EXTRACT METADATA". As the system runs, documents make their way through the pipeline in various stages and eventually enter a search index and an archive intended for users.

Documents come from various places, and it's possible for identical (or near-dentical) documents to enter the system at various times. I've long had an exact title match filter as a first approach to remove duplicates, but it's never filtered out enough duplicates. (Some documents are essentially press releases barely edited and republished by multiple news organizations. These documents are almost never interesting and are frustrating in their sameness within the archive, but in the system they go regardless.)

We've talked about several approaches to finding duplicate and near-duplicate articles, with everything from heuristics to identify title similarity to maintaining multiple latent semantic indexes for each unique category of documents. I dragged my feet on the latter because documents expire after 90 days, and managing an n-dimensional corpus search space where one of those dimensions is also time was more work than I wanted.

Wednesday I realized that a naïve approach could give really good results while being easy to code and, more importantly, very quick to run. I coded and deployed it yesterday, and tuned it and deployed an improved version as I was writing this very paragraph. I added a new processing stage which makes a word histogram for every new document entering the system and compares those histograms to existing articles. If they're similar enough, the new article gets invalidated before it enters the search index or undergoes any further processing.

It's silly, but it works. It's 108 lines of code, per sloccount.

I realized something while writing it: programming is breaking things.

Long years of programming experience have taught me that most problems are too big. Most functions are too long. Most methods are too long. Most entities in the system do too much.

If you read much novice code, you see long functions (if you see functions at all) with deeply nested conditionals and mutable state mutating all over the place, because a variable at the top of the program gets used all throughout the entire program. You see a mess, and you see a maintenance burden, and you see someone flailing to control something that's grown way out of hand.

(You see this in part because people trying to learn how to program are also learning the syntax and semantics of a programming language, and until you know the vocabulary rules, you're going to have trouble understanding nuance of meaning and metaphor and idioms.)

I had no trouble writing this code in in the small because I know the tools Perl provides for me: hashes and arrays and methods.

I had little trouble writing this code, because I understand the pattern of fetching a document at a time from an iterator and processing it to get a histogram and putting that histogram in an array for later processing.

I had an easy time testing this code because I know how to write testable code: each of my methods has a well-defined input and a well-defined output and I can test only at those boundaries to see what happens.

Even though you don't know the details of this system, if you're a decent programmer, you can probably write an outline of how the code works just from how I've described it already:

Get a collection of all active extant documents
Iterate over them
Fetch a histogram of each
Get a collection of all new documents
Iterate over them
Fetch a histogram of each
Compare each to every document in the histogram array
Invalidate the document if it matches any histogram too closely
Add the document's histogram to the array

You can probably guess the names of my methods. If you're not exactly right, you're close.

This is the discipline and experience that sets a good programmer apart from a novice. Sure, a novice (or an undisciplined programmer) could write twice as much code to do the same thing and get it working. Maybe he or she could write four times as much code. (I don't pretend that my factoring of this code is the rightest way to do it, but I do know that it passes multiple tests.)

That's my writing in the small. My writing in the large is even more interesting.

Each stage in the pipeline is its own self-contained class. I call them app classes. Every app class conforms to an interface and gets run by a runner. Every app connects to a defined logger and performs its own registration and reporting.

Every app has a method to fetch its basic resultset (every app is part of a processing pipeline; obviously it's going to iterate over documents in a certain state). Every app has method hooks to fire before this iteration and after it. Every app has a process() method which performs the iteration.

I've extracted and formalized the thirteen app classes over the past several months. They started as a series of individual scripts. Then they had a common base class. Now they share code with roles, take configuration out of a common configuration file, and register themselves when loaded as plugins. They can run separately (great for testing) or all together (as is normal).

I knew from the start that I was writing suboptimal code I'd eventually have to change, but that's because I didn't know enough about the problem yet. I'd discover that as the project went on. I'd gain more insight as I saw what kinds of documents we'd have to handle (and how very strange some of them are compared to what we expected).

The original concept of refactoring always reminds me of math. We rearrange things to make them clearer, to prepare us to do other work, or harder work, or at least further work. It's not change for change's sake, and it's not change to add or remove or modify behavior. It's nothing more or less than changing the design of things without changing their behavior.

It's the same skill, from writing functions of the right name and size to putting modules in the right places with the right contents. It's about breaking big things into smaller things. It's about breaking things into the right things.

(Dijkstra is right that BASIC affords few abstraction possibilities to break programs into effective and distinct components, but for novice programmers the experience of turning what seems like a simple task into the steps required to accomplish it is an important experience. That's also one reason why Modern Perl: The Book uses small test programs to demonstrate language features: working in small steps is too important to ignore.)

Tags:

Modern Perl: The Book

Categories

Monthly Archives

Pages

About this Entry