Google, Perl Tutorials, and the Tyranny of the Extant


If you use a search engine to find a beginner's Perl tutorial, you're more likely to find a lousy Perl tutorial than a good one. (Perl Beginner's site is a good place to start.) The problem isn't Perl so much as a systemic problem with modern search engines.

Summary for skimmers:

  • New doesn't automatically mean better
  • Best is a term with necessary context
  • The popular has a tyrannical inertia
  • The solution isn't as easy as "Just publish more!"

If you remember the early days of the web, Yahoo's launch was a huge improvement. Finally, a useful and updated directory to the thousands of new websites appearing every month! Then came real search engines and search terms, and we started to be able to find things rather than navigating hierarchies or trying to remember if we'd seen a description of them.

(It seems like ages ago I managed to download 40 MB of scanned text of ancient Greek manuscripts to create my own concordance for research purposes, but this was 1996.)

Then came Google, and by late 1998 it had become my most useful website. The idea behind PageRank was very simple (and reportedly understood by a few other large companies who hadn't figured out what to do with it): people link to what they find useful. (Certainly I oversimplify PageRank, but you can test current versions inductively to see that it still suffers this problem.)
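
To show what I mean by oversimplification, here's a toy power-iteration sketch of that basic idea in Perl. Every page's score comes solely from the scores of the pages linking to it; nothing in the computation asks whether any of those links is actually a good recommendation. The pages, links, and numbers are all invented for illustration.

    #!/usr/bin/perl
    # Toy power-iteration sketch of the simplified PageRank idea described
    # above: a page's score comes entirely from the scores of the pages
    # that link to it. The pages and links are invented for illustration.
    use strict;
    use warnings;

    my %links = (
        'old-tutorial' => [ 'perldoc' ],
        'new-tutorial' => [ 'perldoc', 'old-tutorial' ],
        'blog-post'    => [ 'old-tutorial' ],
        'perldoc'      => [],
    );

    my @pages   = keys %links;
    my $damping = 0.85;
    my %rank    = map { $_ => 1 / @pages } @pages;

    for ( 1 .. 50 ) {
        my %next = map { $_ => ( 1 - $damping ) / @pages } @pages;

        while ( my ( $source, $targets ) = each %links ) {
            next unless @$targets;    # dangling pages just lose their share
            my $share = $rank{$source} / @$targets;
            $next{$_} += $damping * $share for @$targets;
        }

        %rank = %next;
    }

    printf "%-13s %.4f\n", $_, $rank{$_}
        for sort { $rank{$b} <=> $rank{$a} } @pages;

In this made-up graph, the pages with the most incoming links float to the top, which is exactly the point: popularity, not fitness for your purpose, is what gets measured.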

PageRank and Wikipedia have the same underlying philosophical problem: reality and accuracy are not epiphenomena arising from group consensus. (An epiphenomenalist or a full-fledged relativist might disagree, but I refute that by claiming I was predestined to believe in free will. Also Hegel is self-refuting, so there.)

PageRank's assumption is that people choose the best available hyperlink target. (For another example of the "rational economic actor" fallacy, see modern economics.) This is certainly an improvement over manually curated links, but without a framework for judging what "best" meant in the author's intent, or in the author's historical context at the time of writing, PageRank users cannot judge the fitness of a link for their own purposes.

(While I'm sure some at Google will claim that it's possible to derive a measurement of fitness from ancillary measures such as "How many users clicked through, then performed a search again later?" or "Did the search terms change in a session and can we cluster them in a similarity space?", you're very unlikely to stumble upon the right answer if the underlying philosophy of your search for meaning is itself meaningless. The same problem exists even if you take into account the freshness of a link or an endpoint. Newer may be better. It may not be. It may be the same, or worse.)

In simple language, searching Google for Perl tutorials sucks because consensus-based search engine suckitude is a self-perpetuating cycle.

Wikipedia and Google distort the web and human knowledge by their existence. They are black holes of verisimilitude. The top 1% of links get linkier even if something in the remaining 99% is better (though I realize it's awkward to use the word "better" devoid of context; at least I let you put your own context on that word).

It's not that I hate either Google or Wikipedia, but they share at least one systemic flaw.

Certainly a fair response to my critique is that a concerted effort by a small group of people to improve the situation may have an eventual effect, but I'm discussing philosophical problems, not solutions, and even so I wear a practical hat. A year of effort to improve the placement of great Perl tutorials in Google still leaves a year's worth of novices reading poor tutorials. (At least with Wikipedia you can sneak in a little truth between requests for deletion.)

Of course this effort is worth doing! Yet I fear that the tyranny of the extant makes this problem more difficult than it seems.

Edit to add: there's no small irony in that the tyranny of the extant applies to some of the Perl 5 core documentation as well. I saw a reference to "You might remember this technique from the ____ utility for VMS!" just the other day.

8 Comments

I'm a web developer, and the least enjoyable aspect of my job is clients complaining about their Google rankings. I can't really blame them: in some cases, a sudden and unexplained drop in rankings has turned profitable sites into money pits. A site owner who follows Google's recommendations to increase the number of AdSense units on his pages suddenly loses 80% of his traffic, and when he asks for help on Google's own support forum, their volunteer support people tell him he's spamming his pages with too many ads.

I don't envy Google their task -- no matter how sophisticated they make their ranking algorithm, people will find ways to game it, and some fields need a different algorithm than others. For instance, when ranking web sites about Julius Caesar, it makes perfect sense to give the oldest ones with the oldest incoming links a boost in the rankings. Those are the most established, about a topic that's not likely to change much. On the other hand, when it comes to pages about a still-developing programming language, the age of a page and its links should probably be a negative attribute. If it hasn't changed or gotten any new links recently, it might be entirely useless.
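
To make that concrete, here's a hypothetical sketch in Perl of a topic-dependent age adjustment. The topics, the weights, and the idea that Google does anything remotely like this are all my own invention; it only illustrates that the sign of an "age bonus" could reasonably differ by subject.

    use strict;
    use warnings;

    # Hypothetical illustration only: nobody outside Google knows how (or
    # whether) freshness is weighted. The topics and weights are invented;
    # the point is just that the sign of an "age bonus" could depend on
    # the topic.
    my %age_weight_per_year = (
        'ancient history'      =>  0.05,   # established pages gain with age
        'programming language' => -0.05,   # stale pages lose with age
    );

    sub adjusted_score {
        my ( $base_score, $topic, $age_in_years ) = @_;
        my $weight = $age_weight_per_year{$topic} // 0;
        return $base_score * ( 1 + $weight * $age_in_years );
    }

    printf "Julius Caesar page, 10 years old: %.2f\n",
        adjusted_score( 1.0, 'ancient history', 10 );
    printf "Perl tutorial page, 10 years old: %.2f\n",
        adjusted_score( 1.0, 'programming language', 10 );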

The success of social media like Twitter has intensified this problem for Google. They want to compete by having the "latest buzz" in their results when it comes to people searching for current topics like news, movies, celebrities, and the like. So they want to give brand-new content a boost for those things, but they don't want established, quality authority pages bumped down the rankings by random blog posts, either.

All this means that there isn't much you can do about search engine rankings. I've spent a lot of hours researching SEO, and my main conclusion is that it's an unpleasant, frustrating, near-hopeless task. Yes, there are some things you can do, at least to make sure your site isn't being *hurt* in the rankings, like making sure your pages have relevant titles and headings, and that your links between pages use relevant link text. But if you have a site that's been around for a while, is already indexed and has some incoming links, doesn't have any serious SEO no-no's, and is being beaten in the rankings by a number of other sites, there's not much you can do about it. More incoming links can help, and you can put a ton of hours into a social media-driven link building effort (though links from social media sites don't have as much value anymore). And then they might change their algorithm because too many sites are using those methods, and all your work is for naught.

If there are out-of-date perl tutorials that consistently top the rankings, the odds of beating them with SEO efforts are probably much worse than the chance of convincing the owner to take down the page, or replace it with links to updated tutorials, or even host an up-to-date tutorial provided by someone else. If a human being can be tracked down who controls that site, that's a better bet than convincing the search engines to stop liking it.

I wonder if it's possible to use a site with already high SEO rankings to overcome this, at least until a longer, more concerted effort takes place.

I'm thinking of something like StackOverflow.com, which gets great rankings in search results. So maybe someone should ask a question like "What are some good online Perl tutorials?" with some extra keywords like "beginner", "learn", etc. in the body of the question.

Then people can answer with some helpful links and hopefully if this follows the trend of most other SO questions I've seen, it will be highly ranked by Google.

That could work, but it's important to link to a single URL which can receive updates as the language improves. Linking to that Perl.com tutorial from 2000 doesn't help anyone; I'd like to avoid making the same mistake for 2022.

The problem with Google's PageRank is that once Google became the search engine used by almost everybody, the feedback loop just broke the algorithm.

It's like a huge, badly tuned Ant Colony Optimization algorithm!

Once the top links for some search are established, they become almost set in stone. It doesn't matter if something better appears, because the "ants" won't bother following anything without the pheromone smell of the top links shown on any Google search. And they'll only leave pheromone (links) on those same top links.
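
Here's a tiny Perl simulation of that pheromone loop (the page names and starting counts are invented): each new visitor links to a page with probability proportional to the links it already has.

    use strict;
    use warnings;

    # Tiny "pheromone" simulation of the feedback loop above (the page
    # names and starting counts are invented): each new visitor links to
    # a page with probability proportional to the links it already has.
    my %links = ( 'old-tutorial' => 10, 'better-new-tutorial' => 1 );

    for my $visitor ( 1 .. 10_000 ) {
        my $total = 0;
        $total += $_ for values %links;

        my $pick = rand $total;
        for my $page ( sort keys %links ) {
            $pick -= $links{$page};
            if ( $pick <= 0 ) {
                $links{$page}++;    # the "ant" reinforces an existing trail
                last;
            }
        }
    }

    printf "%-20s %6d links\n", $_, $links{$_} for sort keys %links;

Run it a few times: the old tutorial keeps the overwhelming majority of the links almost every time, even though nothing in the simulation knows which page is actually better.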

Seems like if the whole Perl community linked to the best stuff from their websites … I think it's technically called a googlebomb, but in this sense it would be more like linking to useful information.

That may help in the short term. If those links point to tutorials which receive updates, we could solve the problem over a longer period. (If not, we'll have to repeat the process in a couple of years.)

(I'll try to keep it shorter this time; sorry for the extended SEO rant earlier.)

The more the Perl community links to the best up-to-date docs and tutorials (and removes links to out-of-date ones), the better Google's results will be. That's how it's supposed to work. Google has put measures into place to try to detect and counter "googlebombs", by sending up a red flag if a page that hasn't had a link in months suddenly gets a hundred of them, for instance. Of course, a page can get "googlebombed" for perfectly valid reasons, like a news story suddenly making an obscure page relevant again, so it's more sophisticated than that. The algorithm tries to look at where the links are coming from, whether they're bunched up from sites that may belong to the same person or company (are they hosted at the same place; do they share domain registration details?), and other signs that the sudden burst of links may not be a legitimate reflection of interest. (No one outside Google knows exactly how all this is calculated, but we know from interviews and testing that it's done.)
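
If you want a picture of the kind of check I mean, here's a Perl sketch that's pure guesswork on my part (the real signals and thresholds aren't public); it only captures the two factors above, namely how sudden the spike is and how many unrelated places it comes from.

    use strict;
    use warnings;

    # Pure guesswork at what a "link burst" check might look like; the
    # real signals and thresholds aren't public. It only captures the two
    # factors mentioned above: how sudden the spike is, and how many
    # unrelated places it comes from.
    sub looks_like_a_bomb {
        my ( $links_this_week, $weekly_average, $distinct_networks ) = @_;

        my $burst_ratio = $links_this_week / ( $weekly_average || 1 );

        # a big spike from a handful of related sites is suspicious;
        # the same spike from many unrelated sites looks organic
        return $burst_ratio > 50 && $distinct_networks < 5;
    }

    print looks_like_a_bomb( 100, 0.5, 2 )  ? "flag\n" : "ok\n";   # flag
    print looks_like_a_bomb( 100, 0.5, 80 ) ? "flag\n" : "ok\n";   # ok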

So I wouldn't worry that a concerted effort by the Perl community to build links to the best sources would have a negative effect. It's unlikely that it would send up "bomb" flags the way an automated spam-linking campaign would. As long as the links are coming from different people on different sites, linking with whatever text makes sense to them, surrounded by different relevant content, it'll be fine. (You wouldn't want everyone to use the same link text within an identical paragraph, though, or to have all new links to page A coming from site B with none from anywhere else.) If it's natural, it'll look natural.

That may sound like it contradicts my rant earlier, but not really. Relevant, organic links are always good. SEO-driven link-building campaigns that use techniques like link wheels and reciprocal link schemes to scatter lots of links around on sites where they can be placed automatically for free (which is what SEO experts will frequently recommend and charge plenty to do for you) are not so good, and likely to become less so as Google continually tunes the algorithm to devalue them. Best to stick to the former.

Hi chromatic!

Thanks for the links, and I agree with your insights.
