The Unix Way and Regular Expressions as a Last Resort

I dislike administering systems. If all I ever had to do were to type apt-get update and have all of my system administration done for me, that would be fine. Unfortunately, I have to administer systems now and then.

Fortunately, the free software world has a lot of people in the same situation, and a lot of smart people have written useful software to manage their systems. As a case in point, consider fail2ban, which I'd have had to invent if it didn't already exist. fail2ban watches log files for suspicious patterns and sends traffic from the offending IP addresses to a blackhole. For example, if some malicious remote machine in a botnet comes knocking at your SSH server with a dictionary full of usernames, fail2ban will let the kernel silently drop all network traffic from that machine for an hour after the third failed login.

That's all configurable. In fact, you can configure all of the existing rules and add new rules yourself.

I did that the other day on a client's server. Somehow, the Internet at large had decided that a web-based database administration tool called phpMyAdmin was running on the server. That meant thousands of attempts to find dozens of versions of phpMyAdmin. (I assure you, there is no PHP running on that machine. phpMyAdmin has security holes? Who would have guessed?) That meant a lot of wasted resources and a lot of useless entries in the log files. (We hadn't yet gotten around to monitoring log files for reporting, so it was worse than it should have been.)

"Self," I told myself. "You should add a fail2ban rule to detect phpMyAdmin scans and drop that traffic."

I did. It was more difficult than it should have been.

fail2ban uses regular expressions to find individual entries in log files which represent suspicious access patterns. One line in a log file represents one event. This is the Unix way. This has been the Unix way for 40 years. It's been the Unix way for 40 years for one reason: it works pretty well, for the most part. (I like Unix, but I see its flaws sometimes.)

The web application I intended to secure has an administrative interface available from /admin. This makes sense. One of the places you can install phpMyAdmin is also /admin. This also makes a certain amount of sense.

The routing system in the client's web application redirects all requests under the Admin controller (the code counterpart to /admin) to a catchall action so as not to expose internal details of what is and isn't available with or without specific authentication credentials. This makes sense when I think about it one way and doesn't necessarily make sense another way. (It's not entirely what someone might call RESTful and it's almost certainly a violation of the HATEOAS concordat. Then again, it's an administrative interface hidden from the Internet at large behind authentication credentials.)

The first version of my regular expression looked for all attempts to access /admin, /phpmyadmin, /PhpMyAdmin, et al. which resulted in a redirection.

Of course, /admin also redirects real users with real web browsers to /admin/login to give them a chance to use a login mechanism that's not nearly as hateful as the basic authentication dialog that's been largely unchanged in web browsers since 1994. (You remember 1994. That's before PHP existed and before Windows machines were on the Internet in such droves that it made sense to gather a huge botnet of poorly secured Windows machines to search for phpMyAdmin vulnerabilities. Also you could have bought AAPL at a deep discount compared to now.)

Unfortunately, my first regular expression matched users going to /admin and getting redirected to /admin/login just as well as it matched bots going to /phpMyAdmin and getting redirected to an error page.
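Here's a rough sketch of that ambiguity in Python. It is not the client's actual filter and it is not fail2ban's filter syntax; the log lines, the addresses, and the pattern are all invented for illustration. The point is only that a pattern keyed on "admin-ish path plus redirect" can't tell a bot from a person.

```python
import re

# Hypothetical Apache-style access log lines (invented for this example).
LOG_LINES = [
    '203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] "GET /phpMyAdmin/index.php HTTP/1.1" 302 215',
    '198.51.100.4 - - [10/Oct/2012:13:56:02 -0700] "GET /admin HTTP/1.1" 302 215',
]

# A naive pattern: any redirected GET for /admin or a phpMyAdmin-like path.
NAIVE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"GET /(?:admin|phpmyadmin)\S* HTTP/[\d.]+" 30[12] ',
    re.IGNORECASE,
)


def flagged_hosts(lines, pattern=NAIVE):
    """Return the remote hosts of every log line the pattern matches."""
    hosts = []
    for line in lines:
        match = pattern.match(line)
        if match:
            hosts.append(match.group("host"))
    return hosts


# Both the scanner and the legitimate user end up on the ban list.
print(flagged_hosts(LOG_LINES))
```

The second line is a real person on the way to /admin/login, and the log line gives the pattern nothing to distinguish that from a scan.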

I changed the regular expression. We could also have made /admin display a login form to an unauthorized user. We could have done a lot of things. I changed the regular expression.

The next day, I realized the problem was that the standard Unix mechanism of logging plain text in a well-understood format and parsing it with regular expressions (or even a grammar) threw away information and tried to reconstruct it badly. At the point in the web application where the router receives a remote request and redirects it, the router knows exactly why it is redirecting the request. It knows that /phpMyAdmin is an invalid route. It knows that an authenticated user requesting /admin should get redirected to the administrative dashboard. It knows that an unauthenticated user requesting /admin should get redirected to /admin/login.
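To make that concrete, here's a minimal sketch of the decision in Python. This is an assumption about the shape of the logic, not the client's router: the names Reason, KNOWN_PREFIXES, and redirect_for, and the /error and /admin/dashboard targets, are all hypothetical.

```python
from enum import Enum


class Reason(Enum):
    """Why the router redirected a request (hypothetical categories)."""
    UNKNOWN_ROUTE = "unknown route"
    NEEDS_LOGIN = "unauthenticated admin request"
    ALREADY_LOGGED_IN = "authenticated admin request"


# Hypothetical route table: /admin is the only known prefix.
KNOWN_PREFIXES = ("/admin",)


def redirect_for(path, authenticated):
    """Return (target, reason) for a request the router will redirect.

    This models only the three redirect cases described in the text; it
    does not model serving /admin/login itself.
    """
    known = any(path == p or path.startswith(p + "/") for p in KNOWN_PREFIXES)
    if not known:
        return "/error", Reason.UNKNOWN_ROUTE
    if authenticated:
        return "/admin/dashboard", Reason.ALREADY_LOGGED_IN
    return "/admin/login", Reason.NEEDS_LOGIN
```

All three cases produce a redirect, but the router holds a distinct reason for each one at the moment it decides.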

Unfortunately, none of that reasoning gets into the Apache httpd-style log file. It gets a datestamp, an IP address, the URL request path, and an HTTP status response code. From there, fail2ban and the regular expression guess at why that log entry is there.

Guessing what semi-structured data means is unreliable.

Fortunately, fail2ban is a good Unix program and is flexible about which log file it scans. I could add another log file to the web application to write entries only when something makes a request for a path that's completely unknown; if there's no controller mapped to the request path prefix /phpmyadmin, write to the log. That's only slightly more difficult to create and to configure than it is to explain. You probably already know how to do it.
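In case you don't, here's a minimal sketch using Python's standard logging module. The logger name, the function names, and the "unknown-route host=... path=..." line format are my own assumptions for illustration; the real application would call something like log_unknown_route from wherever its router gives up on a path.

```python
import logging


def make_unknown_route_logger(logfile):
    """Build a logger that writes one line per unknown-route request."""
    logger = logging.getLogger("unknown_routes")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these entries out of the main app log
    handler = logging.FileHandler(logfile)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_unknown_route(logger, remote_addr, path):
    """Record a request for a path with no controller mapped to it."""
    logger.info("unknown-route host=%s path=%s", remote_addr, path)
```

Every line in that file means the same thing, so whatever pattern reads it only has to pull out the host, not guess at intent.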

Unfortunately, writing a separate log file only works around the problem. I still have to write a regular expression to parse lines in that log file so that fail2ban will handle them appropriately. That's the Unix philosophy at work. It works pretty well and it's worked pretty well for decades. Sure, there are ambiguities, but you can work around them pretty well too.

Sometimes, though, I tell myself what I think I want is the ability to send structured data as events to a centralized event listener system to which other processes can connect as listeners. I know there are things like systemd and D-Bus from freedesktop.org, but I rewrote the regex because "pretty well" gets the job done now and I don't expect this system to last 40 years.

(In fact, that sums up Unix pretty well too.)
