Saturday, May 05, 2007

Calling All Command Lines

When programming in Perl, I make extensive use of the command line. For example, to get the current directory, I use qx[pwd]. (Don't forget to chomp() the result.)
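Here is a minimal sketch of that first example. It shows why the chomp() matters: qx returns the command's output with its trailing newline still attached. The variable name $cwd is only for illustration.

```perl
use strict;
use warnings;

# qx[] hands the command to the shell and returns everything it printed,
# trailing newline included.
my $cwd = qx[pwd];

# chomp() strips that trailing newline so the value is usable as a path.
chomp($cwd);

print "Current directory: $cwd\n";
```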

A more significant example: I use command-line curl to do all my web I/O, as in qx[curl --silent '$url'].

You might ask, why use qx[curl ...] rather than libcurl? For two good reasons. First, curl is usually installed, and libcurl is not. ("So type cpan install libcurl. What's the big deal?" The big deal is that many people have never run cpan, and the first-time experience of running cpan is so bad that it's a non-starter.)

The second and more important reason is that the UI for curl is well understood, simple, and well documented, while the UI for libcurl is different, complex, and well documented. Yes, it's well documented, but it's different and complex. Why should anybody who knows the curl UI need to start over learning the libcurl UI (also called the API)? The one possible reason is performance, and you'll be hard pressed to convince me that there will be any noticeable difference.

Thus, I use

my $html = qx[curl --silent '$url'];

to pull HTML pages.

Almost. One problem using the command line: you need to quote properly. What if $url contains certain characters that cause problems? After all, qx feeds the command to the shell for processing. And shell processing is complicated. But if you read the bash man page, you'll see that it's not so complicated.
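To make the problem concrete, here is a small sketch of what goes wrong when a value is merely wrapped in single quotes. The hostile value is hypothetical, and printf stands in for curl so that nothing touches the network.

```perl
use strict;
use warnings;

# Hypothetical hostile value: it closes the quote, runs its own command,
# then reopens a quote so the command line still parses.
my $value = "a'; echo INJECTED; echo 'b";

# Naive quoting: wrap the interpolated value in single quotes and hope.
my $out = qx[printf '%s' '$value'];

# The shell ran three commands instead of one, so the output contains
# the word INJECTED rather than the original value.
print $out;
```

The shell sees printf '%s' 'a'; echo INJECTED; echo 'b', which is three separate commands.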

In fact, bash treats apostrophes (single quotes) in a very simple fashion. The only characters that need special handling inside single quotes are the single quote itself and NUL (a byte value of 0), which cannot be passed in a command line argument at all. Not even the backslash has special meaning.

Thus, to properly quote command line parameters inside single quotes, all we have to do is remove NULs (they don't belong there anyway) and properly quote the single quotes. Removing NULs is easy, but how do we quote single quotes? You can't turn them into \' because even backslash isn't special inside single quotes.

The answer is to turn single quotes into this sequence: '\''. The first single quote closes the preceding single quote that put the shell in this mode in the first place. With single quote quoting terminated, anything goes, including the special meaning of backslash, so we can add a single quote with \'. The final single quote then starts quoting again.

If you think that explanation was difficult, here's the regular expression substitution that does the quoting: s/\'/\'\\\'\'/sg. I love Perl!
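To see the substitution at work on a concrete string, here is a small sketch. The sample text and the variable names are mine, and printf stands in for whatever command you would really run.

```perl
use strict;
use warnings;

my $text = "it's";    # the value we want the shell to receive

# Replace each ' with the four-character sequence '\''
# (the unescaped form of the substitution above).
(my $escaped = $text) =~ s/'/'\\''/g;

# Wrap the result in single quotes to form one shell word.
my $word = "'$escaped'";
print "$word\n";      # prints 'it'\''s'

# Hand the word to the shell and read back what the shell actually saw.
my $seen = qx[printf '%s' $word];
print "$seen\n";      # prints it's
```

The shell reads 'it' as a quoted string, \' as a literal apostrophe, and 's' as another quoted string, then joins the three pieces into the single word it's.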

The function cq() does the quoting:

sub cq
{
    my($s) = @_;
    $s =~ s/\x00//sg;          # NULs cannot be passed on a command line
    $s =~ s/\'/\'\\\'\'/sg;    # each ' becomes '\''
    return $s;
}

It's used like this:

my $html = qx[curl --silent '@{[cq($url)]}'];

Did I say how much I love Perl?

By the way, am I sure that bash treats single quotes the way I have described? Yes. Here's the code that proves it:

for (0..255)
{
    my $e = ord(qx[echo -n '@{[cq(chr($_))]}']);
    print "$_\n" if $e != $_;    # report any byte that did not survive
}

This code checks that all characters run the gauntlet and come out the other side unchanged. (Except for NUL, which is removed.)
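The loop tests one byte at a time. As a further check, here is a sketch that sends a whole string of troublemakers through at once. The test string is made up, and I use printf '%s' here instead of echo -n, since some shells' echo interprets backslashes; that swap is my assumption, not part of the test above.

```perl
use strict;
use warnings;

# Same quoting function as above.
sub cq
{
    my($s) = @_;
    $s =~ s/\x00//sg;
    $s =~ s/\'/\'\\\'\'/sg;
    return $s;
}

# A hypothetical worst-case string: quotes, backslash, dollar sign,
# backticks, semicolon, glob characters, and an embedded newline.
my $nasty = qq{it's "quoted" \\ \$HOME `id` ; * ? &\n| done};

# Round trip through the shell; printf '%s' echoes its argument verbatim.
my $out = qx[printf '%s' '@{[cq($nasty)]}'];

print $out eq $nasty ? "round trip ok\n" : "MISMATCH\n";
```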