Perl Practicum: "I'm Beginning to See a Pattern Here"

by Hal Pomeranz

The last two Perl Practicum articles may have strayed a bit from the true path, so let us return to some Perl basics. Regular expressions are a Perl fundamental, but many people seem to have trouble thinking in regular expression mode. This issue will give you some basic strategies for not becoming overwhelmed in a soup of funny looking characters.

The first basic rule is, always take full advantage of naturally occurring delimiters. We put spaces between words in written (and spoken) English because it helps us to understand it better -- look for fixed tokens that help you break up your regular expression "utterances". For example,

        /^[-+]?\d+(\.\d+)?([eE][-+]?\d+)?$/

is just so much Greek if you try to read it all at once. Use the () and [] groupings to break the expression up into four manageable pieces:
        [-+]? \d+ (\.\d+)? ([eE][-+]?\d+)?

The first one is easy, an optional plus or minus sign, and the second is trivial, one or more digits. The third says, "a literal period followed by one or more digits," and the trailing question mark makes the whole group optional. The fourth (also optional) group is a little trickier: an upper or lower case `E', followed by an optional plus or minus, followed by one or more digits. Put it all together and you match an integer or floating point number in ordinary decimal notation, with an optional exponent, but you probably figured this out by now.
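As a rough sketch of the same idea in running code, the /x modifier lets the four pieces sit on separate lines with a comment each. The helper name matches_number() and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the article, spread out with /x so each piece
# gets its own line and its own comment.
my $number = qr/
    ^
    [-+]?               # optional sign
    \d+                 # one or more digits
    (\.\d+)?            # optional fraction: a literal period, then digits
    ([eE][-+]?\d+)?     # optional exponent
    $
/x;

# matches_number() is a hypothetical helper used only in this sketch.
sub matches_number {
    my ($text) = @_;
    return $text =~ $number ? 1 : 0;
}

for my $candidate ('42', '-3.14', '6.02e23', '1.', 'abc') {
    printf "%-8s %s\n", $candidate,
        matches_number($candidate) ? 'matches' : 'does not match';
}
```

Note that "1." is rejected: once the period is present, the third group insists on at least one digit after it.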

This rule should also be applied when building up regular expressions. Suppose we wanted to match date strings

        Fri Jan 28 13:12:02 PST 1994

There are six different space separated blobs in that line, but there are only two fundamental "types" of things to match: words ("Fri", "Jan", and "PST") and numbers ("28", "1994", and the hours, minutes, and seconds in the time string). Well, we can just use "\w+" for words and "\d+" for numbers, and the regular expression just pops out
        # the expression below is wrong!
        /^\w+ \w+ \d+ \d+:\d+:\d+ \w+ \d+$/

Actually, this is not quite right. The day of the month and the hour of the day can both be single digit values, and the leading digit position will then just be a space. So, we modify our pattern slightly
        /^\w+ \w+\s+\d+\s+\d+:\d+:\d+ \w+ \d+$/

I generally find "\s+" clearer than " +" (that's space-plus, see what I mean?) in regular expressions, even though they don't strictly mean the same thing: "\s" matches tabs and newlines as well as spaces.
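A quick way to convince yourself that the refinement matters is to run both patterns over a pair of date strings. This is only a sketch: the second timestamp is an invented example of a space-padded day and hour, and the variable names are arbitrary.

```perl
use strict;
use warnings;

# Illustrative inputs; the second has a single-digit day and hour,
# each padded with a leading space.
my @stamps = (
    'Fri Jan 28 13:12:02 PST 1994',
    'Tue Feb  1  9:05:00 PST 1994',
);

my $strict  = qr/^\w+ \w+ \d+ \d+:\d+:\d+ \w+ \d+$/;      # single spaces only
my $relaxed = qr/^\w+ \w+\s+\d+\s+\d+:\d+:\d+ \w+ \d+$/;  # \s+ where padding can occur

for my $stamp (@stamps) {
    print "$stamp\n";
    print "  single spaces: ", ($stamp =~ $strict  ? 'match' : 'no match'), "\n";
    print "  with \\s+:      ", ($stamp =~ $relaxed ? 'match' : 'no match'), "\n";
}
```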

The process we used to build up the last example brings us to our second rule: start simple and increase your complexity and level of refinement gradually. For example, it was my recent misfortune to have to parse a file with lines like

        Pomeranz, Hal	(pomeranz) 	 x409

Sometimes the white space was literal spaces, sometimes tabs, other times a mixture of the two, and there tended to be lots of trailing white space. Sometimes there was no email address, sometimes there was no extension, and sometimes there was neither.

A first cut might be

        /^\w+, \w+ \(\w+\) x\d+$/

You can clearly see the four blocks corresponding to last name, first name, email, and phone extension. Note that we have to backwhack the parentheses around the email address because of their special meaning in regular expressions. Now we can begin to address special cases.

The email address and phone extension are optional

        /^\w+, \w+( \(\w+\))?( x\d+)?$/

Note that we have incorporated the space before the email address and phone extension in the optional block along with each of those fields. Theoretically, the line of data could simply end after the first name with no additional white space. As a further refinement, we have to deal with trailing white space, and the case where field delimiters are not single spaces
        /^\w+,\s+\w+(\s+\(\w+\))?(\s+x\d+)?\s*$/

Actually, last names can look like "Van Der Sluis" or "Cody-Lang", so we remember Rule #1 (take advantage of naturally occurring delimiters) and say that the last name is anything before the comma
        /^.+,\s+\w+(\s+\(\w+\))?(\s+x\d+)?\s*$/

All right, we know the above expression accurately matches all the data we might encounter because we have tested it thoroughly on actual data (you did test thoroughly, right?). Actually, I really needed this pattern so that I could extract the last and first names, email address, and phone extension from the line. So now we have to make everything we want to extract from the line into a subexpression by throwing parentheses around the individual fields
        /^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/

As Randal Schwartz is fond of saying, "Perl: checksummed line noise with a sense of purpose."
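Here is a minimal sketch of how the extraction might look in practice, assuming balanced parentheses so that the email address lands in $4 and the extension in $6 (the optional blocks that wrap them are $3 and $5). The helper name parse_entry() and every sample line except the first are invented for illustration.

```perl
use strict;
use warnings;

# parse_entry() is a hypothetical helper written for this sketch.
sub parse_entry {
    my ($line) = @_;
    # $1 = last name, $2 = first name, $4 = email, $6 = extension.
    # $3 and $5 are the optional blocks wrapping the email and extension.
    if ($line =~ /^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/) {
        return ($1, $2, $4, $6);
    }
    return;    # empty list when the line does not match
}

my @lines = (
    "Pomeranz, Hal\t(pomeranz) \t x409  ",
    'Van Der Sluis, Pat',
    'Cody-Lang, Chris x123',
);

for my $line (@lines) {
    my ($last, $first, $email, $ext) = parse_entry($line);
    printf "%s | %s | %s | %s\n", $last, $first,
        defined $email ? $email : '(none)',
        defined $ext   ? $ext   : '(none)';
}
```

Fields that were absent from the line come back undefined, so test them with defined before using them.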

The third rule is, never use a complex expression when a simple one will do. For example, one expression to match IP addresses might be

        /^([12]?\d?\d\.){3}[12]?\d?\d$/

but why bother? In most cases either
        /^\d+\.\d+\.\d+\.\d+$/

or
        /^(\d+\.){3}\d+$/

is more than sufficient. The first of these two is probably more readable, but your mileage may vary. In either case, the person who has to maintain your code six months from now (who, you should remember, might just be yourself) will thank you.
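If you do need to reject values such as 999.1.1.1, one option (an assumption of this sketch, not something the patterns above do) is to keep the simple pattern and check the range in ordinary code instead of in the regular expression. The helper name looks_like_ip() is made up.

```perl
use strict;
use warnings;

# looks_like_ip() is a hypothetical helper: the simple pattern checks
# the shape, and plain arithmetic checks the range of each field.
sub looks_like_ip {
    my ($text) = @_;
    return 0 unless $text =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/;
    for my $octet ($1, $2, $3, $4) {
        return 0 if $octet > 255;
    }
    return 1;
}

for my $candidate ('192.168.1.10', '999.1.1.1', '1.2.3') {
    printf "%-14s %s\n", $candidate,
        looks_like_ip($candidate) ? 'ok' : 'rejected';
}
```

The pattern stays readable, and the numeric test says exactly what it means.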

Rule number four is never forget that Perl pattern matching is greedy: the `*' and `+' operators will eat as much as they can as long as the pattern can be satisfied. This can work in your favor when you are doing something like

        $_ = "/usr/local/bin/perl";
        ($dir, $prog) = /^(.*)\/(.*)$/;

The first ".*" will eat up everything up to the last `/', which we force the pattern to match (Rule #1 again) before we pull off the program name.
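The same split can be sketched as a small function. The name split_path() is invented for this example; the only thing it relies on is that a match in list context returns its captured groups.

```perl
use strict;
use warnings;

# split_path() is a hypothetical helper for this sketch.
sub split_path {
    my ($path) = @_;
    # The greedy first group runs to the last slash; the second
    # group takes whatever is left.
    my ($dir, $prog) = $path =~ /^(.*)\/(.*)$/;
    return ($dir, $prog);
}

my ($dir, $prog) = split_path('/usr/local/bin/perl');
print "dir=$dir prog=$prog\n";    # dir=/usr/local/bin prog=perl
```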

This greedy behavior can be a problem as well, particularly when you are trying to match pairs of delimiters. For example, suppose you wanted to match the first double quoted field in

        $_ = 'pomeranz "Hal Pomeranz" "S Clara"';

The expression
        ($name) = /"(.*)"/;	# wrong!

will set $name equal to
        Hal Pomeranz" "S Clara

which is not what we wanted. Instead you want
        ($name) = /"([^"]+)"/;

which says match a double quote, followed by one or more things that are NOT a double quote, terminated with another double quote. This "match everything except my trailing delimiter" concept is a useful trick for your Perl toolkit.
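Running the two patterns side by side makes the difference plain. This is a small sketch with arbitrary variable names; the third match uses Perl 5's non-greedy quantifier "*?", which this article does not cover, purely for comparison.

```perl
use strict;
use warnings;

my $record = 'pomeranz "Hal Pomeranz" "S Clara"';

# Greedy: .* runs to the LAST double quote on the line.
my ($greedy) = $record =~ /"(.*)"/;

# Negated character class: stop at the first closing quote.
my ($careful) = $record =~ /"([^"]+)"/;

# Not from the article: Perl 5's non-greedy .*? gives the same answer here.
my ($lazy) = $record =~ /"(.*?)"/;

print "greedy:  $greedy\n";     # Hal Pomeranz" "S Clara
print "careful: $careful\n";    # Hal Pomeranz
print "lazy:    $lazy\n";       # Hal Pomeranz
```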

The fifth and final rule is, be careful about anchoring your patterns with ^ and $. Err towards using ^ and $, even when they are not strictly necessary. For example, a common idiom is

        @files = grep(!/^\.\.?$/, readdir(DIR));

which gives you a list of files from directory handle DIR, except for the "." (dot) and ".." (dot-dot) files. Leaving off the ^ and $ accidentally will throw away all filenames with a dot in them, and leaving off the $ will throw out all dot files in the directory. Either way, the result is bound to be unexpected.
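To see all three behaviors without touching a real directory, here is a sketch that runs the filter over a made-up list of names standing in for the output of readdir.

```perl
use strict;
use warnings;

# Invented names standing in for what readdir(DIR) might return.
my @names = ('.', '..', '.profile', 'notes.txt', 'Makefile');

my @anchored   = grep { !/^\.\.?$/ } @names;   # drops only . and ..
my @no_anchors = grep { !/\.\.?/ }   @names;   # drops every name containing a dot
my @no_dollar  = grep { !/^\.\.?/ }  @names;   # drops every name starting with a dot

print "anchored:   @anchored\n";      # .profile notes.txt Makefile
print "no anchors: @no_anchors\n";    # Makefile
print "no dollar:  @no_dollar\n";     # notes.txt Makefile
```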

Another place where this can bite you is when you are trying to verify the format of some data. The pattern

        /\d+/

will match valid integers, but it also matches "foo2bar" and other things which are definitely not numbers. To validate that values are numbers you have to use
        /^\d+$/

or a more complex expression like the one at the beginning of this article.

You simply must become comfortable with regular expressions to use Perl effectively. Always remember to break complex regular expressions up into manageable pieces before trying to write or understand them. Always work up from a simple case to greater stages of refinement and complexity. Never make expressions any more complex than they have to be or you will never be able to modify them without breaking something else. Use greedy pattern matching to your advantage but beware of the dark side. Finally, use ^ and $ freely to avoid unexpected problems.

Reproduced from ;login: Vol. 19 No. 2, April 1994.

