|
Perl Practicum: "I'm Beginning to See a Pattern Here"
by Hal Pomeranz
The last two Perl Practicum articles may have strayed a bit
from the true path, so let us return to some Perl basics. Regular
expressions are a Perl fundamental, but many people seem to have
trouble thinking in regular expression mode. This issue will give you
some basic strategies for not becoming overwhelmed in a soup of
funny-looking characters.
The first basic rule is, always take full advantage of naturally
occurring delimiters. We put spaces between words in written (and
spoken) English because it helps us to understand it better -- look
for fixed tokens that help you break up your regular expression
"utterances". For example,
|
/^[-+]?\d+(\.\d+)?([eE][-+]?\d+)?$/
|
is just so much Greek if you try to read it all at once. Use the ()
and [] groupings to break the expression up into four manageable
pieces:
|
[-+]? \d+ (\.\d+)? ([eE][-+]?\d+)?
|
The first one is easy, an optional plus or minus sign, and the second
is trivial, one or more digits. The third says, "a literal period
followed by one or more digits," and the trailing question mark makes
the whole group optional. The fourth (also optional) group is a little
trickier: an upper or lower case `E', followed by an optional plus or
minus, followed by one or more digits. Put it all together and you
match a signed decimal number with an optional fraction and exponent,
but you probably figured this out by now.
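One way to keep those four pieces visible in the code itself is the x
modifier, which lets a pattern contain white space and comments. The
following is only a sketch, assuming a Perl 5 interpreter; the name
is_number() is made up for this example.

```perl
use strict;
use warnings;

# is_number() is a hypothetical name; the pattern is the one discussed
# above, spread out with the x modifier so each piece gets its own line.
sub is_number {
    my ($str) = @_;
    return $str =~ /
        ^
        [-+]?              # optional sign
        \d+                # one or more digits
        (\.\d+)?           # optional fraction
        ([eE][-+]?\d+)?    # optional exponent
        $
    /x ? 1 : 0;
}

for my $s ("42", "-3.14", "6.02e23", "1e", "abc") {
    printf "%-8s %s\n", $s, is_number($s) ? "matches" : "does not match";
}
```

Note that a string like ".5" is rejected, because the pattern insists on
at least one digit before the optional fraction.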
This rule should also be applied when building up regular
expressions. Suppose we wanted to match date strings
|
Fri Jan 28 13:12:02 PST 1994
|
There are six different space separated blobs in that line, but there
are only two fundamental "types" of things to match: words ("Fri",
"Jan", and "PST") and numbers ("28", "1994", and the hours, minutes,
and seconds in the time string). Well, we can just use "\w+" for words
and "\d+" for numbers, and the regular expression just pops out
|
# the expression below is wrong!
/^\w+ \w+ \d+ \d+:\d+:\d+ \w+ \d+$/
|
Actually, this is not quite right. The day of the month and the hour
of the day can both be single digit values, and the leading digit
position will then just be a space. So, we modify our pattern slightly
|
/^\w+ \w+\s+\d+\s+\d+:\d+:\d+ \w+ \d+$/
|
I generally find "\s+" clearer than " +"
(that's space-plus, see what I mean?) in regular expressions, even
though they don't strictly mean the same thing: "\s" also matches tabs
and newlines.
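To see the difference the "\s+" makes, here is a small sketch that tries
the pattern on a date with a two-digit day and on one where the day and
hour are padded with a space. The name is_date_line() and the second
sample date are invented for illustration.

```perl
use strict;
use warnings;

# is_date_line() is a hypothetical helper wrapping the date pattern.
sub is_date_line {
    my ($line) = @_;
    return $line =~ /^\w+ \w+\s+\d+\s+\d+:\d+:\d+ \w+ \d+$/ ? 1 : 0;
}

for my $line ("Fri Jan 28 13:12:02 PST 1994",
              "Tue Feb  1  9:05:00 PST 1994",
              "Fri Jan 28 1994") {
    printf "%-30s %s\n", $line, is_date_line($line) ? "ok" : "no match";
}
```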
The process we used to build up the last example brings us to our
second rule: start simple and increase your complexity and level of
refinement gradually. For example, it was my recent misfortune to have
to parse a file with lines like
|
Pomeranz, Hal (pomeranz) x409
|
Sometimes the white space was literal spaces, sometimes tabs, other
times a mixture of the two, and there tended to be lots of trailing
white space. Sometimes there was no email address, sometimes there was
no extension, and sometimes there was neither.
A first cut might be
|
/^\w+, \w+ \(\w+\) x\d+$/
|
You can clearly see the four blocks corresponding to last name, first
name, email, and phone extension. Note that we have to backwhack the
parentheses around the email address because of their special meaning
in regular expressions. Now we can begin to address special cases.
The email address and phone extension are optional
|
/^\w+, \w+( \(\w+\))?( x\d+)?$/
|
Note that we have incorporated the space before the email address and
phone extension in the optional block along with each of those
fields. Theoretically, the line of data could simply end after the
first name with no additional white space. As a further refinement, we
have to deal with trailing white space, and the case where field
delimiters are not single spaces
|
/^\w+,\s+\w+(\s+\(\w+\))?(\s+x\d+)?\s*$/
|
Actually, last names can look like "Van Der Sluis" or "Cody-Lang", so
we remember Rule #1 (take advantage of naturally occurring delimiters)
and say that the last name is anything before the comma
|
/^.+,\s+\w+(\s+\(\w+\))?(\s+x\d+)?\s*$/
|
All right, we know the above expression accurately matches all the
data we might encounter because we have tested it thoroughly on actual
data (you did test thoroughly, right?). Actually, I really needed this
pattern so that I could extract the last and first names, email
address, and phone extension from the line. So now we have to make
everything we want to extract from the line into a subexpression by
throwing parentheses around the individual fields
|
/^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/
|
As Randal Schwartz is fond of saying, "Perl: checksummed line noise
with a sense of purpose."
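Here is a sketch of what the extraction step might look like in a
script. The name parse_entry() and the second and third sample lines are
made up for this example. Because the optional blocks are themselves
parenthesized, the email address and extension land in $4 and $6, not
$3 and $4.

```perl
use strict;
use warnings;

# parse_entry() is a hypothetical helper. It returns
# (last, first, email, extension), with undef for missing optional
# fields, or an empty list if the line does not match at all.
sub parse_entry {
    my ($line) = @_;
    if ($line =~ /^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/) {
        return ($1, $2, $4, $6);
    }
    return;
}

for my $line ("Pomeranz, Hal (pomeranz) x409",
              "Van Der Sluis, Jan\t\tx123  ",
              "Cody-Lang, Pat") {
    my ($last, $first, $email, $ext) = parse_entry($line);
    printf "%s / %s / %s / %s\n", $last, $first,
        defined $email ? $email : "(none)",
        defined $ext   ? $ext   : "(none)";
}
```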
The third rule is, never use a complex expression when a simple one
will do. For example, one expression to match IP addresses might be
|
/^([12]?\d?\d\.){3}[12]?\d?\d$/
|
but why bother? In most cases either
|
/^\d+\.\d+\.\d+\.\d+$/
/^(\d+\.){3}\d+$/
|
is more than sufficient. The first of these two is probably more
readable, but your mileage may vary. In either case, the person who
has to maintain your code six months from now (who, you should
remember, might just be yourself) will thank you.
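If you really do need to reject out-of-range values, one option is to
keep the pattern simple and do the range check in ordinary code. This is
a sketch of that idea, not the only way to do it; is_ip() is a
hypothetical name.

```perl
use strict;
use warnings;

# is_ip() is a hypothetical helper: the simple pattern checks the
# shape, and a plain numeric comparison checks each octet.
sub is_ip {
    my ($addr) = @_;
    return 0 unless $addr =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/;
    for my $octet ($1, $2, $3, $4) {
        return 0 if $octet > 255;
    }
    return 1;
}

for my $addr ("192.168.1.10", "10.0.0.256", "1.2.3") {
    printf "%-14s %s\n", $addr, is_ip($addr) ? "ok" : "rejected";
}
```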
Rule number four is never forget that Perl pattern matching is greedy:
the `*' and `+' operators will eat as much
as they can as long as the pattern can be satisfied. This can work in
your favor when you are doing something like
|
$_ = "/usr/local/bin/perl";
($dir, $prog) = /^(.*)\/(.*)$/;
|
The first ".*" will eat up everything up to the last
`/', which we force it to match (Rule #1 again) before we
pull off the program name.
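The same split can be wrapped in a small routine. This is a minimal
sketch; split_path() is a name invented for the example.

```perl
use strict;
use warnings;

# split_path() is a hypothetical helper. The greedy first group grabs
# everything up to the LAST slash; the second group gets the rest.
sub split_path {
    my ($path) = @_;
    my ($dir, $prog) = $path =~ /^(.*)\/(.*)$/;
    return ($dir, $prog);
}

my ($dir, $prog) = split_path("/usr/local/bin/perl");
print "dir=$dir prog=$prog\n";
```

If the string contains no slash at all, the match fails and both values
come back undefined.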
This greedy behavior can be a problem as well, particularly when you
are trying to match pairs of delimiters. For example, suppose you
wanted to match the first double quoted field in
|
$_ = 'pomeranz "Hal Pomeranz" "S Clara"';
($name) = /"(.*)"/;    # wrong!
# $name is now: Hal Pomeranz" "S Clara
|
which is not what we wanted. Instead you want
|
($name) = /"([^"]+)"/;
|
which says match a double quote, followed by one or more things that
are NOT a double quote, terminated with another double quote. This
"match everything except my trailing delimiter" concept is a useful
trick for your Perl toolkit.
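The two patterns can be compared side by side. In this sketch the
routine names are hypothetical, and the match is done in list context so
that the captured text, rather than a true/false value, is returned.

```perl
use strict;
use warnings;

# Greedy version: ".*" runs to the LAST double quote on the line.
sub first_quoted_greedy {
    my ($s) = @_;
    my ($field) = $s =~ /"(.*)"/;
    return $field;
}

# Negated character class: stops at the first closing double quote.
sub first_quoted {
    my ($s) = @_;
    my ($field) = $s =~ /"([^"]+)"/;
    return $field;
}

my $line = 'pomeranz "Hal Pomeranz" "S Clara"';
print "greedy:  ", first_quoted_greedy($line), "\n";
print "negated: ", first_quoted($line), "\n";
```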
The fifth and final rule is, be careful about anchoring your patterns
with ^ and $. Err towards using
^ and $, even when they are not strictly
necessary. For example, a common idiom is
|
@files = grep(!/^\.\.?$/, readdir(DIR));
|
which gives you a list of files from directory handle DIR, except for
the "." (dot) and ".." (dot-dot) files. Leaving off the ^
and $ accidentally will throw away all filenames with a
dot in them, and leaving off the $ will throw out all dot
files in the directory. Either way, the result is bound to be
unexpected.
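You can watch all three behaviors without touching a real directory by
running the filters over a fixed list of names. The list and the routine
names below are invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helpers: each takes a list of names and returns the
# ones that survive the filter.
sub keep_anchored   { return grep { !/^\.\.?$/ } @_; }   # the idiom
sub keep_unanchored { return grep { !/\.\.?/ }   @_; }   # no ^ or $
sub keep_no_dollar  { return grep { !/^\.\.?/ }  @_; }   # no $

my @names = (".", "..", ".profile", "notes.txt", "Makefile");

print "anchored:   ", join(" ", keep_anchored(@names)),   "\n";
print "unanchored: ", join(" ", keep_unanchored(@names)), "\n";
print "no dollar:  ", join(" ", keep_no_dollar(@names)),  "\n";
```

Only the anchored version keeps ".profile" and "notes.txt" while
dropping "." and "..".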
Another place where this can bite you is when you are trying to
verify the format of some data. The pattern
|
/\d+/
|
will match valid integers, but it also matches "foo2bar" and other
things which are definitely not numbers. To validate that values are
numbers you have to use
|
/^\d+$/
|
or a more complex expression like the one at the beginning of this
article.
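A quick sketch of the difference, with hypothetical routine names:

```perl
use strict;
use warnings;

# loose_check() and strict_check() are made-up names for this example.
sub loose_check  { my ($v) = @_; return $v =~ /\d+/   ? 1 : 0; }
sub strict_check { my ($v) = @_; return $v =~ /^\d+$/ ? 1 : 0; }

for my $v ("409", "foo2bar", "") {
    printf "%-8s loose=%d strict=%d\n",
        "'$v'", loose_check($v), strict_check($v);
}
```

Keep in mind that $ also matches just before a newline at the end of the
string, so "409\n" passes the anchored test as well. That is usually
what you want when checking lines you have not yet chomped.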
You simply must become comfortable with regular expressions to use
Perl effectively. Always remember to break complex regular expressions
up into manageable pieces before trying to write or understand
them. Always work up from a simple case to greater stages of
refinement and complexity. Never make expressions any more complex
than they have to be or you will never be able to modify them without
breaking something else. Use greedy pattern matching to your advantage
but beware of the dark side. Finally, use ^ and $
freely to avoid unexpected problems.
Reproduced from ;login: Vol. 19 No. 2, April 1994.
|