|
Perl Practicum: It Slices, It Dices...
by Hal Pomeranz
Splitting Data
A common task that seems to have been generating a lot of questions on
comp.lang.perl recently is how to split input records of data
in order to extract the data fields. The first impulse is to head
straight for the split() routine, but "there is always
more than one way to do it," and split() is not always
your best choice.
For example, split() does not deal gracefully with data
in fixed-width fields. Sometimes you can split() on
whitespace, but suppose one or more of the fields contain whitespace
(perhaps a "full name" field) or suppose you would like to preserve
the alignment of the data? You could use substr(), but
that only allows you to pull out one field at a time and you typically
have to remove any trailing spaces yourself. Consider using
unpack() when faced with fixed-width data: it gracefully
solves all of these problems.
Fixed-Width Data
As an example, here is an "ls -n" equivalent (same as
"ls -lg" but with numeric user and group IDs instead of
names) that uses pack() and unpack() to
manipulate the output of a BSD-style ls:
|
$template = "a14 A9 A9 a*";
open(LS, "ls -lg |") || die "Can't ls!\n";
while (<LS>)
{
    # mode bits and link count, owner name, group name, rest of the line
    ($first, $uid, $gid, $last) = unpack($template, $_);
    # look up each name and cache its numeric ID
    $uids{$uid} = (getpwnam($uid))[2] unless ($uids{$uid});
    $gids{$gid} = (getgrnam($gid))[2] unless ($gids{$gid});
    print pack($template, $first, $uids{$uid}, $gids{$gid}, $last);
}
|
There is actually a subtle bug in the above program. A completely
pointless prize will be awarded to the first person who correctly
identifies the bug to me.
The first argument to unpack() is a template describing
each field and how wide the field is. Whitespace in the template is
for readability only; it is strictly ignored by
unpack(). The first "a14" in the template means the first
field is a string of ASCII which is 14 characters long (in the ls
output, this pulls off the mode bits and link count information). This
is followed by two ASCII strings which are 9 characters long (the
owner and group of the file), but the upper-case "A" also means strip
off any trailing whitespace (so we can feed the result to the
appropriate get*nam() function). The final "a*" means
just pull off everything else on the line into the last field.
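To see the template at work without running ls, here is a minimal sketch that unpacks one made-up record. The sample line and the variable names are assumptions for illustration, and the code is written for Perl 5 (it uses my and strict).

```perl
use strict;
use warnings;

# Made-up record laid out like the ls output above: 14 columns of mode
# bits and link count, two 9-column names, then the rest of the line.
my $line = sprintf("%-14s%-9s%-9s%s",
                   "-rw-r--r--  1", "hal", "staff",
                   "1024 Apr 20 20:39 notes.txt");

my ($first, $owner, $group, $rest) = unpack("a14 A9 A9 a*", $line);

# "a" returns the field as is, "A" strips the trailing padding.
print "[$first] [$owner] [$group] [$rest]\n";
```

The brackets in the output make the difference visible: the first field keeps its trailing space, while the two name fields come back trimmed.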
Notice that we can use the same template when we put the line back
together with pack(). Perl's interpretation of "a" and
"A" in pack() templates has been specifically designed to
make this possible. The numeric value after each operator in the
template gives the field width: "a" pads the field with nulls, and "A"
pads with spaces. A "*" instead of a number means make the field
exactly as long as the data supplied.
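The padding rules can be checked directly. This is a small sketch with made-up values, assuming Perl 5; it also shows that a value too long for its field is truncated, and that a record survives a trip through unpack() and pack() with the same template.

```perl
use strict;
use warnings;

my $nulls  = pack("a5", "hal");        # "hal" plus two null bytes
my $spaces = pack("A5", "hal");        # "hal" plus two spaces
my $exact  = pack("a*", "hal");        # just "hal"
my $cut    = pack("A3", "pomeranz");   # too long for the field: "pom"

# Round trip: unpack a made-up record and pack it again with the
# same template.
my $template = "a14 A9 A9 a*";
my $line     = sprintf("%-14s%-9s%-9s%s",
                       "-rw-r--r--  1", "hal", "staff", "1024 notes.txt");
my $rebuilt  = pack($template, unpack($template, $line));

print $rebuilt eq $line ? "round trip ok\n" : "round trip differs\n";
```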
Irregular Data...
The split() function may also not be your best choice if
your fields are very irregular. For example, the
previous Perl Practicum showed regular expressions to
match fields in the following data record:
|
Pomeranz, Hal (pomeranz) x409
|
Recall that the last two fields are optional, that the whitespace was
sometimes literal spaces, sometimes tabs, and sometimes a mixture of
the two, and that there tended to be trailing whitespace. I needed to
lose the comma, the "x" before the phone extension, and the
parentheses around the email address.
I could have used split() to pull the line apart (the
example below also illustrates that the first argument to
split() is a fully-fledged regular expression)
|
@fields = split(/[\s,()]+/);
|
though I still would have had the leading "x" in the extension field
(it cannot go in the list of delimiters since the other fields might
contain an "x"). When split() is called with a LIMIT argument (by
default it strips trailing null fields), I could also get a null field
at the end of the list unless I first eliminated the trailing
whitespace with
|
s/\s+$//;
|
Furthermore, what happens when split() returns a
list of only three values: is the last value an email address or a
phone extension? One could examine the field to see if it
matches /x\d{3}/, but it would be nice to be able to say
|
($last, $first, $email, $ext) = some_expression
|
and have $email or $ext be null if there is
no such information on the line.
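The ambiguity is easy to reproduce. In this sketch (Perl 5, with two made-up variants of the record above) both calls return three values, and only the content of the last value hints at what it is:

```perl
use strict;
use warnings;

my @with_email = split(/[\s,()]+/, "Pomeranz, Hal (pomeranz)");
my @with_ext   = split(/[\s,()]+/, "Pomeranz, Hal x409");

# Both lists hold three values; only the content of the last one
# reveals whether it is an email address or an extension.
for my $fields (\@with_email, \@with_ext) {
    my $kind = $fields->[2] =~ /^x\d{3}$/ ? "extension" : "email";
    print scalar(@$fields), " fields, last one looks like an $kind\n";
}
```

Note that the guess is only a guess: a login name such as "x123" would be taken for an extension.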
...And the Pattern Match
The pattern match operator, when in a list context, returns a list
containing the values matching "sub-expressions" in the pattern. A
sub-expression is anything in the pattern enclosed by parentheses;
sub-expressions are returned in the order determined by the opening
(left) parenthesis of each expression, reading from left to right. For
example, consider the following expression to extract the time from an
ASCII date string:
|
$_ = "Wed Apr 20 20:39:34 PDT 1994";
@fields = / ((\d+):(\d+):(\d+)) /;
|
There are four sub-expressions in the above pattern match--one
sub-expression enclosing three others. The opening parenthesis for the
larger sub-expression is left-most, followed by the three smaller
expressions in order. So, $fields[0] will be "20:39:34"
and the next three elements of the list will be set to "20", "39",
and "34".
This behavior in a list context makes pattern match a very
flexible split operator. It is worth mentioning here that if
you assign a pattern match expression to a list, then Perl 4
does not set the special $1, $2,..., $9 variables (Perl 5 does set them).
Taking the regular expression developed in the previous
Perl Practicum and assigning it to a list yields
|
($last,$first,$junk1,$email,$junk2,$ext) = /^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/;
|
The junk fields are necessary because we had to enclose the optional
expressions (which match the email and extension fields) in
parentheses. Larry Wall is working on a regular expression grouping
operator which will not generate sub-expressions, but we will
probably have to wait for a later release of Perl5.
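Perl 5 does provide such a grouping operator, written (?:...). The following is a minimal sketch assuming a Perl 5 interpreter; the parse_record name and the sample records are made up for illustration. With the optional parts grouped this way, the junk fields disappear and a missing email or extension comes back undefined.

```perl
use strict;
use warnings;

my @records = (
    "Pomeranz, Hal (pomeranz) x409",
    "Pomeranz, Hal\t(pomeranz)  ",
    "Pomeranz, Hal x409",
    "Pomeranz, Hal",
);

# Hypothetical helper: returns (last, first, email, ext); email and ext
# are undef when the record does not carry them.
sub parse_record {
    my ($record) = @_;
    my ($last, $first, $email, $ext) =
        $record =~ /^(.+),\s+(\w+)(?:\s+\((\w+)\))?(?:\s+x(\d+))?\s*$/;
    return ($last, $first, $email, $ext);
}

for my $record (@records) {
    my ($last, $first, $email, $ext) = parse_record($record);
    printf "%s %s email=%s ext=%s\n", $first, $last,
        defined $email ? $email : "(none)",
        defined $ext   ? $ext   : "(none)";
}
```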
Quotes
A very difficult splitting problem is the breaking up of records that
have quoted fields which enclose the delimiter character(s). Suppose
I had records like this:
|
"Pomeranz, Hal", Support, "Saratoga, CA, USA"
|
with some fields quoted and some not. While Perl regular expressions
are not regular expressions in the strict mathematical sense, they
still cannot be used to solve the general problem of matching opening and
closing delimiters (like parentheses or braces), particularly if the
delimiter is a multi-character string or if you have nested
delimiters. A one-line expression to match C-style comments has
become the holy grail of comp.lang.perl and is believed to be
in the class of problems that includes trisecting an angle with only a
compass and straight-edge.
One option is to simply use split() to break each record
up (using the chosen delimiter) and then reconstruct quoted fields
after the fact. Of course, you would have to preserve the delimiters
if you take this approach. Luckily, split() allows you
to do this by using parentheses in the first argument to create a
sub-expression:
|
@list = split(/(,\s+)/);
|
Assuming the data line above is in $_,
$list[0] will be `"Pomeranz', $list[1] will
be `, ' (the preserved delimiter), $list[2] will be `Hal"', etc.
Examine the elements of @list for leading and trailing double
quotes to reassemble the fields.
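One way to do the reassembly is sketched below. It is an illustration written for Perl 5, not code from the original column; the split_quoted name is made up, and the quotes are left on the returned fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split on comma-plus-whitespace, keeping the
# delimiters, then glue back together any pieces that belong to one
# double-quoted field.
sub split_quoted {
    my ($record) = @_;
    my @list = split(/(,\s+)/, $record);
    my @fields;
    my $pending;                          # quoted field still being built
    while (@list) {
        my $piece = shift @list;
        my $delim = shift @list;          # undef after the last piece
        if (!defined $pending && $piece !~ /^"/) {
            push @fields, $piece;         # ordinary unquoted field
            next;
        }
        $pending = defined $pending ? $pending . $piece : $piece;
        if ($pending =~ /^".*"$/s) {      # closing quote reached
            push @fields, $pending;
            undef $pending;
        } elsif (defined $delim) {
            $pending .= $delim;           # delimiter was inside the quotes
        }
    }
    push @fields, $pending if defined $pending;   # unterminated quote
    return @fields;
}

my @fields = split_quoted('"Pomeranz, Hal", Support, "Saratoga, CA, USA"');
print "$_\n" for @fields;
```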
Another approach would be to try and create an expression that matches
the individual fields:
|
@fields = /("[^"]+"|[^,]+),\s+("[^"]+"|[^,]+),\s+("[^"]+"|[^,]+)/;
|
This expression requires each field to be either a double quoted
string (a double quote, followed by one or more non-quote characters,
followed by a double quote) or one or more non-comma characters. This
tactic will work as long as the records contain no nested
quotes. However, both the split() tactic and the regular
expression above will fail on records like
|
This ", would be" nasty
|
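The difference between the two records can be checked with a short sketch. It assumes Perl 5 (for qr//) and builds the three-field expression from a single per-field pattern, with each comma followed directly by \s+; the variable names are made up.

```perl
use strict;
use warnings;

my $field = qr/("[^"]+"|[^,]+)/;   # quoted string, or a run of non-commas
my $three = qr/$field,\s+$field,\s+$field/;

my @good  = '"Pomeranz, Hal", Support, "Saratoga, CA, USA"' =~ $three;
my @nasty = 'This ", would be" nasty' =~ $three;

print scalar(@good),  " fields from the well-formed record\n";
print scalar(@nasty), " fields from the nasty record\n";
```

The second match fails outright, so the list comes back empty: the only comma in that record sits inside the quotes, and the expression has no way to know it is not a delimiter.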
The general solution to this problem requires a small function. The
Perl distribution includes shellwords.pl, which contains a
function to parse lines of space-delimited, optionally quoted fields.
I have written a modified version of this library,
quotewords.pl, which accepts any regular expression as a
delimiter. You can obtain quotewords.pl from one of the Perl
archives, or directly from me via email.
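To give an idea of what such a function has to do, here is a minimal sketch of a quote-aware splitter. It is not quotewords.pl and does not reproduce its interface: the split_outside_quotes name, its argument order, and the decision to drop the quote characters are all assumptions made for this example, and it handles only double quotes with no escapes. It is written for Perl 5.

```perl
use strict;
use warnings;

# Hypothetical helper: split $line on the regular expression $delim,
# except where the delimiter falls inside double quotes. Quote
# characters are removed from the returned fields.
sub split_outside_quotes {
    my ($delim, $line) = @_;
    my @fields;
    my $field = '';
    while (length $line) {
        if ($line =~ s/^"([^"]*)"//) {
            $field .= $1;                      # complete quoted section
        } elsif ($line =~ /^($delim)/ && length($1) > 0) {
            substr($line, 0, length($1), '');  # delimiter outside quotes
            push @fields, $field;
            $field = '';
        } else {
            $line =~ s/^(.)//s;                # any other single character,
            $field .= $1;                      # including an unmatched quote
        }
    }
    push @fields, $field;
    return @fields;
}

my @plain = split_outside_quotes(',\s+',
    '"Pomeranz, Hal", Support, "Saratoga, CA, USA"');
my @nasty = split_outside_quotes(',\s+', 'This ", would be" nasty');

print join(" | ", @plain), "\n";
print join(" | ", @nasty), "\n";
```

Scanning from the left and consuming one token at a time is what lets this approach cope with a quoted section in the middle of a field, which neither the split() tactic nor the single regular expression can do.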
Conclusion
Data reduction is a fairly common task for Perl programs, and the
method you use should be carefully tailored for the data you are
operating on. The split() function is good for data with regular
delimiters that do not appear inside the fields themselves (the
classic example is the UNIX password file). For data in fixed-width
fields, use pack() and unpack(), or substr() if you only need to
extract a single field. Pattern match is a good generic split
function, particularly if the data are very irregular. Dealing with
quoted fields is always difficult, but the problem has been solved, so
you do not have to reinvent the wheel.
Reproduced from ;login: Vol. 19 No. 3, June 1994
|