SAGE - Perl Practicum - The Email of the Species

Perl Practicum: The Email of the Species

by Hal Pomeranz

A lot of people seem to be interested in writing Perl scripts that either send email or parse email messages. This may have something to do with the growth of spamming software, or it may simply be a symptom of the growth of the Web/CGI and the Internet in general. This column presents some sample code to make handling email in Perl relatively painless.

Is This Address Valid?

There are a couple of ways to validate email addresses. The simplest validation is to compare the address against a regular expression. One possibility is:

        /^[^@]+@[^@]+\.[A-Za-z]{2,4}$/

Translate this to "some stuff followed by `@', followed by more stuff, and ending with a literal dot and two to four letters" (four letters for the ".arpa" and ".nato" domains). This still permits invalid email addresses like

        foo@..com
        foo@bar.baz

and addresses with various special characters that are not generally permitted in usernames or domain names. It is probably dangerous to constrain the username portion of the regexp given the proliferation of X.400 and various PC email packages that allow all manner of strange characters on the lefthand side of the address. However, the righthand side could be tightened up:

        /^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$/

The righthand side now requires one or more subdomains, each followed by a dot, before the top level domain specifier. You are welcome to list out all the valid three- and four-letter domain names, if you know them all (there are seven valid three-letter domains; prizes to the first person to send me the correct list in email).
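
To see how the two patterns differ, here is a small sketch that runs both against a few addresses. The check() helper and the sample addresses are made up for illustration; only the two regular expressions come from the text above.

```perl
use strict;
use warnings;

# The two patterns from the text: the loose one, and the one with the
# tightened righthand side.
my $loose = qr/^[^@]+@[^@]+\.[A-Za-z]{2,4}$/;
my $tight = qr/^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$/;

# Hypothetical helper: returns (loose_result, tight_result) as 1 or 0.
sub check {
    my($addr) = @_;
    return(($addr =~ $loose ? 1 : 0), ($addr =~ $tight ? 1 : 0));
}

foreach my $addr ('user@example.com', 'foo@..com', 'foo@bar.baz', 'foo@bar') {
    my($l, $t) = check($addr);
    printf("%-18s loose=%d tight=%d\n", $addr, $l, $t);
}
```

Only the tightened pattern rejects "foo@..com". Both still accept "foo@bar.baz", which is why a DNS check is worth considering.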

Another alternative is to interrogate the domain name service about the domain portion of the email address. The difficulty is that you have to check for either a mail exchanger (MX) record for the domain or an Internet address (A) record. Here is some sample output from the nslookup command:

        % nslookup -q=any cc.swarthmore.edu
        Server: localhost
        Address: 127.0.0.1

        Non-authoritative answer:
        cc.swarthmore.edu preference = 0, mail exchanger = cc.swarthmore.edu
        cc.swarthmore.edu internet address = 130.58.64.20

        Authoritative answers can be found from:
        swarthmore.edu nameserver = CS.swarthmore.edu
        swarthmore.edu nameserver = DNS-EAST.PREP.NET
        CS.swarthmore.edu internet address = 130.58.68.10
        DNS-EAST.PREP.NET internet address = 129.250.252.10

If any line starts with the domain name we are querying ("cc.swarthmore.edu") and contains either "mail exchanger" or "internet address" information (the above domain happens to have both), then the domain name is valid from an email perspective. We can codify this into the following function:

        sub valid_address {
            my($addr) = @_;
            my($domain, $valid);

            # Check the syntax first; it is much cheaper than a DNS lookup.
            return(0) unless ($addr =~ /^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$/);
            $domain = (split(/@/, $addr))[1];
            $valid = 0;
            open(DNS, "nslookup -q=any $domain |") || return(-1);
            while (<DNS>) {
                # \Q...\E keeps the dots in $domain from matching any character.
                $valid = 1 if (/^\Q$domain\E.*\s(mail exchanger|internet address)\s=/);
            }
            close(DNS);
            return($valid);
        }

The function returns "-1" on error, "0" if the address is invalid, and "1" if the address is valid. Note that we verify the address against the regular expression first, before paying the cost of invoking another process.
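
The matching step can be tried without running nslookup at all. The sketch below pulls that step into a hypothetical helper, has_mail_record(), and feeds it lines adapted from the sample output above. The helper name and the \Q...\E quoting of the domain are additions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the line-matching step of valid_address(),
# separated from the nslookup call so it can run on saved output.
sub has_mail_record {
    my($domain, @lines) = @_;
    foreach my $line (@lines) {
        # \Q...\E stops the dots in the domain from acting as wildcards.
        return(1) if ($line =~ /^\Q$domain\E.*\s(mail exchanger|internet address)\s=/);
    }
    return(0);
}

# Lines adapted from the sample nslookup output.
my @sample = (
    "cc.swarthmore.edu preference = 0, mail exchanger = cc.swarthmore.edu\n",
    "cc.swarthmore.edu internet address = 130.58.64.20\n",
    "swarthmore.edu nameserver = CS.swarthmore.edu\n",
    "CS.swarthmore.edu internet address = 130.58.68.10\n",
);

print has_mail_record("cc.swarthmore.edu", @sample) ? "valid\n" : "invalid\n";
print has_mail_record("swarthmore.edu", @sample)    ? "valid\n" : "invalid\n";
```

The second query comes back invalid for this sample because the only line that starts with "swarthmore.edu" is a nameserver record.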

The function still does not verify the user portion of the address, but this is essentially an intractable problem. With most organizations installing firewalls between their machines and the Internet, it is unlikely that your machine could discover, much less contact, the machine where final delivery will take place. Only at this machine, however, can you verify the authenticity of the user portion of the address.

Sending Email

The preferred mechanism for sending email from a program is to invoke sendmail directly, because this gives the program direct control over the header information. Besides the "To:", "From:", and "Subject:" headers, consider using "Reply-to:", "Errors-to:", and "Precedence:", particularly if you are sending out a mass mailing of some sort.

Here is a simple function for sending email to a list of recipients:

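        sub send_email {
            my($recip_ref, $header_ref, $body_ref) = @_;
            my($key);

            # The recipient list passes through the shell, so it must not
            # contain shell metacharacters.
            open(MAIL, "| /usr/lib/sendmail @{$recip_ref}") || return(undef);
            foreach $key (keys(%{$header_ref})) {
                print MAIL "$key: $$header_ref{$key}\n";
            }
            print MAIL "\n";            # a blank line ends the headers
            print MAIL @{$body_ref};
            # close() on a pipe reports whether sendmail exited cleanly.
            close(MAIL) || return(undef);
            return(1);
        }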

The function expects three references as arguments: a list reference containing the list of actual recipients, a hash reference containing the header information, and a list reference containing the lines of the body of the message. The hash reference should look like this:

        {'To' => 'foo@bar.com, baz@bar.com',
        'From' => 'Mail Program <you@yourdomain.com>',
        'Subject' => 'This here is some mail',
        'Precedence' => 'bulk',
        ...
        }

Lines in the body should have trailing newlines (or you will have to modify the function to insert them). The function returns nonzero on success and undef on failure.
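
It can help to look at a message before handing it to sendmail. The following sketch uses a hypothetical helper, format_message(), that builds the same text send_email writes to the pipe but returns it as a string instead. Sorting the header names is a choice made here only to keep the output predictable.

```perl
use strict;
use warnings;

# Hypothetical helper: build the text that would be written to
# sendmail, so it can be inspected before anything is sent.
sub format_message {
    my($header_ref, $body_ref) = @_;
    my $text = "";
    foreach my $key (sort(keys(%{$header_ref}))) {
        $text .= "$key: $$header_ref{$key}\n";
    }
    $text .= "\n";                      # a blank line ends the headers
    $text .= join("", @{$body_ref});
    return($text);
}

my %headers = (
    'To'      => 'foo@bar.com, baz@bar.com',
    'From'    => 'Mail Program <you@yourdomain.com>',
    'Subject' => 'This here is some mail',
);
my @body = ("First line of the message.\n", "Second line.\n");

print format_message(\%headers, \@body);
```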

Note that the function has the path to sendmail hard-coded. Change this if your sendmail binary is not in /usr/lib. If you are sending a large number of mail messages, be sure to put a sleep() statement between batches of email, or you will be responsible for a denial of service attack on your own machine and your organization's mail gateway.

Every call to the function starts a new sendmail process. If you send a large number of email messages in a short period of time, you will surely try to run more processes than your OS wants you to, at which point the open() call fails. Be sure to check the return value of the function if you think you will be starting more than a few dozen processes.
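
One way to pace a large mailing is sketched below. The helper send_in_batches() is hypothetical: it calls a code reference once per batch of recipients and sleeps between batches. The code reference stands in for a routine like send_email, and the batch size and delay are arbitrary values you would tune for your own gateway.

```perl
use strict;
use warnings;

# Hypothetical helper: call $send_ref once per batch of recipients,
# pausing $delay seconds between batches. Returns the number of
# batches sent, or undef on bad arguments or the first failure.
sub send_in_batches {
    my($recip_ref, $batch_size, $delay, $send_ref) = @_;
    return(undef) unless ($batch_size > 0);
    my @pending = @{$recip_ref};
    my $batches = 0;
    while (@pending) {
        my @batch = splice(@pending, 0, $batch_size);
        &$send_ref(\@batch) || return(undef);   # stop on the first failure
        $batches++;
        sleep($delay) if (@pending && $delay > 0);
    }
    return($batches);
}

# Demonstration with a stand-in sender that only prints each batch.
my @list = map { "user$_\@example.com" } (1 .. 5);
my $count = send_in_batches(\@list, 2, 0, sub {
    my($batch_ref) = @_;
    print "would send to: @{$batch_ref}\n";
    return(1);
});
print "$count batches\n";
```

Stopping at the first failure keeps the function from piling more work onto a machine that is already refusing to start processes.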

Receiving Email

Parsing an email message is a little more tricky. A typical email message looks like this:

        From somebody@somedomain.com Thu Feb 6 15:19 PST 1997
        <header1>: <stuff>
        <header2>: <more stuff>
            <more stuff for header2>
        ...
        <headerN>: <stuff>
        
        <line1>
        ...
        <lineN>

In UNIX mailboxes, messages always begin "\nFrom " (note the trailing space). That line is followed by one or more colon-separated lines of header information. Header lines may continue onto two or more lines, but continuation lines must begin with whitespace. The headers are terminated by a blank line. The body is everything else until the next "\nFrom ".
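
As a rough sketch of that rule, the hypothetical helper below splits the lines of a mailbox into separate messages. It treats "From " as a separator only at the start of the file or after a blank line, and it assumes that body lines beginning with "From " after a blank line have been escaped (commonly as ">From ") by whatever wrote the mailbox.

```perl
use strict;
use warnings;

# Hypothetical helper: split the lines of a UNIX mailbox into messages.
# Returns a list of array references, one per message.
sub split_mailbox {
    my(@lines) = @_;
    my(@messages, $prev_blank);
    $prev_blank = 1;            # the start of the file counts as a blank line
    foreach my $line (@lines) {
        # A new message starts at "From " following a blank line.
        if ($prev_blank && $line =~ /^From /) {
            push(@messages, []);
        }
        push(@{$messages[-1]}, $line) if (@messages);
        $prev_blank = ($line =~ /^\s*$/) ? 1 : 0;
    }
    return(@messages);
}

my @mbox = (
    "From alice\@example.com Thu Feb 6 15:19 PST 1997\n",
    "Subject: first\n",
    "\n",
    "Hello.\n",
    "\n",
    "From bob\@example.com Thu Feb 6 15:25 PST 1997\n",
    "Subject: second\n",
    "\n",
    "Hi.\n",
    "From here on it is still body text.\n",
);

my @msgs = split_mailbox(@mbox);
print scalar(@msgs), " messages\n";
```

Each element of the result can be dereferenced and handed to a parser that expects the lines of a single message.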

Suppose we have the lines from a single email message broken out into a list. We need a function to break the message out into a hash structure for easy manipulation. The keys of the hash will be the various headers, and the corresponding values will be the associated data.

        sub parse_email {
            my(@lines) = @_;
            my($line, $header, $val, %hash);

            shift(@lines);                      # discard the "From " line
            while (@lines) {
                $line = shift(@lines);
                last if ($line =~ /^\s*$/);     # a blank line ends the headers
                $line =~ s/\s*$//;
                if ($line =~ /^\s+/) {          # continuation line
                    $line =~ s/^\s+//;
                    $hash{$header} .= " $line";
                }
                else {
                    # \s* rather than \s+ so an empty header still splits
                    ($header, $val) = split(/:\s*/, $line, 2);
                    $hash{$header} = $val;
                }
            }
            @{$hash{"BODY"}} = @lines;
            return(%hash);
        }

First the function throws away the initial "From " line. Then the function eats lines out of the list until it encounters a blank line marking the end of the headers. For each header line, the function checks to see whether the line is a continuation line (starts with whitespace) or a new header. Continuation lines are appended to the previous header value. New lines are split in two on the first colon and stuffed into the hash. Once the headers are dispensed with, the remaining body lines are stuffed into a list reference in the hash.

The only difficulty is that certain headers, e.g., "Received:", can appear more than once. To resolve this problem, change all of the values in the hash to list references in order to accommodate the extra data:

        sub parse_email {
            my(@lines) = @_;
            my($line, $header, $val, %hash);

            shift(@lines);                      # discard the "From " line
            while (@lines) {
                $line = shift(@lines);
                last if ($line =~ /^\s*$/);     # a blank line ends the headers
                $line =~ s/\s*$//;
                if ($line =~ /^\s+/) {          # continuation line
                    $line =~ s/^\s+//;
                    $val .= " $line";
                    next;
                }
                # A new header begins, so store the one just completed.
                push(@{$hash{$header}}, $val) if ($header);
                ($header, $val) = split(/:\s*/, $line, 2);
            }
            push(@{$hash{$header}}, $val) if ($header);     # the last header
            @{$hash{"BODY"}} = @lines;
            return(%hash);
        }

The algorithm has been modified slightly: instead of stuffing new header information into the hash immediately and then appending continuation lines, the function pulls the entire header together and stuffs it into the hash only when a new header (or the end of the headers) is encountered. The alternative, an expression for appending continuation lines to the last element of an anonymous list reference in the hash, was nearly gibberish.
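
Here is a short sketch of how the list-valued hash might be used. The function is a copy of the one above, adapted slightly (it splits on a colon followed by optional whitespace) and repeated so that the example runs on its own. The message itself is made up.

```perl
use strict;
use warnings;

# Adapted copy of the list-valued parse_email() from the text.
sub parse_email {
    my(@lines) = @_;
    my($line, $header, $val, %hash);

    shift(@lines);
    while (@lines) {
        $line = shift(@lines);
        last if ($line =~ /^\s*$/);
        $line =~ s/\s*$//;
        if ($line =~ /^\s+/) {
            $line =~ s/^\s+//;
            $val .= " $line";
            next;
        }
        push(@{$hash{$header}}, $val) if ($header);
        ($header, $val) = split(/:\s*/, $line, 2);
    }
    push(@{$hash{$header}}, $val) if ($header);
    @{$hash{"BODY"}} = @lines;
    return(%hash);
}

# A made-up message with two "Received:" headers, one of them continued.
my @message = (
    "From somebody\@somedomain.com Thu Feb 6 15:19 PST 1997\n",
    "Received: from relay.somedomain.com\n",
    "\tby mail.yourdomain.com\n",
    "Received: from host.somedomain.com\n",
    "Subject: test\n",
    "\n",
    "Body line one.\n",
);

my %msg = parse_email(@message);
print "Received headers: ", scalar(@{$msg{"Received"}}), "\n";
print "First hop: $msg{'Received'}[0]\n";
print "Subject: $msg{'Subject'}[0]\n";
```

Every value is now a list reference, so even single-valued headers such as "Subject:" are read with a [0] subscript.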

Be Good

You now have more than enough rope to hang yourself, so let us close with a couple of admonishments. First, do not use this code to send unsolicited email or spam to anybody: you are only stealing from your potential customers and/or targets. Second, if you plan on writing your own version of the "vacation" program (why? yet people seem to do this all the time), make sure you pay attention to the "Precedence:" header and do not send responses to any message marked "Precedence: bulk". If you do, you may well end up spamming an entire mailing list. Thus ends the airing of your humble author's pet peeves.
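
For the second admonishment, a minimal sketch of the check might look like the following. The helper should_autoreply() is hypothetical, and treating "junk" and "list" the same as "bulk" is an assumption added here; the text above names only "bulk".

```perl
use strict;
use warnings;

# Hypothetical helper for a vacation-style responder: given the values
# of a message's "Precedence:" headers, decide whether a reply is safe.
# Refusing "junk" and "list" as well as "bulk" is an added assumption.
sub should_autoreply {
    my(@precedence) = @_;
    foreach my $value (@precedence) {
        return(0) if ($value =~ /^\s*(bulk|junk|list)\s*$/i);
    }
    return(1);
}

print should_autoreply("bulk") ? "reply\n" : "no reply\n";
print should_autoreply()       ? "reply\n" : "no reply\n";
```

A message with no "Precedence:" header at all is treated as safe to answer.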

Reproduced from ;login: Vol. 22 No. 2, April 1997.

Last changed: May 24, 1997 pc
