SAGE - Perl Practicum - The Camel Spins a Web

Perl Practicum: The Camel Spins a Web

by Hal Pomeranz

Those of you who have been living under a rock for the last twelve months may have missed out on this whole World Wide Web thing. Most of you have probably already tried your hand at some basic HTML authoring. The most interesting application of Web technology, though, is using the Web as an interface to arbitrary data from other sources such as databases and system applications. One mechanism for creating these interfaces is the Common Gateway Interface, CGI for short.

CGI Basics

Simply, a CGI program produces (on the standard output) a special header line followed by an arbitrary number of lines of output. The HTTP server running on your machine invokes your CGI script and feeds the output to the browser that requested the page (usually the hardest part of this whole equation is learning how to configure your HTTP server to execute your file). The CGI program can be written in any language you like, but why would you write in anything but Perl?

Here is a trivial example:

        #!/bin/perl

        print "Content-type: text/html\n\n";
        print <<"EOmyPage";
        <TITLE>"Hello World!" Page</TITLE>
        <H2>HELLO WORLD!</H2>
        EOmyPage

The first line of the script prints the header information, specifying the type of document which follows the header. In this case, we are saying that the document is an HTML text document. A blank line must follow the header information (note the two \ns). The rest of the program is just a "here document" which prints a trivial HTML page. If at this point you are thinking, "That's easy!", you are absolutely correct: there is no great mystery to this CGI stuff.

External Files and Applications

However, the power of this mechanism cannot be overstated. As long as your program produces the correct output format, it can be arbitrarily complex. For example, you can read from and write to external files:

        #!/bin/perl

        print "Content-type: text/html\n\n";

        $visitors = `cat countfile`;
        $visitors++; 	
        if (open(OUT, "> countfile")) {
             print OUT $visitors; 	
             close(OUT); 	
             print <<"EOmyPage"; 	
        <TITLE>Welcome</TITLE>
        Hello visitor number $visitors.
        EOmyPage
        }
        else {
             print "Sorry, an error occurred\n";
        }

Be warned that your HTTP server will probably be running under some other user ID and will have that user's access rights to files on your system (try to run your servers as a user with no privileges, like the "nobody" user - NEVER give HTTP servers superuser access). Make sure that whatever files you are manipulating have the correct access rights.

It is almost never a good idea to abort a CGI program in the middle of execution. Remember that there is a user on the other side of the Internet who is expecting some sort of page to be returned by your script. Notice that the script above prints an error message if the open() fails rather than calling die() as you would usually.
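
One way to follow this advice without giving up die() entirely is to wrap the risky work in an eval block and turn any failure into an error page. The following is only a sketch: the bump_counter() and counter_page() helpers are hypothetical names invented for illustration, and the counter file is the same countfile used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: increment the count stored in $file and return
# the new value. It dies on failure; the caller decides what to show.
sub bump_counter {
    my ($file) = @_;
    my $count = 0;
    if (open(my $in, '<', $file)) {
        my $line = <$in>;
        close($in);
        $count = $1 if defined $line && $line =~ /^(\d+)/;
    }
    $count++;
    open(my $out, '>', $file) or die "cannot write $file: $!\n";
    print $out "$count\n";
    close($out) or die "cannot close $file: $!\n";
    return $count;
}

# Hypothetical helper: build the page body as a string, so that the
# user gets a page back whether or not the counter update worked.
sub counter_page {
    my ($file) = @_;
    my $visitors = eval { bump_counter($file) };
    if (!defined $visitors) {
        # $@ holds the reason; report it on stderr, not to the visitor
        warn "counter failed: $@";
        return "<TITLE>Sorry</TITLE>\nSorry, an error occurred.\n";
    }
    return "<TITLE>Welcome</TITLE>\nHello visitor number $visitors.\n";
}

print "Content-type: text/html\n\n";
print counter_page('countfile');
```

The message passed to warn() goes to the standard error stream, which many HTTP servers copy into their error log; check your own server's documentation before relying on that. Note that this sketch, like the original script, does no file locking, so two simultaneous visitors can still collide.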

Also keep in mind that you can manipulate the output of other programs from within your Perl script. In the example above, we used the UNIX cat program to retrieve the contents of a file, but CGI allows you to effectively extend the reach of the Web by making data from other programs available to Web browsers. For example, here is a little CGI script that gives back ps output from the machine it is run on (one could imagine this as part of a suite of remote diagnostic tools for a large network):

        #!/bin/perl 	
        print "Content-type: text/plain\n\n";
        if (open(PS, "ps -ef |")) { 	
             while (<PS>) { print; } 	
        } 	
        else { 	
             print "An error occurred\n";
        }

Note that we are using a different Content-type header. Plain text is usually displayed by browsers in a fixed-width font (such as Courier) with all whitespace preserved (unlike HTML). For those of you familiar with HTML, the output usually looks as if it had been formatted in a <PRE> block.
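
If you would rather embed the command output in an HTML page than send it as plain text, the output has to be escaped first, because characters such as < and & are special in HTML. Here is a minimal sketch of that variation; the escape_html() helper is a hypothetical name and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: make text safe to place inside an HTML page.
# The ampersand is replaced first; otherwise the & in the &lt; and
# &gt; produced by the later substitutions would be escaped again.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print "Content-type: text/html\n\n";
print "<TITLE>Process List</TITLE>\n<PRE>\n";
if (open(my $ps, '-|', 'ps', '-ef')) {
    # Escape each line of ps output before it reaches the browser
    while (my $line = <$ps>) {
        print escape_html($line);
    }
    close($ps);
}
else {
    print "An error occurred\n";
}
print "</PRE>\n";
```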

You can call just about any program. You could interface with other network information services like gopher and WAIS, or even NNTP (how about a Web-based threaded newsreader?). You could interface with pieces of your company database and write a company phone book page, or allow people to review their benefits via the Web. However, think about security before you go off and try to save the world with the Web: you may not want everybody in the world to have easy access to much of your data. Even the ps example above potentially gives away more knowledge to people outside your organization than you should be comfortable with.

The CGI Environment

Before executing your CGI program, your HTTP server will set a number of environment variables. The CGI specification (https://hoohoo.ncsa.uiuc.edu/cgi/interface.html) spells out exactly what information is provided, but here is a useful little test program to see for yourself:

        #!/bin/perl

        print "Content-type: text/plain\n\n";
        foreach $var (sort keys %ENV) { 	
             print "\$ENV{$var} = '$ENV{$var}'\n";
        }

For example, the REMOTE_HOST and REMOTE_ADDR variables give the fully qualified hostname and the IP address of the machine that is connecting to your HTTP server. At NetMarket we get a lot of "How'd you do that?!?" comments because our home page prints a little "Thanks for connecting from $ENV{'REMOTE_HOST'}" message.
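
A greeting like that might be sketched as follows. Servers that do not look up hostnames may leave REMOTE_HOST unset, so this version falls back to REMOTE_ADDR. The client_name() helper and the "parts unknown" fallback text are hypothetical and only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pick the best available name for the client.
# It takes a reference to the environment hash so that it can be
# exercised without a running HTTP server.
sub client_name {
    my ($env) = @_;
    for my $var ('REMOTE_HOST', 'REMOTE_ADDR') {
        my $value = $env->{$var};
        return $value if defined $value && length $value;
    }
    return 'parts unknown';
}

print "Content-type: text/html\n\n";
print "<TITLE>Welcome</TITLE>\n";
print "Thanks for connecting from ", client_name(\%ENV), "<P>\n";
```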

The client browser can also send information to your HTTP server. Your HTTP server will put this information into your CGI program's environment using variables that are prefixed with HTTP_. In particular, the client will usually provide an identifying string such as "NCSA Mosaic for the X Window System/2.4 libwww/2.12 modified" in the HTTP_USER_AGENT variable. Unfortunately, there is no established format standard for user agent information, so it is nearly impossible to build a procedure which can identify an arbitrary browser from its user agent information. However, it is pretty easy to recognize most of the major browsers.

What good is identifying a browser? Remember that older browsers may not support all the latest features of the HTML specification. For example, you do not want to send a table to NCSA Mosaic 2.4 because the browser cannot format the table information, and you would not want to send an image map to a text-only browser like Lynx because the user would not be able to see the image.
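
Here is a rough sketch of that kind of browser check. The browser_family() helper is hypothetical, the patterns are assumptions based on commonly seen user agent strings rather than on any standard, and the phone list entry is placeholder data. Treating only the "mozilla" family as table-capable is also an assumption you should adjust for the browsers you care about.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map a user agent string to a rough browser
# family. The patterns are guesses, not part of any specification.
sub browser_family {
    my ($agent) = @_;
    return 'unknown' unless defined $agent;
    return 'lynx'    if $agent =~ /^Lynx/i;
    return 'mosaic'  if $agent =~ /Mosaic/i;
    return 'mozilla' if $agent =~ /^Mozilla/i;
    return 'unknown';
}

my $family = browser_family($ENV{'HTTP_USER_AGENT'});

print "Content-type: text/html\n\n";
print "<TITLE>Phone List</TITLE>\n";
if ($family eq 'mozilla') {
    # Assumed to be able to render tables
    print "<TABLE>\n<TR><TD>Sales</TD><TD>x1234</TD></TR>\n</TABLE>\n";
}
else {
    # Everything else, including unknown browsers, gets preformatted text
    print "<PRE>\nSales    x1234\n</PRE>\n";
}
```

Defaulting unknown browsers to the simplest output is the safer choice: a page that is plain but readable beats one that the browser cannot display.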

Processing Forms

HTML allows you to create pages which allow the user to type in information and submit it to your server. Here is a simple HTML form:

        <TITLE>Send Us Email!</TITLE>

        We'd love to hear from you. Enter your email address and
        comments in the spaces provided and we'll respond as quickly as we
        can!<P>

        <FORM METHOD="POST"
             ACTION="bin/process_form">
        Your E-mail address<BR>
        <INPUT NAME="email" SIZE=45 MAXLENGTH=45><BR> 	
        Your Message<BR>
        <TEXTAREA NAME="comments" ROWS=12 COLS=45></TEXTAREA><P>
        <INPUT TYPE="submit" VALUE="Send your comments"> 	
        </FORM>

The <FORM ... ACTION=" ... "> tag specifies what program the user's browser should try to call when they submit the form information. This form creates a space for the user to enter an email address and a free-form text area for the user to type in a message. Finally, there is a Send your comments button to allow the user to submit the form information.

When the user punches the Send your comments button, the client browser bundles up all the information that the user entered in and sends that information to your HTTP server along with a request to the server to run the appropriate program from the <FORM ... ACTION=" ... ">. How your program gets the form information depends upon the <FORM METHOD="..." ...> tag. In the example form above, the form method is POST, which means that the form information will be handed to your CGI program on the standard input. You will get a blob of data whose length will be specified by the CONTENT_LENGTH environment variable. The easiest way to grab the data is with the read() function:

        #!/bin/perl

        read(STDIN, $stuff, $ENV{'CONTENT_LENGTH'});
        . . .

Now you have to break up the data into intelligible pieces. The data comes to you in name=value pairs separated by & characters. The names for each piece of data are whatever you specified in the form using the <... NAME=" ... "...> tags: in the example above, the name for the email field is email, and the name for the free-form text area is comments. The other tricky part is that spaces are converted to + signs and non-alphanumeric characters are generally converted to %<hex> where <hex> is the ASCII value for the character in hexadecimal notation. Typically, the beginning of all form processing programs looks like:

        #!/bin/perl

        read(STDIN, $stuff, $ENV{'CONTENT_LENGTH'}); 	
        @pairs = split(/\&/, $stuff);
        for (@pairs) { 	
             ($field, $val) = split(/=/);
             $field =~ s/\+/ /g;
             $field =~ s/%([0-9A-Fa-f]{2})/sprintf("%c", hex($1))/eg;
             $val =~ s/\+/ /g;
             $val =~ s/%([0-9A-Fa-f]{2})/sprintf("%c", hex($1))/eg;
             $entries{$field} = $val;
        }
        ...

First, we read the data off the standard input and then break it up into a list of name=value pairs. Then we iterate over each pair, break the pair apart, and convert the plus signs and hexadecimal escapes back to the original characters. Do not try to do the substitutions before you split everything up because some of the escaped characters may be & or =. Convert the + signs to spaces first because some of the escaped characters may be +.
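
Since nearly every form script starts this way, you may want to package the steps as subroutines. The sketch below is one possible arrangement, not the only one; url_decode(), parse_form(), and form_data() are hypothetical names. It also covers forms submitted with METHOD="GET", where the data arrives in the QUERY_STRING environment variable instead of on the standard input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: undo the encoding applied by the browser
sub url_decode {
    my ($text) = @_;
    $text =~ s/\+/ /g;                             # plus signs first
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;  # then hex escapes
    return $text;
}

# Hypothetical helper: turn "name=value&name=value" into a hash.
# Splitting happens before decoding, as explained above.
sub parse_form {
    my ($data) = @_;
    my %entries;
    for my $pair (split(/&/, $data)) {
        my ($field, $val) = split(/=/, $pair, 2);
        next unless defined $field && length $field;
        $val = '' unless defined $val;
        $entries{url_decode($field)} = url_decode($val);
    }
    return %entries;
}

# Hypothetical helper: fetch the raw form data for POST or GET
sub form_data {
    my $method = $ENV{'REQUEST_METHOD'};
    $method = '' unless defined $method;
    if ($method eq 'POST') {
        my $data = '';
        read(STDIN, $data, $ENV{'CONTENT_LENGTH'} || 0);
        return $data;
    }
    if ($method eq 'GET' && defined $ENV{'QUERY_STRING'}) {
        return $ENV{'QUERY_STRING'};
    }
    return '';
}

my %entries = parse_form(form_data());
```

Keeping parse_form() separate from form_data() means you can try the parser from the command line with a literal string, without going through your HTTP server.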

Now that you have parsed out the input into an associative array, you can do anything with the information you like. You must return a page back to the user, however, as a result of their forms submission:

        print "Content-type: text/html\n\n";

        if (open(MAIL,
        "| /usr/lib/sendmail webmaster"))
        { 	
             print MAIL <<"EOdoc";
        From: The Comments Page <webmaster>
        To: webmaster
        Subject: Comments Mail

        Mail from: $entries{"email"}

        $entries{"comments"}
        EOdoc
             close(MAIL);

             print <<"EOpage";
        <TITLE>Thanks!</TITLE>
        Thanks for taking the time to send us comments!<P> 	
        We will be responding promptly.<P> 	
        EOpage
        }
        else { 	
             print <<"EOpage"; 	
        <TITLE>Bummer!</TITLE>
        We encountered an error trying to send your comments.<P> 	
        Please send mail to <I>webmaster\@netmarket.com</I><P>
        EOpage
        }

Be VERY careful about what you do with the data you collect from a form: remember that the user can type ANYTHING into that form and could cause huge amounts of havoc if you trust what they type in. Do not ever allow form data to be used as part of a command that you execute from your script. Notice that I will not even put the user's email address in the From: line of my message because that data might be used to generate a sendmail command if the email bounces.
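
If you do need to act on a form field, check it against a strict pattern of what you are willing to accept and reject everything else. Here is a minimal sketch for the email field. The safe_email() helper is hypothetical, and its pattern is deliberately much narrower than the full range of legal email addresses; treat it as an illustration of the approach, not a complete validator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the address if it looks harmless,
# otherwise undef. Whitespace, newlines, and shell metacharacters
# are all rejected, as is a leading hyphen (which a command might
# mistake for an option).
sub safe_email {
    my ($addr) = @_;
    return undef unless defined $addr;
    return undef if length($addr) > 45;   # MAXLENGTH used in the form
    if ($addr =~ /\A([A-Za-z0-9][A-Za-z0-9._+-]*\@[A-Za-z0-9][A-Za-z0-9.-]*)\z/) {
        return $1;
    }
    return undef;
}

# Placeholder input standing in for $entries{"email"}: an attempt to
# smuggle an extra mail header in after a newline
my $submitted = "someone\@example.com\nBcc: victim\@example.com";
my $email = safe_email($submitted);
print defined $email ? "accepted: $email\n" : "rejected\n";
```

Even with a check like this in place, the advice above still stands: keep form data out of command lines whenever you can.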

Further Study

The best way to become familiar with CGI is to start writing some CGI programs. You will probably want to install your own HTTP server so that you can play around with the configuration. NCSA httpd (available via anonymous FTP from ftp.ncsa.uiuc.edu) is free and easy to build and configure, though it is not the fastest server in the world. You will also want to study the CGI overview (https://hoohoo.ncsa.uiuc.edu/cgi/overview.html) and the tips for writing secure CGI scripts (https://hoohoo.ncsa.uiuc.edu/cgi/security.html).

Sample CGI programs are available all over the Web (NCSA has a small archive of examples to get you started).

Reproduced from ;login: Vol. 20 No. 4, August 1995.

Last changed: May 24, 1997 pc
