Many web pages display data in a custom format, using HTML markup to set off important parts of the text typographically or spatially. Figure 6 shows part of a page describing user interface toolkits [17].
AlphaWindow, Cumulus Technology Corp., 1007 Elwell Court, Palo Alto, CA, 94303, (415) 960-1200, $750, Unix, Discontinued, Alpha-numeric terminal windows, Window System
Altia Design, Altia,
Amulet,
Figure 6: Excerpt from a web page describing user interface toolkits.
Each toolkit on this page is contained in a single paragraph (<P> element in HTML). So we might start by describing the toolkit as the Paragraph element, which is identified by the built-in HTML parser:
Toolkit = Paragraph
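For readers without the system at hand, the idea of treating each paragraph as one Toolkit region can be approximated in Python. This is only a rough sketch under stated assumptions, not the built-in HTML parser the text refers to: it uses the standard library's `html.parser`, and the names `ParagraphCollector` and `toolkit_paragraphs` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element as one candidate Toolkit region."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._current = None   # text chunks of the open paragraph, if any

    def _finish(self):
        # Store the open paragraph (whitespace-normalized) and reset.
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Assumption: an unclosed <p> ends where the next <p> begins.
            self._finish()
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()

    def handle_data(self, data):
        # Text outside any <p> is ignored.
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._finish()


def toolkit_paragraphs(html):
    """Return the text of every paragraph in the given HTML string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Unlike a region set, this sketch returns plain strings and discards positions in the page, so it cannot support later constraints that refer back to the original document.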
Finding the prices is straightforward using Number, a region set identified by the built-in USEnglish parser:
Price = ("\$" then Number | "FREE")
in Toolkit;
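The Price constraint can be mimicked with a regular expression applied to the text of one toolkit paragraph. The sketch below is an approximation rather than the USEnglish parser's notion of Number: the function name `find_price`, the accepted number format, and the choice to treat "FREE" as a price of zero are all assumptions made for illustration.

```python
import re

# "$" followed by a number, or the literal word FREE.
# Assumptions: optional whitespace after "$", optional thousands
# separators, optional decimal part.
PRICE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)|\bFREE\b")


def find_price(toolkit_text):
    """Return the first price in the paragraph as a float, or None.

    "FREE" is mapped to 0.0 (an assumption) so that prices can be sorted.
    """
    match = PRICE.search(toolkit_text)
    if match is None:
        return None
    if match.group(1) is None:   # the FREE alternative matched
        return 0.0
    return float(match.group(1).replace(",", ""))
```

Searching only within one paragraph's text plays the role of the `in Toolkit` clause: a price is attributed to the toolkit whose paragraph contains it.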
Finding toolkits that run under Macintosh is easy (Toolkit contains "Mac"), since the page refers consistently to Macintosh as "Mac". But Unix platforms are sometimes described as "X", "X Windows", or "Motif", and Microsoft Windows is also called "MS Windows" or just plain "Windows". We deal with these problems by defining a constraint for each kind of platform that specifies all these possibilities and further constrains the matched literal to be a full Word (not just part of a word):
Macintosh = Word, "Mac";
Unix = Word, ("Unix" | "X" | "Motif");
MSWindows = Word, ("PC" | "Windows" but not just after "X");
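The three platform constraints above can be approximated with regular expressions, using word boundaries for the full-Word requirement and a negative lookbehind for "but not just after". This is a hedged sketch, not a translation with identical semantics: the function name `platforms` is hypothetical, and because Python lookbehinds must be fixed-width, only a single whitespace character between "X" and "Windows" is handled.

```python
import re

# \b on both sides requires the literal to be a whole word.
MACINTOSH = re.compile(r"\bMac\b")
UNIX = re.compile(r"\b(?:Unix|X|Motif)\b")
# "Windows" counts only when it is not immediately preceded by the word "X"
# and one whitespace character (fixed-width lookbehind, an approximation).
MS_WINDOWS = re.compile(r"\bPC\b|(?<!\bX\s)\bWindows\b")


def platforms(toolkit_text):
    """Return the set of platform names mentioned in one toolkit paragraph."""
    found = set()
    if MACINTOSH.search(toolkit_text):
        found.add("Macintosh")
    if UNIX.search(toolkit_text):
        found.add("Unix")
    if MS_WINDOWS.search(toolkit_text):
        found.add("MSWindows")
    return found
```

The lookbehind is what keeps "X Windows" from being counted as Microsoft Windows, while the word boundaries keep "Mac" from matching inside a longer word.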
Using these definitions, we can readily filter the web page for toolkits matching certain requirements (Toolkit, contains Unix, contains MSWindows) and sort them according to Price.
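As a rough illustration of the filter-and-sort step, assume each toolkit paragraph has already been reduced to a name, a price, and a set of platform names. The `Toolkit` record, the `select` function, and the decision to place toolkits with unknown prices last are all hypothetical choices for this sketch, not part of the system described here.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Toolkit(NamedTuple):
    """Hypothetical record for one toolkit paragraph."""
    name: str
    price: Optional[float]       # None when no price was found
    platforms: FrozenSet[str]    # e.g. {"Unix", "MSWindows"}


def select(toolkits: Iterable[Toolkit], required: Iterable[str]) -> List[Toolkit]:
    """Keep toolkits that support every required platform, cheapest first."""
    wanted = frozenset(required)
    matches = [t for t in toolkits if wanted <= t.platforms]
    # Assumption: toolkits without a known price sort after priced ones.
    return sorted(matches, key=lambda t: (t.price is None, t.price or 0.0))
```

Calling `select(toolkits, {"Unix", "MSWindows"})` corresponds to the constraint Toolkit, contains Unix, contains MSWindows, followed by a sort on Price.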