Many web pages display data in a custom format, using HTML markup to set off important parts of the text typographically or spatially. Figure 6 shows part of a page describing user interface toolkits [17].
AlphaWindow, Cumulus Technology Corp., 1007 Elwell Court, Palo Alto, CA, 94303, (415) 960-1200, $750, Unix, Discontinued, Alpha-numeric terminal windows, Window System
Altia Design, Altia,
Amulet,
Figure 6: Excerpt from a web page describing user interface toolkits.
Each toolkit on this page is contained in a single paragraph (<P> element in HTML). So we might start by describing the toolkit as the Paragraph element, which is identified by the built-in HTML parser:
Toolkit = Paragraph;
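As a point of comparison, a rough analogue of the Paragraph region set can be sketched in Python using the standard-library html.parser module. This is only an illustration, not the built-in HTML parser referred to above: the names ParagraphCollector and toolkits are hypothetical, and the sketch assumes that each <P> start tag begins a new toolkit entry whether or not the previous paragraph was explicitly closed.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # text of every finished paragraph
        self._current = None   # chunks of the open paragraph, or None

    def _finish(self):
        # Store the open paragraph, if any, with whitespace normalized.
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":          # tag names arrive lowercased
            self._finish()      # an unclosed <p> ends where the next begins
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()         # flush any buffered text first
        self._finish()


def toolkits(html):
    """Return the text of each paragraph in the given HTML string."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs
```

Text outside any paragraph is ignored, which mirrors the idea that only Paragraph regions count as toolkits.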
Finding the prices is straightforward using Number, a region set identified by the built-in USEnglish parser:
Price = ("\$" then Number | "FREE") in Toolkit;
Finding toolkits that run under Macintosh is easy (Toolkit contains "Mac"), since the page refers consistently to Macintosh as "Mac". But Unix platforms are sometimes described as "X", "X Windows", or "Motif", and Microsoft Windows is also called "MS Windows" or just plain "Windows". We deal with these problems by defining a constraint for each kind of platform that specifies all these possibilities and further constrains the matched literal to be a full Word (not just part of a word):
Macintosh = Word, "Mac";
Unix = Word, ("Unix" | "X" | "Motif");
MSWindows = Word, ("PC" | "Windows" but not just after "X");
Using these definitions, we can readily filter the web page for toolkits matching certain requirements (Toolkit, contains Unix, contains MSWindows) and sort them according to Price.
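To make the filtering and sorting step concrete, the following Python sketch approximates the Price and platform constraints with regular expressions. It is an illustration under stated assumptions, not the system's own mechanism: the names PRICE, PLATFORMS, price, runs_on, and select are hypothetical, "just after" is approximated as "preceded by exactly one space", FREE is treated as a price of zero, and toolkits with no recognizable price are sorted last.

```python
import re

# "$" followed by a number, or the literal word FREE (approximates Price).
PRICE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)|\bFREE\b")

# Whole-word matches approximating the three platform constraints.
PLATFORMS = {
    "Macintosh": re.compile(r"\bMac\b"),
    "Unix": re.compile(r"\b(?:Unix|X|Motif)\b"),
    # "Windows" counts only when it does not directly follow the word "X".
    "MSWindows": re.compile(r"\bPC\b|(?<!\bX )\bWindows\b"),
}


def price(toolkit):
    """Return the first price in a toolkit entry as a float, or None."""
    match = PRICE.search(toolkit)
    if match is None:
        return None
    if match.group(1) is None:      # matched the word FREE
        return 0.0
    return float(match.group(1).replace(",", ""))


def runs_on(toolkit, *platforms):
    """True if the entry mentions every named platform."""
    return all(PLATFORMS[name].search(toolkit) for name in platforms)


def select(toolkits, *platforms):
    """Filter entries by platform and sort them by price, unpriced last."""
    hits = [t for t in toolkits if runs_on(t, *platforms)]
    return sorted(hits, key=lambda t: (price(t) is None, price(t) or 0.0))


if __name__ == "__main__":
    entries = [
        # First entry is taken from Figure 6; the rest are made-up samples.
        "AlphaWindow, Cumulus Technology Corp., 1007 Elwell Court, "
        "Palo Alto, CA, 94303, (415) 960-1200, $750, Unix, Discontinued, "
        "Alpha-numeric terminal windows, Window System",
        "ToolA, $100, X Windows, MS Windows",
        "ToolB, FREE, Motif, PC",
        "ToolC, $50, X Windows",
    ]
    for entry in select(entries, "Unix", "MSWindows"):
        print(price(entry), entry)
```

Because the patterns are anchored at word boundaries and are case-sensitive, "Window System" and "terminal windows" in the AlphaWindow entry are not mistaken for Microsoft Windows, and the "Windows" in "X Windows" is rejected by the lookbehind.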