
Introduction

A large part of the capacity of computers is devoted to the capture, storage, analysis, transformation and production of human-readable documents, in the form of publications, correspondence and web documents. Therefore, a growing number of applications are dependent on, or could benefit from, some form of linguistic analysis of documents.

In particular, Natural Language Processing (NLP) is an important enabling technology for future web-based applications: from the classification of web pages, filtering and narrowcasting to more intelligent search engines and services based on the automatic interpretation of the contents of documents. As is the case in Information Retrieval (IR) in general, the state of the art in web search engines is based mainly on the use of keywords, and only some limited linguistic techniques are used to enhance recall: stop lists, stemming, some ontologies, and the simplest of phrase recognition techniques. An example is the Linguistix software library, incorporated in commercial search engines such as AltaVista and Ask Jeeves, which performs tagging, lemmatization and fuzzy semantic matching.
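
By way of illustration, the following sketch (a hypothetical example in Python, not code from any of the systems named above) shows what such shallow, keyword-based retrieval amounts to: a stop list, crude suffix stripping in place of a real stemmer, and bag-of-words overlap scoring, with no syntactic or semantic analysis at all.

    # Hypothetical sketch of "shallow" keyword matching: stop list,
    # crude suffix stripping, and bag-of-words overlap scoring.

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "is", "are", "to", "in"}
    SUFFIXES = ("ing", "ed", "er", "es", "s")   # stand-in for a real stemmer

    def stem(word: str) -> str:
        """Strip one common suffix; real systems use a Porter-style stemmer."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def index_terms(text: str) -> set:
        """Lowercase, drop stop words, stem: purely word-level processing."""
        return {stem(w) for w in text.lower().split() if w not in STOP_WORDS}

    def score(query: str, document: str) -> int:
        """Number of query terms also occurring in the document."""
        return len(index_terms(query) & index_terms(document))

    doc = "The parser analyses the syntactic structures of English sentences"
    print(score("parsing English syntax", doc))
    # Prints 2: "parsing"/"parser" and "English" match after stemming,
    # but "syntax" fails to match "syntactic", a typical recall gap.

Such word-level tricks improve recall cheaply, but as the example shows, they cannot relate "syntax" to "syntactic", let alone decide what role a word plays in a sentence.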

No use is made of syntactic or semantic analysis, and the discourse structure of documents is largely ignored, although these might yield an important increase in precision. The great success of the present statistical techniques combined with such "shallow linguistic techniques" [Sparck Jones, 1998] has led to the idea that deep linguistics is not worth the trouble.

What is worse, it is very hard to find resources for building applications that use deeper linguistic techniques such as parsing. In applied linguistics communities such as the corpora mailing list, many groups appear to be in need of parsers and lexica for natural languages, and requests for freely accessible linguistic resources are frequently posted. But such resources are simply not available, or not free.

There is a definite need for parsers and lexica in the public domain, so that people developing, say, a question-answering system do not have to start by reinventing the wheel. The extraordinary success of one such resource, the [Princeton WordNet], may be due more to its public availability than to superb quality, but it has had tremendous impact, and it is improving over time.

In this article, we make a plea for linguistic resources in the public domain and announce the public availability of the AGFL Grammar Work Lab and the EP4IR grammar and lexicon of English. We describe how a parser generated by the AGFL system from EP4IR can be used in practical applications.

