Constructing lexica
A one-day workshop at Leiden University, hosted by the Leiden
Centre for Linguistics and the Spinoza Project "Lexicon and
Syntax", on constructing lexica.
In the course of their work, many linguists
find it necessary to construct lexica. Naturally, the content and
structure of these lexica vary considerably depending on the task
at hand, be it language description, theoretical analysis,
lexicography or computational linguistics. However, some issues
must be addressed by all those who build lexica, particularly when
the computer is an essential tool in the process, with its
requirement to make structure explicit. These include:
- what sort of data structures should be used to represent lexical
knowledge?
- what are the boundaries of the lexicon in relation to a syntactic
component? in relation to encyclopaedic knowledge?
- what are the trade-offs which it is necessary to make between
comprehensiveness and practicality, both in terms of breadth
(number of items included) and depth (information about each
item)?
- how feasible is it for linguists working in different areas and
different frameworks to combine their lexica and avoid
duplication?
- to what extent are we guided by principle in making decisions
about the design of lexica?
This workshop will include contributions from
linguists working on a wide range of problems using different
techniques, but all crucially dependent on structured
representations of lexical knowledge. We are sure that their
responses to the issues raised above will lead to stimulating and
valuable discussion.
For catering purposes (coffee and lunch will
be provided), it is necessary to book a place at the workshop.
Please do so by contacting Simon Musgrave.
Date: Friday, 13 September, 2002
Time/Location: 10 a.m. until 5 p.m.
"Centraal Faciliteitengebouw", building 1175, Cleveringaplaats 1,
Leiden
Room 1175-148 (for the whole day).
Organisers: Crit Cremers, Simon Musgrave and Jeroen van de Weijer
Tentative programme and approximate time schedule:
Presentations will be 30 minutes, plus 10 minutes for questions,
discussion and feedback, followed by 5 minutes for tea/coffee
(available in the room).
Abstracts:
Maarten Janssen (Utrecht): Sticking to shallow lexicons
A wide variety of linguistic phenomena urge linguists to add rich
information to the lexicon: for a correct analysis of for instance
bridging, coercion, presupposition, and anaphora resolution, the noun
seems to bring a wide range of semantic features to the sentence.
Therefore, many current lexicons provide rich semantic typings with
dot-objects, qualia-structures, attribute-value matrices, etc.
But rich semantic typing should be handled with great care. In my
thesis, I present a system in which lexical items are related to a
highly structured interlingua; a system which is particularly good at
dealing with lexical gaps in a multilingual setting. For the success
of this set-up, the elements of the interlingua have to be taken as
very shallowly typed. In this talk I will give a number of reasons,
illustrated with examples, why we should be very careful with adding
rich semantic and ontological information to the lexicon.
Amongst these: perceptual features often play a role in the
interpretation of complex phenomena, but perceptual information
cannot be coherently modelled in a symbolic framework. Generics often
play a role in linguistic phenomena, but generics are not even well
understood. And even denotational, prototypical and episodic
information seems to play a role, which cannot even be systematically
related to interpersonal/interlingual concepts.
Dirk Geeraerts: Lexical labeling and variational corpus linguistics
Corpus linguistics may contribute in roughly three different ways to
lexicography: first, by broadening the descriptive basis of dictionary
making; second, by replacing manual work by automatic or semi-automatic
analyses (as in the attempt to develop systems for the automatic
disambiguation of polysemic words); and third, by refining traditional
types of lexicographical information (as in collocational analyses).
Within the last of these domains, however, relatively little attention has so far been
devoted to what is a major concern for dictionaries and lexica, i.e. the
labeling of lexical items with regard to their variational properties.
To be sure, corpus linguistics has devoted a lot of attention to
variational issues (as in all kinds of stylometric studies), but the focus has
been on the identification of different types of texts rather than on
methods for assigning variational labels to lexical items. Against this
background, I will describe our research group's current line of research,
in which we develop a corpus-based method of variational analysis for
lexical items.
Wim Peters: Is this a way forward? Towards an Open and Distributed Lexical Infrastructure
In this talk I will present ideas and observations on the feasibility
and design of an Open and Distributed Lexical Infrastructure for
lexical content description and interoperability.
The realization of a common platform for interoperability
between different fields of linguistic activity - such as
lexicology, lexicography, terminology - and Semantic Web
development will provide a flexible common environment not only
for linguists, terminologists and ontologists, but also for
content providers and content management software vendors.
This envisaged framework will make lexical resources usable
within the emerging Semantic Web scenario. It will be based on
open content interoperability standards, and will involve the
participation of designers, developers and users.
The framework should facilitate the integration of the
linguistic information resulting from existing lexical resources
and standardization initiatives, and provide ways to bridge the
differences between various perspectives on language structure and
linguistic content.
This approach requires, among other things, the coverage of a range
of aspects pertaining to linguistic modeling, and a number of
organizational aspects, such as the design of a new abstract model
of lexicon architecture that offers the structural bandwidth
necessary to allow the inclusion/building/maintaining/accessing/tuning
of such complex, shared and distributed lexical repositories.
Structural flexibility must be ensured, so as to allow easy and
varied import and export of various lexicon types (from very
complex to very simple).
Since there are various possible scenarios for approaching this
problem, extensive consultation and experimentation are needed to
determine the best architecture for the representation and
implementation of the lexical infrastructure. We foresee an
increasing number of well-defined linguistic data categories
stored in open and standardized repositories, which will be used
by users to define their own structures within an open lexical
framework. It is this re-use of linguistic objects that will
link new content to already existing lexical objects.
The standardization effort will involve the extension and
integration of existing and emerging open lexical and
terminological standards and best practices such as EAGLES, ISLE,
TEI, OLIF, Martif (ISO 12200) and Data Categories (ISO 12620).
Initiatives towards the creation of lexical metadata such as IMDI,
Dublin Core and OLAC will be taken into account.
Peter Austin: Writing dictionaries for endangered languages
The preparation and publication of dictionaries for endangered
languages raise a number of issues that are special and differ
from the kinds of challenges faced by lexicographers of larger
languages. In this paper I discuss some of these challenges,
illustrating them with examples from my research on a number of
highly endangered Australian Aboriginal languages.
Koenraad Kuiper: Constructing a syntactically annotated ‘dictionary’ of phrasal lexemes
There are a sizeable number of dictionaries of English idioms in existence
(e.g. Cowie, Mackin and McCaig 1975, 1983, Courteney, 1983). There are also
works attempting to represent what the native speaker knows by way of phrasal
lexemes (PLs) in general (Mel’čuk 1995). None of these represent the phrase
structure of the items they contain, probably for the reason that this
information is, for the most part, redundant because most PLs have normal
phrase structure. However, there are two reasons why one might want to look at
the phrase structure of PLs. The first is to have a database against which to
test hypotheses about the phrase structure of such items (Kuiper and Everaert
1996, 2000). Most researchers in the field adopt the strategy of looking at
idioms apparently at random. A sizeable annotated database of such items would
allow for more substantial testing. The second reason is as a resource for
those working in natural language understanding and machine translation. The
problem that PLs pose for such systems is to assess whether a particular phrase
is a PL as well as a freely created phrase, or only one or the other of these. In
any event, the system must try to guess whether the phrase is being used as a
freely generated phrase or as a lexical item to be looked up. If the system
parses the incoming text stream, then a syntactically annotated database of PLs
provides richer string matching potential. Also, if the phrase structure of
PLs is constrained then the search for PLs in the incoming text stream can have
its search space reduced.
This talk outlines the setting up of a database of 10,000 syntactically
annotated idioms, now virtually complete, together with its conventions and
potential uses.
Earlier workshop: Computational Tools for Linguistic Data