The Leiden Centre for Linguistics

Constructing lexica

A one-day workshop at Leiden University, hosted by the Leiden Centre for Linguistics and the Spinoza Project "Lexicon and Syntax", on constructing lexica.

In the course of their work, many linguists find it necessary to construct lexica. Naturally, the content and structure of these lexica vary considerably depending on the task at hand, be it language description, theoretical analysis, lexicography or computational linguistics. However, some issues must be addressed by all those who build lexica, particularly when the computer, with its requirement to make structure explicit, is an essential tool in the process. These include:

  • what sort of data structures should be used to represent lexical knowledge?
  • what are the boundaries of the lexicon in relation to a syntactic component? in relation to encyclopaedic knowledge?
  • what trade-offs must be made between comprehensiveness and practicality, both in terms of breadth (number of items included) and depth (information about each item)?
  • how feasible is it for linguists working in different areas and different frameworks to combine their lexica and avoid duplication?
  • to what extent are we guided by principle in making decisions about the design of lexica?

This workshop will include contributions from linguists working on a wide range of problems using different techniques, but all crucially dependent on structured representations of lexical knowledge. We are sure that their responses to the issues raised above will lead to stimulating and valuable discussion.

For catering purposes (coffee and lunch will be provided), it is necessary to book a place at the workshop. Please do so by contacting Simon Musgrave.

Date: Friday, 13 September, 2002
Time/Location: 10 a.m. until 5 p.m.
"Centraal Faciliteitengebouw", building 1175, Cleveringaplaats 1, Leiden
Room 1175-148 (for the whole day).

Organisers: Crit Cremers, Simon Musgrave and Jeroen van de Weijer

Tentative programme and approximate time schedule:

Each presentation will be 30 minutes plus 10 minutes for questions, discussion and feedback, followed by 5 minutes for tea/coffee (available in the room).

10.00 - 10.30  Coffee
10.30 - 11.15  Maarten Janssen (Utrecht): Sticking to shallow lexicons
11.15 - 12.00  Peter Austin (Melbourne & Frankfurt): Writing dictionaries for endangered languages
12.00 - 12.45  Koenraad Kuiper (Canterbury NZ & NIAS): Constructing a syntactically annotated ‘dictionary’ of phrasal lexemes
12.45 - 13.45  Lunch
13.45 - 14.30  Jack Hoeksema (Groningen): A lexicon for linguists: the lexicon of Dutch negative polarity items.
14.30 - 15.15  Dirk Geeraerts (Leuven): Lexical labeling and variational corpus linguistics
15.15 - 16.00  Wim Peters (Sheffield): Is this a way forward? Towards an Open and Distributed Lexical Infrastructure
16.00 - 16.15  Tea break
16.15 - 16.30  Crit Cremers: Introduction to general discussion
16.30 - 17.00  General discussion
17.00 -        Drinks



Maarten Janssen (Utrecht): Sticking to shallow lexicons
A wide variety of linguistic phenomena urge linguists to add rich 
information to the lexicon: for a correct analysis of for instance 
bridging, coercion, presupposition, and anaphora resolution, the noun 
seems to bring a wide range of semantic features to the sentence. 
Therefore, many current lexicons provide rich semantic typings with 
dot-objects, qualia-structures, attribute-value matrices, etc.
But rich semantic typing should be handled with great care. In my 
thesis, I present a system in which lexical items are related to a 
highly structured interlingua; a system which is particularly good at 
dealing with lexical gaps in a multilingual setting. For the success 
of this set-up, the elements of the interlingua have to be taken as 
very shallowly typed. In this talk I will give a number of reasons,
illustrated with examples, why we should be very careful about adding
rich semantic and ontological information to the lexicon.
Amongst these: perceptual features often play a role in the 
interpretation of complex phenomena, but perceptual information 
cannot be coherently modelled in a symbolic framework. Generics often 
play a role in linguistic phenomena, but generics are not even well 
understood. And even denotational, prototypical and episodic 
information seems to play a role, which cannot even be systematically 
related to interpersonal/interlingual concepts.

Dirk Geeraerts: Lexical labeling and variational corpus linguistics
Corpus linguistics may contribute in roughly three different ways to 
lexicography: first, by broadening the descriptive basis of dictionary 
making; second, by replacing manual work by automatic or semi-automatic 
analyses (as in the attempt to develop systems for the automatic 
disambiguation of polysemic words); and third, by refining traditional 
types of lexicographical information (as in collocational analyses).
Within this last domain, however, relatively little attention has so far been
devoted to what is a major concern for dictionaries and lexica, i.e. the
labeling of lexical items with regard to their variational properties.
To be sure, corpus linguistics has devoted a lot of attention to
variational issues (as in all kinds of stylometric studies), but the focus has
been on the identification of different types of texts rather than on
methods for assigning variational labels to lexical items. Against this 
background, I will describe our research group's current line of research, 
in which we develop a corpus-based method of variational analysis for 
lexical items.

Wim Peters: Is this a way forward? Towards an Open and Distributed Lexical Infrastructure
In this talk I will present ideas and observations on the feasibility and design of an Open and Distributed Lexical Infrastructure for lexical content description and interoperability. The realization of a common platform for interoperability between different fields of linguistic activity (such as lexicology, lexicography and terminology) and Semantic Web development will provide a flexible common environment not only for linguists, terminologists and ontologists, but also for content providers and content management software vendors.

This envisaged framework will make lexical resources usable within the emerging Semantic Web scenario. It will be based on open content interoperability standards, and will involve the participation of designers, developers and users.

The framework should facilitate the integration of the linguistic information resulting from existing lexical resources and standardization initiatives, and provide ways to bridge the differences between various perspectives on language structure and linguistic content.

This approach requires, among other things, the coverage of a range of aspects pertaining to linguistic modeling, and a number of organizational aspects, such as the design of a new abstract model of lexicon architecture that offers the structural bandwidth necessary to allow the inclusion, building, maintenance, access and tuning of such complex, shared and distributed lexical repositories. Structural flexibility must be ensured, so as to allow easy and varied import and export of various lexicon types (from very complex to very simple).

Since there are various possible scenarios for approaching this problem, extensive consultation and experimentation are needed to determine the best architecture for the representation and implementation of the lexical infrastructure. We foresee an increasing number of well-defined linguistic data categories stored in open and standardized repositories, which users will draw on to define their own structures within an open lexical framework. It is this reuse of linguistic objects that will link new content to already existing lexical objects.

The standardization effort will involve the extension and integration of existing and emerging open lexical and terminological standards and best practices such as EAGLES, ISLE, TEI, OLIF, Martif (ISO 12200) and Data Categories (ISO 12620). Initiatives towards the creation of lexical metadata such as IMDI, Dublin Core and OLAC will be taken into account.

Peter Austin: Writing dictionaries for endangered languages
The preparation and publication of dictionaries for endangered languages raise a number of issues that are special and differ from the kinds of challenges faced by lexicographers of larger languages. In this paper I discuss some of these challenges, illustrating them with examples from my research on a number of highly endangered Australian Aboriginal languages.

Koenraad Kuiper: Constructing a syntactically annotated ‘dictionary’ of phrasal lexemes
There are a sizeable number of dictionaries of English idioms in existence (e.g. Cowie, Mackin and McCaig 1975, 1983, Courteney, 1983). There are also works attempting to represent what the native speaker knows by way of phrasal lexemes (PLs) in general (Mel’cuk 1995). None of these represent the phrase structure of the items they contain, probably for the reason that this information is, for the most part, redundant because most PLs have normal phrase structure. However, there are two reasons why one might want to look at the phrase structure of PLs. The first is to have a database against which to test hypotheses about the phrase structure of such items (Kuiper and Everaert 1996, 2000). Most researchers in the field adopt the strategy of looking at idioms apparently at random. A sizeable annotated database of such items would allow for more substantial testing. The second reason is as a resource for those working in natural language understanding and machine translation. The problem that PLs pose for such systems is to assess whether a particular phrase is a PL as well as a freely created phrase, or only one or other of these. In any event, the system must try to guess whether the phrase is being used as a freely generated phrase or as a lexical item to be looked up. If the system parses the incoming text stream, then a syntactically annotated database of PLs provides richer string matching potential. Also, if the phrase structure of PLs is constrained then the search for PLs in the incoming text stream can have its search space reduced.

This talk outlines the setting up of a database of 10,000 syntactically annotated idioms, now virtually complete, together with its conventions and potential uses.
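
As an illustration only, the idea of using phrase structure to narrow the search for phrasal lexemes in parsed text might be sketched as follows. This is not the database described in the talk: the entry format, the three sample entries, and the names `PhrasalLexeme`, `build_index` and `candidate_pls` are all hypothetical, and the sketch assumes the input phrase has already been parsed and lemmatised.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class PhrasalLexeme:
    """One hypothetical entry: lemma sequence plus a phrase-structure label."""
    lemmas: Tuple[str, ...]
    category: str  # e.g. "VP" or "NP" (assumed annotation scheme)
    gloss: str


# Hypothetical sample entries, invented for this sketch.
PL_DATABASE = [
    PhrasalLexeme(("kick", "the", "bucket"), "VP", "die"),
    PhrasalLexeme(("spill", "the", "bean"), "VP", "reveal a secret"),
    PhrasalLexeme(("red", "herring"), "NP", "misleading clue"),
]


def build_index(entries: Sequence[PhrasalLexeme]) -> Dict[str, List[PhrasalLexeme]]:
    """Group entries by phrase category so lookups only scan one category."""
    index: Dict[str, List[PhrasalLexeme]] = {}
    for entry in entries:
        index.setdefault(entry.category, []).append(entry)
    return index


def candidate_pls(phrase_lemmas: Sequence[str], phrase_category: str,
                  index: Dict[str, List[PhrasalLexeme]]) -> List[PhrasalLexeme]:
    """Return entries of the same category whose lemmas occur contiguously
    in the parsed phrase. A hit is only a candidate: the phrase may still
    be a freely generated one."""
    matches = []
    for entry in index.get(phrase_category, []):
        n = len(entry.lemmas)
        for i in range(len(phrase_lemmas) - n + 1):
            if tuple(phrase_lemmas[i:i + n]) == entry.lemmas:
                matches.append(entry)
                break
    return matches


index = build_index(PL_DATABASE)
vp_hits = candidate_pls(["kick", "the", "bucket"], "VP", index)
np_hits = candidate_pls(["a", "red", "herring"], "NP", index)
no_hits = candidate_pls(["red", "herring"], "VP", index)
```

Because the index is keyed on phrase category, a phrase parsed as a VP is never compared against NP entries, which is one simple way in which constrained phrase structure could reduce the search space.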

Earlier workshop: Computational Tools for Linguistic Data


Editor: Jeroen van de Weijer
Tel. 31-71-527 2101
Last update: 10/15/02 12:03