Computational Tools for Linguistic Data
This will be a half-day workshop at Leiden University, hosted by the Leiden
Centre for Linguistics and the Spinoza research group Lexicon and
Syntax, on Friday 15 March, 2002, devoted to computational tools
that linguists might make use of in their research, or linguistic
problems that arise in the development of such tools.
Time/Location: 1 p.m. (sharp) until 5 p.m.
Phonetics Laboratory, building 1175, room 107, Cleveringaplaats 1,
Presentations will be 45 minutes + 10 minutes for questions,
Musgrave and Jeroen
van de Weijer
Tentative programme and approximate time schedule:
1 p.m.-1.55 p.m.: Simon Musgrave (Leiden University): The Spinoza Typological database: towards integrating data entry and
Many typologists use databases in their work, but most of these contain
mainly discrete-valued variables, for which each language in the sample
is assigned a value. Supporting data may be included or referenced but
is not integrated closely with the generalisations. The Spinoza database
takes a different approach: the analyst enters primary data (ideally,
natural text) and is guided through the relevant analysis of each unit
of data depending on what sub-units are identified as present. The
result is detailed quantitative information about typological features
of natural languages. (A more detailed description is available.)
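As a rough illustration of the guided-entry idea described in this abstract, the following Python sketch shows how the set of analysis questions could depend on which sub-units the analyst identifies in a unit of text, and how the recorded answers could then be counted. It is not the Spinoza database's actual design: the schema, the feature names, the function names and the sample answers are all assumptions invented for this example.

```python
from collections import Counter

# Hypothetical analysis schema: for each sub-unit an analyst might identify,
# the follow-up features to record. Invented for illustration only.
FOLLOW_UPS = {
    "subject": ["position relative to verb", "case marking"],
    "object": ["position relative to verb", "case marking"],
    "verb": ["agreement"],
}


def prompts_for(present_subunits):
    """Return the (sub-unit, feature) questions for the sub-units found in one unit."""
    return [
        (subunit, feature)
        for subunit in present_subunits
        for feature in FOLLOW_UPS.get(subunit, [])
    ]


def tally(records):
    """Count how often each (sub-unit, feature, value) triple was recorded."""
    return Counter(
        (subunit, feature, value)
        for record in records
        for (subunit, feature), value in record.items()
    )


# Two analysed clauses; the answers are made-up placeholders, not real data.
clause_1 = {("object", "position relative to verb"): "before"}
clause_2 = {("object", "position relative to verb"): "before"}
counts = tally([clause_1, clause_2])
```

In this sketch a clause with no identified object never triggers the object questions, so the counts describe only the data actually observed in the text.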
1.55 p.m.-2.50 p.m.: Peter van der Kamp and Jesse de Does (Instituut voor Nederlandse Lexicologie): Corpus annotation and retrieval
Our presentation consists of two parts:
In the first half, we discuss our experiences with the annotation of corpora, with
examples from the forthcoming Parole Internet Corpus and Integrated Language
The Parole Internet Corpus is a corpus of written language, automatically annotated
with part of speech and lemma. The web-based retrieval application is scheduled for
The Integrated Language Database (covering 8th-21st century Dutch) will have three
interlinked components: a dictionary component (containing the dictionaries VMNW,
MNW and WNT), a diachronic text corpus component, and a component with
lexicons of historical and present-day Dutch.
In the second half we will give a demonstration of one of the INL Internet corpora.
We will also give a preview of the user interface of the Parole Internet Corpus.
Furthermore, some technical design issues of the ILD will be discussed.
(Further information is available.)
2.50-3.10 p.m.: coffee/tea break
3.10 p.m.-4.05 p.m.: Peter Wittenburg (MPI Nijmegen): EUDICO - tools for working with corpora
EUDICO is seen by the Max Planck Institute as the linguistic tool of the
future. It is considered to be a "universal" work bench for linguists
dealing with corpora as they are used at the MPI and elsewhere. EUDICO's
main purpose is to offer a set of general tools for browsing, viewing,
creating, editing, searching and analyzing collections of annotations on
digitized video and audio recordings of linguistically interesting
4.05 p.m.-5 p.m.: Maaike Schoorlemmer, Lennart Herlaar, Harmen van der Lest, Martin Everaert, Alexis Dimitriadis and Peter Ackema (Universiteit Utrecht): Designing a flexible database system for annotating linguistic data
See slides of this presentation
(PDF; 984 Kb).