Computational Tools for Linguistic Data

This half-day workshop will be held at Leiden University on Friday 15 March 2002, hosted by the Leiden Centre for Linguistics and the Spinoza research group Lexicon and Syntax. It is devoted to computational tools that linguists might use in their research, and to linguistic problems that arise in the development of such tools.

Time: 1 p.m. (sharp) until 5 p.m.
Location: Phonetics Laboratory, building 1175, room 107, Cleveringaplaats 1, Leiden
Each presentation will last 45 minutes, followed by 10 minutes for questions, discussion and feedback.

Organisers: Simon Musgrave and Jeroen van de Weijer

Tentative programme and approximate time schedule:

1 p.m.-1.55 p.m.: Simon Musgrave (Leiden University): The Spinoza Typological database: towards integrating data entry and data analysis
Many typologists use databases in their work, but most of these contain mainly discrete-valued variables, with each language in the sample assigned a value for each variable. Supporting data may be included or referenced but is not closely integrated with the generalisations. The Spinoza database takes a different approach: the analyst enters primary data (ideally, natural text) and is guided through the relevant analysis of each unit of data, depending on which sub-units are identified as present. The result is detailed quantitative information about typological features of natural languages. (A more detailed description is available.)

1.55 p.m.-2.50 p.m.: Peter van der Kamp and Jesse de Does (Instituut voor Nederlandse Lexicologie): Corpus annotation and retrieval 
Our presentation consists of two parts:
In the first half, we discuss our experiences with the annotation of corpora, with examples from the forthcoming Parole Internet Corpus and the Integrated Language Database (ILD).
The Parole Internet Corpus is a corpus of written language, automatically annotated with part of speech and lemma. The web-based retrieval application is scheduled for autumn 2002.
The Integrated Language Database (covering 8th-21st century Dutch) will have three interlinked components: a dictionary component (containing the dictionaries VMNW, MNW and WNT), a diachronic text corpus component, and a component with lexicons of historical and present-day Dutch.
In the second half, we will demonstrate one of the INL Internet corpora and preview the user interface of the Parole Internet Corpus.
We will also discuss some technical design issues of the ILD.
(Further information is available here.)


2.50-3.10 p.m.: coffee/tea break


3.10 p.m.-4.05 p.m.: Peter Wittenburg (MPI Nijmegen): EUDICO - tools for working with corpora
EUDICO is seen by the Max Planck Institute as the linguistic tool of the future. It is considered to be a "universal" workbench for linguists dealing with corpora as they are used at the MPI and elsewhere. EUDICO's main purpose is to offer a set of general tools for browsing, viewing, creating, editing, searching and analyzing collections of annotations on digitized video and audio recordings of linguistically interesting phenomena. (Further information is available here.)

4.05 p.m.-5 p.m.: Maaike Schoorlemmer, Lennart Herlaar, Harmen van der Lest, Martin Everaert, Alexis Dimitriadis and Peter Ackema (Universiteit Utrecht): Designing a flexible database system for annotating linguistic data
See slides of this presentation (PDF; 984 Kb).



