The 16th Annual International AFRILEX Conference
UNAM, Windhoek, Namibia, 5-7 July 2011

[Abstract:] Heid, U. & Daan J. Prinsloo: Linking dictionary and corpus data in online language tools

Objectives. With a view to providing customized lexicographic data for individual users and usage situations, it has often been proposed to combine the structured data from a dictionary database with textual data from the internet or from corpora. Recently, Tarp (to appear) has called such solutions “lexicographic Model T Fords”, which “[…] link to the internet where already existing data is reused in order to satisfy the users’ specific needs”. An early implementation is Køhler-Simonsen’s (2006) specialized dictionary, ZooLex. Others will be discussed below.

In this paper, we argue in favour of linking corpus data to a dictionary more dynamically, in order to illustrate details of the use of multiword expressions (noun+verb collocations in our examples), with a view to text production needs. We show that, by using computational linguistic tools, corpus examples can not only be displayed, but also linguistically analysed and generalized, to give the dictionary user a clearer picture of their actual use in texts. This seems to go in the direction of what Tarp (to appear) calls “solutions based upon a recreation and re-representation of the data”, an approach which he sees as needed for the future (“lexicographic Rolls-Royces”).

Current solutions. Several online dictionaries provide some kind of access to corpus data. The technically simplest method (and the least useful one for users) is to juxtapose both types of resources in a common graphical interface (cf. DWDS[1], and the criticism by Asmussen, to appear). An alternative is to provide a portal which links from the dictionary to one or several corpora and to other internet resources; this is done in the 2009 version of Verlinde’s Base lexicale du français (BLF[2]), where the DAFLES[3] dictionary is linked, among others, to the OPUS website[4] of parallel corpora (cf. Verlinde, Leroyer, Binon 2009). Preliminary tests with 33 users in a usability laboratory (cf. Bank 2010) showed that most users find the portal function difficult to use: it was not clear to them when and why they ended up on websites not belonging to the dictionary, how to interpret the data given there, and how to navigate back to the dictionary itself.

A more focused approach is followed in a version of the Danish dictionary Den Danske Ordbog[5]: in its main user interface, readings and collocations are listed, and a small clickable icon “K” gives access to a KWIC (keyword-in-context) display of corpus sentences containing the respective collocation.

Proposed approach. We intend to follow this approach, by relating specific lexicographic data on collocations to a search engine for corpus data and an underlying corpus. The collocations contained in the dictionary provide the lemmas to search for, as well as the word class and grammatical relation of the collocation base and the collocate. Regular expression search is sufficient for languages with limited word and constituent order freedom. Cf. also recent developments in BLF (Verlinde 2011).
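The regular expression search mentioned above can be illustrated by a minimal sketch. It assumes that the dictionary supplies the base and a list of inflected forms of the collocate, and that the corpus is available as a list of sentences. The function names, the window size `max_gap` and the three toy sentences are our own illustrative assumptions, not part of any existing tool or corpus.

```python
import re

def collocation_pattern(base, collocate_forms, max_gap=3):
    """Regex matching base and collocate in either order,
    with at most `max_gap` words between them (assumed window size)."""
    base_re = re.escape(base)
    coll_re = "(?:" + "|".join(re.escape(f) for f in collocate_forms) + ")"
    # Up to max_gap intervening words, then the separator before the next item.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    return re.compile(
        r"\b(?:%s%s%s|%s%s%s)\b"
        % (base_re, gap, coll_re, coll_re, gap, base_re),
        re.IGNORECASE,
    )

def find_examples(sentences, base, collocate_forms, max_gap=3):
    """Return the sentences in which the collocation pattern is found."""
    pattern = collocation_pattern(base, collocate_forms, max_gap)
    return [s for s in sentences if pattern.search(s)]

# Invented toy sentences for illustration only (not taken from the Beeld corpus).
toy_sentences = [
    "Hy gee aandag aan die saak.",
    "Sy het baie aandag aan die probleem geskenk.",
    "Die saak kry min aandag.",
]

gee_hits = find_examples(toy_sentences, "aandag", ["gee", "gegee"])
skenk_hits = find_examples(toy_sentences, "aandag", ["skenk", "geskenk"])
print(len(gee_hits), len(skenk_hits))
```

Such a pattern only checks that the two items co-occur within a short window of the same sentence; it does not verify the grammatical relation between them, which is why it is adequate only where word and constituent order are relatively fixed.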

A test on Afrikaans aandag gee and aandag skenk (“pay attention”) in the 127 million words from Beeld[6] showed that the former is ca. 6 times more frequent than the latter. Similarly, a search for fela pelo (“be disheartened”) in the Northern Sotho Pretoria Sepedi Corpus (PSC, 5 million words, cf. De Schryver and Prinsloo 2000) gives a clear distribution over morphological forms, fela (302 hits), fele (135), felang (5), felwa (2), and provides useful information to the user in terms of other frequent collocations of pelo, e.g. beta pelo (92) (“take courage”), kwa pelo (28) (“hear/listen to one's heart”), hlomola pelo (71) (“feel sorry for”), etc. In the same way, the most frequent collocations for the user looking up the lemma ipona (“to see oneself”), such as ipona molato (42) (“see oneself guilty”), ipona phošo (23) (“see oneself at fault”) and ipona botlaela (9) (“see oneself foolish”), are culled from the corpus.
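To make the distribution over morphological forms easier to read, the hit counts reported above for fela pelo can be converted into relative shares. The following sketch uses only the four counts given in the text; the function name `form_distribution` is a hypothetical helper introduced for illustration.

```python
def form_distribution(hits):
    """Convert absolute hit counts per form into relative shares."""
    total = sum(hits.values())
    return {form: count / total for form, count in hits.items()}

# Hit counts for forms of fela in fela pelo, as reported for the PSC.
fela_pelo_hits = {"fela": 302, "fele": 135, "felang": 5, "felwa": 2}

shares = form_distribution(fela_pelo_hits)
for form, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{form:8s} {fela_pelo_hits[form]:4d}  {share:6.1%}")
```

On these figures, fela accounts for roughly 68% and fele for roughly 30% of the 444 hits, so the two remaining forms are marginal. A user interface could show such shares next to the corpus examples.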

For German, which has much case syncretism and a relatively free constituent order, results are somewhat poorer: in addition to true positives, the search also picks up sentences in which the two items in question do not form a collocation. For such languages, the use of syntactically analysed texts is therefore preferable. On this basis, Weller and Heid (2010) have proposed to extract not only example sentences for collocations, but also data about the morphosyntactic properties of the base and the collocate, such as number, determination, voice, tense, etc. Such data are extracted along with each example sentence and stored in a database. In a second step, preferences are calculated for each collocation. Such preferences (e.g. have high hopes typically occurs in the plural) can be signalled to the user (on demand) in the graphical user interface, along with relevant examples. This also concerns lexical variation in idioms and collocations; for example, the German idiom keinen Mucks machen (“not say a word”) has variants like keinen Mucks geben, keinen Mucks tun.
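The second step, the calculation of preferences, might be sketched as follows. This is not the procedure of Weller and Heid (2010), whose details are not given here, but a simple relative-frequency heuristic under our own assumptions: a feature value counts as a preference if it covers at least a share `threshold` of at least `min_count` extracted examples. The record layout, both parameter values and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def compute_preferences(records, threshold=0.8, min_count=5):
    """Find dominant morphosyntactic feature values per collocation.

    records: iterable of dicts with a 'collocation' string and a 'features'
    dict (hypothetical layout). Returns {collocation: {feature: (value, share)}}.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for rec in records:
        for feature, value in rec["features"].items():
            counts[rec["collocation"]][feature][value] += 1

    prefs = {}
    for colloc, features in counts.items():
        for feature, counter in features.items():
            total = sum(counter.values())
            value, n = counter.most_common(1)[0]
            # Keep only values that clearly dominate a sufficient sample.
            if total >= min_count and n / total >= threshold:
                prefs.setdefault(colloc, {})[feature] = (value, n / total)
    return prefs

# Invented toy records: 9 plural and 1 singular example, tense evenly split.
toy_records = [
    {"collocation": "have high hopes",
     "features": {"number": "plural" if i < 9 else "singular",
                  "tense": "present" if i % 2 == 0 else "past"}}
    for i in range(10)
]

prefs = compute_preferences(toy_records)
print(prefs)
```

With these toy records, only the number feature passes the threshold, so the interface would signal the plural preference and stay silent about tense.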

Proposed presentation. In the talk, we intend to present the main argument, the state of the art, as well as results of experiments on Afrikaans, German and Northern Sotho, and mock-up screens of a possible user interface. We also intend to address the possibilities and limitations of the proposed approach.


Asmussen, Jörg: “Combined Products: Dictionary and Corpus”, to appear in: Gouws et al. (Eds.): Dictionaries. An International Handbook, Vol. 5/4 (Berlin: De Gruyter).

Bank, Christina (2010). Die Usability von Online-Wörterbüchern und elektronischen Sprachportalen. Universität Hildesheim: MA-Dissertation, ms. 103 pp.

De Schryver, Gilles-Maurice, Danie J. Prinsloo (2000). The compilation of electronic corpora with special reference to the African languages. Southern African Linguistics and Applied Language Studies, 18(1-4): 89-106.

Køhler-Simonsen, Henrik (2006). “ZooLex: the wildest corporate reference work in town?”, in: Elisa Corino, Carla Marello, Cristina Onesti (Eds.): Atti del XII Congresso Internazionale di Lessicografia, EURALEX, Torino (Alessandria: Edizioni dell’Orso): 787-793.


Tarp, Sven: “Lexicographical and other e-tools for consultation purposes: towards the individualization of needs satisfaction”, to appear in: Pedro A. Fuertes Olivera, Henning Bergenholtz (Eds): e-Lexicography: The Internet, Digital Initiatives and Lexicography. (London/New York: Continuum), to appear 2011

Verlinde, Serge: “Modeling Interactive Reading, Translation and Writing Assistants”, to appear in: Pedro A. Fuertes Olivera, Henning Bergenholtz (Eds): e-Lexicography: The Internet, Digital Initiatives and Lexicography. (London/New York: Continuum), 2011

Verlinde, Serge, Patrick Leroyer, Jean Binon (2009): “Search and you will find. From stand-alone lexicographic tools to user-driven task and problem-oriented multifunctional leximats”, in: IJL 23:1 (2009): 1-17.

Weller, Marion, Ulrich Heid (2010): “Multi-parametric extraction of German multiword expressions from parsed corpora”, in: Proceedings of LREC-2010, Linguistic Resources and Evaluation Conference, Malta, 2010 [CD-ROM].



[3] DAFLES: Dictionnaire d'Apprentissage du Français Langue Étrangère ou Seconde



[6] We use a section of the Pharos Media24 Afrikaans corpus, made available to us by Pharos publishers. We gratefully acknowledge Pharos's contribution to the present work.