AFRILEX 2011 @ UNAM

The 16th Annual International AFRILEX Conference
UNAM, Windhoek, Namibia, 5-7 July 2011

[Abstract:] Brits, J.H. & Rigardt Pretorius: The automatic lemmatiser as a tool to enhance access to Setswana dictionaries

Keywords:  Setswana morphology, root, stem, lemmatisation, Setswana dictionaries, nominal lemma, natural language processing, language acquisition.

Lexicographers have the difficult task of balancing user-friendliness, a budget, limited space and a variety of target audiences. The task is even more difficult for lexicographers of the Southern Bantu languages. One way of dealing with these problems would be to go the electronic way, as has been done successfully with the Northern Sotho dictionary. Unfortunately, no commercial electronic dictionary is available for Setswana at this stage.

At the Potchefstroom Campus of the North-West University, Setswana is taught as a foreign language to students who mainly speak Germanic languages, i.e. Afrikaans and English.[1] For many of them, Setswana is their first encounter with a Bantu language, and its grammar is vastly different from that of their own languages. A good learner's dictionary is indispensable for the acquisition of a new language, but as we shall argue, students taking Setswana as a foreign language find Setswana paper dictionaries difficult to use. However, some resources are available in the field of natural language processing, and in this paper we want to show how morphology and an automatic lemmatiser might be used to make existing dictionaries more accessible to students of Setswana.

Lemmatisation is a natural language processing procedure that determines the lemma of an input word. It can be used for text mining, for compiling indexes, lists and concordances, and for many other applications in corpus-based research. It plays an important role in the bigger picture of natural language processing, but it also has the potential to do a great deal for ordinary users of Setswana dictionaries.

In this paper we shall discuss how four Setswana dictionaries are structured. For some dictionaries, students need advanced linguistic knowledge to find the right lemma entry. It is therefore important to give an overview of the lemmatisation procedure followed in Setswana dictionaries. We consulted the following dictionaries: the Setswana-English dictionary of Brown (1988), the Setswana-Engels-Afrikaanse woordeboek of Snyman et al. (1990), the Setswana English Setswana dictionary of Matumo (1993) [2] and the Kompakte Setswana woordeboek of Dent (1994). We looked up the Setswana equivalents of two nouns (“woman” and “axe”) and two verbs (“to answer” and “to open”) and found the following results of the lemmatisation procedures in these dictionaries:

 

Setswana Dictionaries

Form                                   Brown (1988)  Snyman et al. (1990)  Matumo (1993)  Dent (1994)
-----------------------------------------------------------------------------------------------------
-sadi                                  -             +                     -              -
mosadi (“woman”)                       +             -                     +              +
basadi (“women”)                       +             -                     -              +
-lepe                                  -             +                     -              -
selepe (“axe”)                         +             -                     +              +
dilepe (“axes”)                        +             -                     +              +
araba (“answer”)                       +             +                     +              +
arabile (“answered”)                   -             -                     +              -
arajwa (“be answered”)                 -             +                     +              -
bula (“open”)                          +             +                     +              +
budisa (“let open”)                    -             +                     -              -
bulaka (“open wide”)                   -             -                     +              -
bulega (“become opened”)               +             +                     +              -
bulegile (“has been opened”)           -             -                     +              -
bulegileng (“that has been opened”)    -             -                     -              +
bulela (“open for”)                    +             -                     +              -
butswe (“was opened”)                  -             -                     +              -

+ = the form is entered as a lemma; - = no entry was found.
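The lookup results in the table can also be held in a small data structure, which makes the differences between the four lemmatisation conventions easy to query. The following Python sketch only transcribes the table; the names DICTIONARIES, ENTRIES, dictionaries_listing and forms_listed_in are chosen here for illustration and do not come from any of the dictionaries or from the lemmatiser.

```python
# Lookup results transcribed from the table: for each form, the set of
# dictionaries in which it was found as an entry.
DICTIONARIES = ("Brown 1988", "Snyman et al. 1990", "Matumo 1993", "Dent 1994")

ENTRIES = {
    "-sadi": {"Snyman et al. 1990"},
    "mosadi": {"Brown 1988", "Matumo 1993", "Dent 1994"},
    "basadi": {"Brown 1988", "Dent 1994"},
    "-lepe": {"Snyman et al. 1990"},
    "selepe": {"Brown 1988", "Matumo 1993", "Dent 1994"},
    "dilepe": {"Brown 1988", "Matumo 1993", "Dent 1994"},
    "araba": set(DICTIONARIES),
    "arabile": {"Matumo 1993"},
    "arajwa": {"Snyman et al. 1990", "Matumo 1993"},
    "bula": set(DICTIONARIES),
    "budisa": {"Snyman et al. 1990"},
    "bulaka": {"Matumo 1993"},
    "bulega": {"Brown 1988", "Snyman et al. 1990", "Matumo 1993"},
    "bulegile": {"Matumo 1993"},
    "bulegileng": {"Dent 1994"},
    "bulela": {"Brown 1988", "Matumo 1993"},
    "butswe": {"Matumo 1993"},
}


def dictionaries_listing(form):
    """Return the dictionaries that list `form`, in table column order."""
    found = ENTRIES.get(form, set())
    return [name for name in DICTIONARIES if name in found]


def forms_listed_in(dictionary):
    """Return the forms from the table that `dictionary` lists, in row order."""
    return [form for form, found in ENTRIES.items() if dictionary in found]


print(dictionaries_listing("basadi"))
print(forms_listed_in("Snyman et al. 1990"))
```

Querying by dictionary shows, for instance, that Snyman et al. (1990) enters the nouns only under the hyphenated forms -sadi and -lepe, whereas the other three dictionaries enter them with their prefixes.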

 

Secondly, we shall have a look at lemmatisation as it was applied in the rule-based automatic lemmatiser by Brits (2006). In natural language processing there are two main approaches to developing applications, namely the rule-based approach and the machine-learning approach. This automatic lemmatiser for Setswana follows the rule-based approach, which means that the hierarchy in morphological analysis suggested by Krüger (2006) and Kotzé (2005) played a major role in its development. In this section the terms “root”, “stem” and “lemma” will be discussed briefly from a morphological point of view.
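To give a rough idea of what "rule-based" means in practice, the Python sketch below applies an ordered list of suffix rewrite rules to a word. It is a toy illustration only: the two rules are assumptions read off the verb forms in the table (arabile, bulegile, bulegileng) and are not the rule set of Brits (2006); the names SUFFIX_RULES and strip_suffix are hypothetical.

```python
# Toy suffix rules, ordered from most to least specific. They are
# illustrative assumptions based on the example forms in this abstract,
# not the rules of the lemmatiser by Brits (2006).
SUFFIX_RULES = [
    ("ileng", "a"),  # bulegileng -> bulega
    ("ile", "a"),    # arabile -> araba, bulegile -> bulega
]


def strip_suffix(word):
    """Apply the first matching rule; return the word unchanged if none match."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


for form in ("arabile", "bulegile", "bulegileng", "bula"):
    print(form, "->", strip_suffix(form))
```

Forms such as arajwa and butswe, in which the stem itself differs from araba and bula, are not covered by these two rules, which hints at why a full rule-based lemmatiser has to rest on a detailed morphological analysis.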

This brings us to the research question of the paper: Can the morphological analysis presented in the automatic lemmatiser make Setswana dictionaries more accessible? We shall illustrate how we adapted this lemmatiser to help students in their search for words and meanings. The user enters a search word, and the lemmatiser gives as output (almost) all the possible forms of that word, to cater for all the lemmatisation techniques followed in paper dictionaries.
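The idea of returning several candidate forms for one search word can be sketched as follows. This Python fragment is a simplified assumption, not the adapted lemmatiser itself: it knows only the two singular/plural prefix pairs visible in the example nouns (mosadi/basadi and selepe/dilepe), and the names PREFIX_PAIRS and candidate_forms are hypothetical.

```python
# Singular/plural prefix pairs seen in the example nouns
# (mosadi/basadi, selepe/dilepe). A deliberately incomplete,
# illustrative list, not a description of the noun class system.
PREFIX_PAIRS = [("mo", "ba"), ("se", "di")]


def candidate_forms(word):
    """Return the search word plus other forms a paper dictionary might list."""
    candidates = [word]
    for singular, plural in PREFIX_PAIRS:
        for prefix in (singular, plural):
            if word.startswith(prefix) and len(word) > len(prefix):
                stem = word[len(prefix):]
                # Singular, plural and hyphenated stem cover the three
                # conventions found in the four dictionaries.
                for form in (singular + stem, plural + stem, "-" + stem):
                    if form not in candidates:
                        candidates.append(form)
    return candidates


print(candidate_forms("basadi"))
print(candidate_forms("selepe"))
```

A student who meets basadi in a text would thus be offered basadi, mosadi and -sadi, at least one of which is an entry in each of the four dictionaries. Because the prefixes are matched purely on spelling, such a sketch can also propose forms that do not exist, so the candidates are suggestions to look up rather than confirmed lemmas.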

We shall then report on the efficiency of the lemmatiser for the students of Setswana. In our experiment there will be two groups made up of students from different study years. The experimental group will answer a general questionnaire about their experiences with Setswana paper dictionaries and look up words (selected from reading texts) with the help of the lemmatiser.  The control group will answer the same questionnaire and look up the same words but without the lemmatiser.  In the last section we shall then discuss our findings and recommendations.

Bibliography

BRITS, J.H. 2006. Outomatiese Setswana lemma-identifisering [“Automatic Setswana lemmatisation”]. Potchefstroom : North-West University. (Thesis – MA).

BROWN, J.T. 1988. Setswana-English dictionary. Johannesburg : Pula Press. 593 p.

DENT, G.R. 1994. Kompakte Setswana woordeboek. Pietermaritzburg : Shuter & Shooter. 207 p.

KOTZÉ, A.E. 2005. Towards a morphological analyser for past tense forms in Northern Sotho: verb stems with final 'm' and 'n'. Southern African Linguistics and Applied Language Studies, 23(3): 245-258.

KRÜGER, C.J.H. 2006. Introduction to the morphology of Setswana. München : Lincom Europa. 314 p.

MATUMO, Z.I. 1993. Setswana English Setswana dictionary. 4th ed. Gaborone, Botswana : Macmillan Botswana. 647 p.

SNYMAN, J.W., SHOLE, J.S. & LE ROUX, J.C. 1990. Setswana-Engels-Afrikaanse woordeboek. Pretoria : Via Afrika. 527 p.



[1] We also teach Setswana as a first language on the Potchefstroom Campus, but dictionary use by mother-tongue speakers is not the focus of this paper.

[2] Based on the Setswana-English dictionary of Brown (1988).