AFRILEX 2011 @ UNAM

The 16th Annual International AFRILEX Conference
UNAM, Windhoek, Namibia, 5-7 July 2011

[Abstract:] Prinsloo, D.J.: A critical analysis of the lemmatisation of nouns and verbs in isiZulu

The publication of the first isiZulu dictionary to use a word lemmatisation strategy instead of the traditional stem lemmatisation strategy reopens the debate on stem versus word lemmatisation in African languages. In particular, the question is whether the problem of stem identification, which has proved to be the major stumbling block for learners trying to find lemmas in isiZulu dictionaries, has been solved. To date, most publications on lemmatisation in the African languages have contrasted disjunctively written languages (e.g. Sepedi, Setswana and Sesotho) with those with a conjunctive orthography (e.g. isiZulu, Siswati and isiXhosa) in terms of the advantages and disadvantages of stem versus word lemmatisation. It was mainly argued that stem lemmatisation is an accepted, or even the best, strategy for conjunctively written languages, but that word lemmatisation is a better option for disjunctively written languages, mainly because stem lemmatisation introduces unnecessary problems for the user of a dictionary of a disjunctively written language, e.g. having to identify nominal stems. Nevertheless, the stem tradition, supported by certain assumptions such as that it is the more scientific option, gained such momentum that a number of stem dictionaries were compiled for the Sotho languages. Word lemmatisation for conjunctively written languages was considered by Van Wyk (1995), and preliminary experiments were conducted at some of the National Lexicography Units in South Africa on the feasibility and possible advantages of word lemmatisation for conjunctively written languages. However, it was only in 2010, with the publication of the Oxford Bilingual School Dictionary: Zulu and English (OZSD), that the almost sacred stem tradition of lemmatisation for a Nguni language was broken by using word lemmatisation for an isiZulu dictionary.

The focus of this paper differs from previous publications in three respects. First, the issue of stem identification takes centre stage. Secondly, the advantages and disadvantages of stem versus word lemmatisation will not be described in terms of conjunctively versus disjunctively written languages, but in terms of the benefits versus shortcomings of these approaches for the conjunctively written Nguni languages, isiZulu being a case in point. Thirdly, although a selection of examples will be offered, the analysis of examples will focus on a paradigm of 2,525 occurrences of different words containing the stem sebenza 'work' occurring 5 times or more in the Pretoria isiZulu Corpus (PZC).
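As a rough illustration of how such a paradigm could be extracted from a corpus, the following sketch counts word forms that contain a given root and keeps those at or above a frequency threshold. The token list is invented toy data built from word forms quoted in this abstract, not PZC data. The function name `paradigm` and the substring match on `sebenz` (chosen so that forms ending in -i or -ela also match) are assumptions made for illustration only.

```python
from collections import Counter

def paradigm(tokens, root, min_freq=5):
    """Return {word form: frequency} for forms containing `root`
    that occur at least `min_freq` times in `tokens`."""
    counts = Counter(t.lower() for t in tokens)
    return {w: n for w, n in counts.items() if root in w and n >= min_freq}

# Invented toy token list (NOT taken from the Pretoria isiZulu Corpus).
toy_tokens = ["ukusebenza"] * 6 + ["imisebenzi"] * 5 + ["sebenzela"] * 2

result = paradigm(toy_tokens, root="sebenz", min_freq=5)
total = sum(result.values())  # total occurrences across the retained forms
```

In this toy list `sebenzela` falls below the threshold and is dropped, so only two forms remain in the paradigm.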

A consolidation of the most prominent views on stem versus word lemmatisation which lie scattered over a number of publications will also be attempted. Finally, the success or potential of electronic dictionaries to solve stem identification problems which cannot be solved in paper dictionaries, irrespective of the lemmatisation strategy, will be evaluated.

The discussion focuses on strict stem lemmatisation, lemmatising stems and suffixes, left-expanded article structures, word lemmatisation and lemmatisation in electronic dictionaries. Consider the following simplified examples, where boldface indicates the lemma:

Strict stem lemmatisation

sebenza

Stem plus suffixes

sebenzela

Left-expanded

imisebenzi, ukusebenza

Word lemmatisation

imisebenzi
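The four strategies can be contrasted with a minimal sketch that returns the string under which a user would have to search for a given word. The segmentations are supplied by hand for the word forms in the examples above, and no automatic morphological analysis is attempted. The class `Form`, the function `lemma_for` and the strategy labels are hypothetical names; treating left-expanded entries as alphabetised under the stem is an assumption consistent with the remark in this abstract that stem identification remains necessary for that strategy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Form:
    prefix: str      # hand-segmented prefix ("" if none)
    stem: str        # stem as it occurs in the word, suffixes included
    basic_stem: str  # stem with suffixes removed (assumed segmentation)

    @property
    def word(self) -> str:
        return self.prefix + self.stem

def lemma_for(form: Form, strategy: str) -> str:
    """Return the string under which the user must look the word up."""
    if strategy == "strict_stem":
        return form.basic_stem
    if strategy in ("stem_plus_suffixes", "left_expanded"):
        # Left-expanded articles display the prefix, but the entry is
        # assumed to be alphabetised under the stem.
        return form.stem
    if strategy == "word":
        return form.word
    raise ValueError(f"unknown strategy: {strategy}")

imisebenzi = Form("imi", "sebenzi", "sebenzi")
ukusebenza = Form("uku", "sebenza", "sebenza")
sebenzela = Form("", "sebenzela", "sebenza")
```

Under word lemmatisation `imisebenzi` is searched under `i`, whereas the three stem-based strategies all require the user first to isolate a stem beginning with `s`.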

It will be concluded that the weakest option for lemmatising nouns and verbs in isiZulu is the strict stem strategy, where the lemma is the basic stem: in the case of verbs, the verbal stem without suffixes, and in the case of nouns, the noun stem without nominal prefixes. This lemmatisation strategy is not user-friendly: stem identification is a major obstacle, a huge amount of knowledge of morphophonetics is presupposed, and the user is often in doubt whether (s)he has successfully retrieved the information. Even if the user manages to identify the stem and look it up, all the additional information conveyed by the affixes has to be ‘added back on’, and the user will not know for sure whether (s)he has come to the right conclusion. Lemmatising verb stems with their suffixes represents a slight improvement: at least the meanings of the suffixes need not be artificially added on, as in the case of strict stem lemmatisation.

Lemmatising stems with their prefixes is a better option because the user has the advantage of seeing the full form of infinitive verbs and the full forms of nouns with additional information such as tonal indication. This strategy is more user-friendly but stem identification remains problematic and a substantial amount of knowledge of morphophonetics is still presupposed.

Word lemmatisation as applied to nouns is by far the better strategy because nouns can be looked up under their first letter. For non-derived nominal forms, the problem of stem identification is solved for all nouns. This strategy is especially beneficial for those nouns where stem identification is problematic. The strategy is user-friendly and no knowledge of the grammar is presupposed. However, for nominal and verbal derivations, especially those where nominal and verbal stems occur with huge clusters of circumfixes, the problem of stem/word identification remains unsolved.

The problem of word/stem identification, which is present in all of the lemmatisation strategies employed for isiZulu, can only be solved in electronic dictionaries. Most electronic dictionaries are mere translated word lists and are not of much use to the target users, especially for their productive needs. A clear exception is isiZulu.net, where the problem of stem/word identification has been solved for most of the frequently used words in isiZulu, but more comprehensive electronic isiZulu dictionaries are required to alleviate the need for stem/word identification for less frequently used words as well.
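One way an electronic dictionary might remove the need for stem identification is a full-form index that maps each word, exactly as typed, to its entry. The sketch below is an assumption made for illustration and does not describe how isiZulu.net or any other existing dictionary is implemented. The entry data are limited to the word forms quoted in this abstract, and `build_index` and `look_up` are hypothetical names.

```python
def build_index(entries):
    """Map every listed full word form to the lemma of its entry."""
    index = {}
    for lemma, forms in entries.items():
        for form in forms:
            index[form.lower()] = lemma
    return index

def look_up(index, typed_word):
    """Return the lemma for a word as the user typed it, or None."""
    return index.get(typed_word.strip().lower())

# Hand-built toy entries: lemma (stem) -> full forms that belong to it.
entries = {
    "sebenza": ["ukusebenza", "sebenzela"],
    "sebenzi": ["imisebenzi"],
}
index = build_index(entries)
```

With such an index the user types the full word and is taken to the entry without having to isolate the stem; coverage is then limited only by the number of full forms the index lists, which matches the point made above about less frequently used words.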

Reference

(OZSD) De Schryver, G.-M. (Editor). Oxford Bilingual School Dictionary: Zulu and English. First Edition. 2010, Cape Town: Oxford University Press Southern Africa.