THE LEMMATIZATION OF OLD ENGLISH VERBS FROM THE SECOND WEAK CLASS ON A LEXICAL DATABASE

This article compiles a list of lemmas of the second class weak verbs of Old English by using the latest version of the lexical database Nerthus, which incorporates the texts of the Dictionary of Old English Corpus. Out of all the inflectional endings, the most distinctive have been selected for lemmatization: the infinitive, the inflected infinitive, the present participle, the past participle, the second person present indicative singular, the present indicative plural, the present subjunctive singular, the first and third person of the preterite indicative singular, the second person of the preterite indicative singular, the preterite indicative plural and the preterite subjunctive plural. When it is necessary to regularize, normalization is restricted to correspondences based on dialectal and diachronic variation. The analysis yields a total of 1,064 lemmas of weak verbs from the second class.

com). It contains 30,000 files of lemmatized forms, based primarily on Clark Hall and secondarily on Bosworth-Toller and Sweet. The initial headword list has been compiled by Martín Arista et al. (2011) and the meaning definitions provided by the standard dictionaries of Old English mentioned above have been synthesized by Martín Arista and Mateo Mendaza (2013). To briefly illustrate the functionalities of the version of Nerthus reviewed in this section, it may be pointed out, in the first place, that the database can return the number of textual occurrences of a lemma. For example, siðian appears 115 times in the texts. In the second place, the database can break down the occurrences by inflectional form. For instance, the verb wilnian occurs in the inflectional forms presented in Figure 1. Thirdly, the formalism used for representing the prefix ge- guarantees the direct link to the ge-prefixed counterparts of a given simple verb such as wilnian, as in Figure 2. Fourthly, the database lists the fragments in which a given inflectional form, such as wilnast, appears, as in Figure 3; the short titles of the fragments are based on Mitchell, Ball and Cameron (1975).
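The four functionalities just described can be pictured with a minimal sketch in Python. It is not the Nerthus implementation: the record layout, the function names, the fragment labels and the token counts are all assumptions made up for illustration, and only the verb wilnian and the form wilnast come from the text above.

```python
from collections import Counter

# Toy token records: (lemma, inflectional form, fragment label).
# All rows are invented for illustration; they are not corpus data.
TOKENS = [
    ("wilnian", "wilnast", "frag-A"),
    ("wilnian", "wilnast", "frag-B"),
    ("wilnian", "wilnode", "frag-A"),
    ("gewilnian", "gewilnode", "frag-C"),
]

def occurrences(lemma):
    """Number of textual occurrences filed under a lemma."""
    return sum(1 for lem, _, _ in TOKENS if lem == lemma)

def by_form(lemma):
    """Occurrences of a lemma broken down by inflectional form."""
    return Counter(form for lem, form, _ in TOKENS if lem == lemma)

def ge_counterpart(lemma):
    """Return the ge-prefixed counterpart of a simple verb, if it is filed."""
    candidate = "ge" + lemma
    return candidate if any(lem == candidate for lem, _, _ in TOKENS) else None

def fragments(form):
    """Fragments in which a given inflectional form appears."""
    return sorted({frag for _, f, frag in TOKENS if f == form})
```

Under these assumptions, a lemma is simply the key that ties together its tokens, which is the link between dictionary forms and textual forms that the database provides.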

AIMS AND RELEVANCE OF RESEARCH
This article deals with Old English verbs from the second weak class. Its aim is to compile a list of verbal lemmas from this morphological class based on the information provided by the version of the lexical database of Old English Nerthus, as reviewed in the previous section. Therefore, the ultimate source of the data for the analysis is the DOEC, which contains all surviving Old English texts, with a total of approximately 3,000 texts and 3 million words.
The class of the verb has been selected for the analysis because, overall, Old English verbs are morphologically more transparent than nouns and adjectives, which share practically the same endings in both the weak and the strong declension. Within verbs, the class of weak verbs, corresponding to the modern regular verbs, has been chosen rather than strong verbs, the counterpart of the modern irregular verbs. The reason is that the changes that weak verbs undergo in their inflection take place in the suffixal part of the word rather than in the root, as is the case with strong verbs. Consequently, strong verbs are harder to search. Finally, the second class of weak verbs displays fewer ambiguous inflectional endings than the first class, whose forms can be more easily mistaken for those of strong verbs.
There are several reasons why it is important to gather such a list of verbal lemmata and to file them into a database. In the first place, standard dictionaries of Old English, including An Anglo-Saxon Dictionary, A Concise Anglo-Saxon Dictionary and The Student's Dictionary of Anglo-Saxon, are complete, although they are not based on an extensive corpus of the language but on the partial list of sources provided by their prefaces or introductions. In the second place, The Dictionary of Old English is based on the corpus mentioned above, but is still in progress (the letter G was published in 2008). Thirdly, with the incorporation of the textual occurrences that correspond to each headword, Nerthus not only multiplies its size by one hundred but also changes in a qualitative way by linking dictionary forms (types) and textual forms (tokens). This, in turn, will allow us to make advances in the morphological analysis of the language and to carry out quantitative studies in textual frequency. Fourthly, the database format has clear advantages over online corpora. A database can be adapted to the specific needs of a particular piece of research. It can be sorted and searched in ways that online corpora cannot. A database facilitates the definition of relations between data that cannot be captured by online corpora. And the database format allows us to use simultaneously the corpus, the concordance and the index of the language of analysis. Finally, this work can be seen as a contribution to the research programme in the morphology and semantics of Old English represented by Martín Arista (2008, 2010a, 2010b, 2011a, 2011b, 2011c, 2012a, 2012b, 2012c, 2013a, 2013b, 2014), Martín Arista et al. (2011), Martín Arista and Mateo Mendaza (2013), Martín Arista and Cortés Rodríguez (2014) and Martín Arista and Vea Escarza (2016).
The remainder of this article is organized as follows. Section 3 presents the morphology of the second class of weak verbs in detail. Section 4 describes the methodology of analysis, which comprises lemmatization and normalization. The different inflectional forms as they appear in the texts have to be related to an abstract form or lemma inflected for a conventional form: in the case of verbs, the infinitive. For instance, given a textual form from the corpus like hopiað, it is associated with the infinitive hopian 'to hope' by means of a process of lemmatization. Quite often, it is necessary to regularize the forms by means of a process of normalization. For example, when we come across a form like healsie we relate it to an infinitive like hālsian 'to heal'. Section 5 presents the results of the analysis by inflectional form, lemma and normalization pattern. To close this work, Section 6 draws the main conclusions.

RELEVANT ASPECTS OF THE INFLECTION OF THE OLD ENGLISH VERB
This section deals with the characteristics of the three subclasses of weak verbs and their specific features in order to identify the most relevant features of the inflection of the verbs of the second class and to compile a list of formally distinctive inflectional endings that can be used as a starting point in the analysis. Pyles and Algeo (1982: 125) remark that weak verbs "formed their preterites and past participles in the characteristically Germanic way, by the addition of a suffix containing d or immediately after consonants, t". In contrast to strong verbs, these forms do not modify the stem of the verb. Hogg and Fulk (2011: 258) also point out that those suffixes were dental consonants with the function of marking the preterite or past tense. Thus, weak verbs added dental consonants rather than using ablaut or reduplication. In this respect, the most accepted theory is that weak verbs developed their preterite forms from a periphrasis. Pyles and Algeo (1982: 125) hold that many weak verbs were originally causative verbs derived from other categories, such as nouns or adjectives, by means of the "addition of a suffix with an i-sound that mutated the stem vowel of the word". Mitchell and Robinson (1993: 46) add that the stem vowel was normally the same throughout all the verbal forms of the paradigm, which reinforces the idea of regularity, and that the inflectional endings of strong and weak verbs showed many similarities, although they underwent different evolutions.
Weak class 1 is one of the largest groups of verbs of all the verbal classes in Old English, among other reasons as a result of the process of causative stem formation just mentioned. Class 1 of weak verbs is subdivided into two classes, illustrated by the verbs fremman 'to do' and hīeran 'to hear'. The paradigms of these weak verbs are presented in Figure 4, which is based on Mitchell and Robinson (1993: 46). A number of weak verbs had no vowel i before the dental preterite suffix in Proto-Germanic, with the consequence that they lack umlaut in the Old English preterite and past participle. In addition, their stems all ended in -l, as presented in Figure 5, or in a velar consonant with the alternation of tʃ <cc> and x <h>, as shown in Figure 6. Campbell (1987: 300) remarks that the 2nd and 3rd persons of the singular (present indicative) of class 1 weak verbs are subject to assimilation. The assimilations of consonants are presented in Figure 7, with an instance of each pattern.
glencþ (infinitive glengan 'to decorate')
Moving on to the characteristics of the next class, we find class 2 of weak verbs, the one on which this work focuses. Mitchell and Robinson (1993: 49) remark that this class of verbs "present few problems". As Hogg and Fulk put it (2011: 279), the peculiarity of this class of verbs lies in the fact that it was the only group of verbs which kept adding new verbs during the Old English period. The paradigms of the weak verbs lufian 'to love' (Mitchell and Robinson 1993: 49-50), identified as 'subclass 1', and the verb lofi(g)an 'to praise' (Hogg and Fulk 2011: 279-280), identified as 'subclass 2', are presented in Figure 8. Although Hogg and Fulk (2011: 280) notice that "the inflexions of weak verbs of class 2 are, with the exceptions discussed below, the same for all stems, regardless of weight", these verbs also present some peculiarities, such as contracted forms. As a result of the loss of intervocalic h, there were two stems within paradigms like smēagan 'to consider': smēag- and smēa- (Campbell 1987: 334). The last class of weak verbs is class 3. Hogg and Fulk (2011: 289) explain that "verbs of the third weak class in Germanic are in origin structurally parallel to those of the second weak class" and that the only reason why they became a different class is a vocalic alternation in the formation of the stem. There are just four verbs in class 3, habban 'to have', libban 'to live', secg(e)an 'to say' and hycg(e)an 'to think' (Campbell 1987: 337), whose paradigms can be seen in Figure 10.

METHODOLOGY
The analysis consists of two basic tasks, lemmatization and normalization. As Burkhanov (1998) explains, the first thing we should do when organizing the corpus on which a dictionary is built is to lemmatize the textual (inflected) forms found in the corpus. In this particular case, these are verbal forms from class 2 of weak verbs. In Burkhanov's (1998: 122) words, "the term 'lemmatization' is used to refer to the reduction of inflectional word forms to their lemmata, i.e. basic forms, and the elimination of homography (...) [i]n practice, lemmatization involves the assignment of a uniform heading under which elements of the corpora containing the word forms of same lexeme are represented." In this respect, Atkins and Rundell (2008: 325) point out that the headword "links all the information about one word together in one entry. In it goes the canonical form [italics as in the original] of the headword: the singular of nouns, the infinitive of verbs, the uninflected form of adjectives and adverbs, and so on". Furthermore, as Jackson (2002: 179) puts it, "the criteria for determining what is a headword have important consequences for lexical description as well as for accessibility".
In order to find the inflected forms of class 2 weak verbs, it is necessary first of all to choose a set of inflectional endings of these verbs that are representative of their morphology and are not found as inflectional endings in any other classes. The inflections of class 2 weak verbs selected for lemmatization are the infinitive (-ian), the inflected infinitive (-ianne), the present participle (-iende), the past participle (ge-od), the first person singular of the present indicative (-ie/ge-ige), the second person singular of the present indicative (-ast), the present indicative plural (-iað/-iaþ), the present subjunctive singular (-ie/ge-ige), the first/third person singular of the preterite indicative (-ode), the second person singular of the preterite indicative (-odest), the preterite indicative plural (-odon) and the preterite subjunctive plural (-oden). That is, the -i- and -o- characteristic of the second class that are present in the inflectional endings are taken as a distinctive feature that allows us to identify the verbal forms under analysis without ambiguity. These forms comprise the singular and the plural number, the finite and non-finite forms of the verb, the indicative and the subjunctive mood and the present and the preterite tense. Last but not least, these forms are also valid for looking for contracted verbs.
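The set of endings just listed can be written down as a small lookup table. The following Python sketch is only an illustration under stated assumptions: the names CLASS2_ENDINGS, GE_ONLY_ENDINGS and match_ending are hypothetical, the labels are descriptive strings chosen here, and the longest-ending-first matching is a design choice of the sketch rather than a documented feature of the database.

```python
# Hypothetical table of the class 2 endings listed above, keyed by suffix.
CLASS2_ENDINGS = {
    "ianne": "inflected infinitive",
    "iende": "present participle",
    "odest": "preterite indicative 2sg",
    "odon": "preterite indicative plural",
    "oden": "preterite subjunctive plural",
    "iað": "present indicative plural",
    "iaþ": "present indicative plural",
    "ian": "infinitive",
    "ast": "present indicative 2sg",
    "ode": "preterite indicative 1sg/3sg",
    "ie": "present indicative 1sg / present subjunctive singular",
}

# Endings that are only considered in combination with the prefix ge-.
GE_ONLY_ENDINGS = {
    "ige": "present indicative 1sg / present subjunctive singular",
    "od": "past participle",
}

def match_ending(form):
    """Return (stem, ending, label) for the longest matching ending, else None."""
    for ending in sorted(CLASS2_ENDINGS, key=len, reverse=True):
        if form.endswith(ending) and len(form) > len(ending):
            return form[: -len(ending)], ending, CLASS2_ENDINGS[ending]
    if form.startswith("ge"):
        for ending in sorted(GE_ONLY_ENDINGS, key=len, reverse=True):
            # Require at least one character between ge- and the ending.
            if form.endswith(ending) and len(form) > len(ending) + 2:
                return form[: -len(ending)], ending, GE_ONLY_ENDINGS[ending]
    return None
```

Trying longer endings first keeps a form in -odest from being read as a stem plus -ast, and restricting -od and -ige to ge-prefixed forms mirrors the restriction on short endings adopted in the analysis.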
The next step of the analysis is to extract the words ending with these inflections from the DOEC. This has not been done by means of the search engine provided by the online corpus but on the lexical database of Old English Nerthus, which comprises, as has been remarked in Section 1, a concordance by fragment and by word of the whole corpus, an index of the whole corpus that gives the number of occurrences of around 187,000 inflectional forms, and a 30,000-file database. The database format has a great advantage over the online corpus: it can search the results of previous searches. Thus, the process of lemma assignment advances on the basis of successive searches that gradually refine the results. With query strings like ==*ode, ==*ian, ==*iað, ==*iaþ, ==*ie, ==*ode, ==*ie and ==*iende the database returns verbal forms such as hogode, hogian, hogiað, hogiaþ, hogie, gehogode, gehagie and hogiende respectively. In the process of lemmatization, these inflectional forms are grouped under the basic form or lemma of hogian(ge) (2 occurrences). This does not mean that this process is automatic. In the first place, many undesired results are returned if the query segment is very short or unspecific. This is the reason why the endings -ige and -od have been searched only in combination with the prefix ge-, thus ge-ige, ge-od. In the second place, manual work is also needed to find forms that deviate from the paradigms provided by grammars, which tend to represent Early West-Saxon.
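The search-and-group procedure can be sketched as follows. This is a minimal Python illustration, not the database's query engine: the wildcard queries are approximated with fnmatchcase from the standard library, the names search, split_form and group_by_lemma are hypothetical, and stripping an initial ge- is a deliberately naive assumption that a real analysis would have to check by hand.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Sample forms taken from the text above.
FORMS = ["hogode", "hogian", "hogiað", "hogiaþ", "hogie", "gehogode", "hogiende"]

# Subset of class 2 endings used in this sketch.
ENDINGS = ("iende", "ode", "ian", "iað", "iaþ", "ie")

def search(forms, pattern):
    """Wildcard search; it can be run on the results of a previous search."""
    return [f for f in forms if fnmatchcase(f, pattern)]

def split_form(form):
    """Return (stem, has_ge) after removing the ending and a naive ge- prefix."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if form.endswith(ending) and len(form) > len(ending):
            stem = form[: -len(ending)]
            has_ge = stem.startswith("ge") and len(stem) > 2
            return (stem[2:] if has_ge else stem), has_ge
    return None

def group_by_lemma(forms):
    """Group forms under an infinitive in -ian, marking attested ge- with (ge)."""
    groups = defaultdict(list)
    prefixed = set()
    for form in forms:
        parts = split_form(form)
        if parts is None:
            continue  # left for manual inspection
        stem, has_ge = parts
        groups[stem + "ian"].append(form)
        if has_ge:
            prefixed.add(stem + "ian")
    return {lem + ("(ge)" if lem in prefixed else ""): fs for lem, fs in groups.items()}
```

For example, search(search(FORMS, "*ode"), "ge*") narrows the preterites down to the ge-prefixed one, which is the kind of refinement on previous results described above. A deviant form such as gehagie would end up under a separate candidate in this sketch, which is why manual work and normalization remain necessary.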
At this point, some sort of regularization is necessary that accommodates diachronic, dialectal or textual variants to the grammatical model. Normalization is, in fact, a part of the process of lemmatization and consists of the regularization of non-standard spellings. As Sweet (1976: xi) explains it, "it is often necessary to put the word where the user of the dictionary expects to find it. Therefore, when several spellings of a word appear in the texts, it is necessary to opt for one of them in a consistent way". For instance, inflected forms such as hersumie or gehersumiað are found under the lemma hīersumian(ge) (2 occurrences). A Concise Anglo-Saxon Dictionary provides an extensive list of the correspondences it uses for the normalization of Old English texts, but this list has not been used as such because it overnormalizes and has many circularities. Instead, the only correspondences that have been selected are those identified by Stark (1982) and de la Cruz (1986) as constituting instances of dialectal or diachronic variation. The dialect of reference is West-Saxon, in which most surviving texts are written.
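Normalization can be pictured as a lookup step applied to the stem before the lemma is built. The Python sketch below is an assumption-laden illustration: the table STEM_CORRESPONDENCES and the functions normalize_stem and lemma_for are hypothetical, the table holds only the two pairs that can be read off the examples in this article (hersum- to hīersum-, heals- to hāls-), and it does not reproduce the correspondences of Stark (1982) or de la Cruz (1986).

```python
# Hypothetical stem-level correspondences, limited to the examples in the text.
# The normalized stems carry the vowel length supplied by the dictionaries.
STEM_CORRESPONDENCES = {
    "hersum": "hīersum",
    "heals": "hāls",
}

def normalize_stem(stem):
    """Map a variant stem to its regularized spelling, if one is listed."""
    return STEM_CORRESPONDENCES.get(stem, stem)

def lemma_for(form):
    """Return (lemma, has_ge) for a form in -iað or -ie, else None.

    Stripping an initial ge- is a naive assumption of this sketch.
    """
    has_ge = form.startswith("ge")
    body = form[2:] if has_ge else form
    for ending in ("iað", "ie"):
        if body.endswith(ending) and len(body) > len(ending):
            return normalize_stem(body[: -len(ending)]) + "ian", has_ge
    return None
```

An explicit table of this kind keeps every regularization traceable to a stated correspondence, which is one way of avoiding the overnormalization mentioned above.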
Finally, the dictionaries have been necessary for assigning vowel length to lemmas because the DOEC does not mark vowel length. The following section presents the results of the application of the methodology just described.

CONCLUSION
This article has compiled a list of lemmas of the second class weak verbs of Old English by using the latest version of the lexical database Nerthus, which incorporates the texts of the DOEC. Since this is the beginning of the lemmatization task of the Nerthus Project, the most transparent morphological class has been chosen for the analysis, the class 2 weak verb. Out of all the inflectional endings, the most distinctive have been selected for lemmatization: the infinitive (-ian), the inflected infinitive (-ianne), the present participle (-iende), the past participle (ge-od), the second person present indicative singular (-ast), the present indicative plural (-iað/-iaþ), the present subjunctive singular (-ie/ge-ige), the first and third person of the preterite indicative singular (-ode), the second person of the preterite indicative singular (-odest), the preterite indicative plural (-odon) and the preterite subjunctive plural (-oden). A total of 187,000 inflectional forms have been searched for these endings. The searches have been launched on the lexical database of Old English Nerthus, which has also filed the results of this analysis and provided a reference list of class 2 weak verbs extracted from its 30,000-word list of lexemes. When it has been necessary to regularize, normalization has been restricted to a number of correspondences based on dialectal and diachronic variation.
A total of 1,064 lemmas of weak verbs from the second class have been found, of which 285 were not on the reference list of Nerthus. Since Nerthus is based on the standard dictionaries of Old English and provides the information of the dictionary by Clark Hall in an exhaustive way, it seems reasonable to draw the conclusion that after this analysis we have a more accurate knowledge of the relationship between Old English texts and the dictionaries of the language as regards the second class of weak verbs. Moreover, except for The Dictionary of Old English, dictionary entries do not contain inflectional forms. Given that The Dictionary of Old English has been published up to the letter G only, the analysis of the letters H-Y that has been carried out in this work may be seen as a contribution to the field. Apart from proposing lemmas, this work has also helped to improve the information on some lemmas that already appear in dictionaries. This is the case with verbs to which, given the textual evidence, it is necessary to add the prefix ge-, as, for instance, āmerian, blyssian, cwylmian, dwelian, fynegian, langian, etc.

Figure 4. The paradigm of class 1 weak verbs fremman 'to do' and hīeran 'to hear'.

Figure 7. Assimilation in the 2nd and 3rd person of the singular number.