Corpus of Historical Low German

HeliPaD: Introduction to the corpus

The language

Old Saxon (also known as Old Low German) is a West Germanic language spoken in the area of what is now northern Germany before 1100 AD. It is usually thought to be the ancestor of the Middle Low German language, though the extent to which there is continuity between the language represented in the extant Old Saxon texts and that represented in Middle Low German texts is a matter of debate. Old Saxon is transmitted in two main texts: the Heliand (which represents the vast majority of attested Old Saxon), and a verse translation of Genesis. In addition, there are a number of shorter texts of no more than a few paragraphs each, as well as a number of glosses.

The text

The Heliand is a gospel harmony written in alliterative verse, and a very loose translation of the Latin Diatessaron. In total, 5,983 lines have been preserved, in six manuscripts: C (Cotton), M (Monacensis), S (Straubing), V (Vatican), P (Prague), and L (Leipzig). The S, V, P and L manuscripts are extremely limited in extent, and none of them contains a continuous stretch of more than a hundred lines. The M and C manuscripts are the main witnesses to the text. While the M manuscript contains a number of gaps, the C manuscript (Cotton Caligula A VII, British Library) is complete up to line 5,968. The text is divided into 71 sections, called fitts.

There exist two main editions of the Heliand: Sievers (1878), a broadly diplomatic edition of manuscripts C and M, and Behaghel (1903 and subsequent editions), the standard critical edition.

The corpus

This corpus contains all 5,968 lines of the C manuscript of the Heliand, using the Sievers (1878) edition. Compared to the standard Behaghel critical edition, the Sievers edition has three advantages for linguistic research: a) it does not conflate the different forms found in different manuscripts, b) it is not as heavily emended, and c) it is now in the public domain.

The corpus is a UTF-8 plain text file designed to be searched using the program CorpusSearch 2, with the standard extension .psd, broadly following the format of the Penn Corpora of Historical English and related projects (IcePaHC, Early New High German Parsed Corpus, MCVF). It is annotated on a number of levels, each described in its own section of this introduction.

The total size of the corpus is 46,067 words (not including punctuation and code).

Orthography

The corpus is encoded in UTF-8 and contains certain special characters, such as barred b and d. Word forms are kept as they are in the Sievers edition. Where words have been broken up to facilitate parsing, the site of the break is marked with a dollar sign ($).

Textual and metrical annotation

Textual and metrical annotation is POS-tagged as CODE and contained within angle brackets. The elements are listed below in their order of precedence.

  • Sievers edition page: e.g. P_7
  • Manuscript page: e.g. MS_5a
  • Fitt: e.g. F_1
  • Line: e.g. R_1
  • Caesura (half-line break): C
  • Other comments (mostly omissions): e.g. COM:OMISSION
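
The annotation elements listed above can be told apart mechanically. The following is a minimal Python sketch, not part of the corpus tooling: the function name classify_code and the category labels are hypothetical, and it assumes that each element appears exactly as in the examples, wrapped in a single pair of angle brackets (e.g. <R_1>).

```python
import re

# Assumption: each annotation element sits inside one pair of angle
# brackets, in the shapes given in the list above (<P_7>, <MS_5a>, <C>, ...).
CODE_PATTERN = re.compile(
    r"<(P|MS|F|R)_([^<>\s]+)>"   # edition page, manuscript page, fitt, line
    r"|<(C)>"                    # caesura
    r"|<COM:([^<>]+)>"           # other comments
)

# Hypothetical labels, chosen for this sketch only.
LABELS = {
    "P": "edition_page",
    "MS": "manuscript_page",
    "F": "fitt",
    "R": "line",
}


def classify_code(element):
    """Return (kind, value) for an annotation element, or None if unrecognized."""
    match = CODE_PATTERN.fullmatch(element)
    if match is None:
        return None
    if match.group(1):
        return (LABELS[match.group(1)], match.group(2))
    if match.group(3):
        return ("caesura", None)
    return ("comment", match.group(4))


if __name__ == "__main__":
    for example in ["<P_7>", "<MS_5a>", "<F_1>", "<R_1>", "<C>", "<COM:OMISSION>"]:
        print(example, classify_code(example))
```

Values are returned as strings because manuscript pages such as 5a are not purely numeric.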

Lemmatization

A significant difference between the Penn Corpora of Historical English and the HeliPaD (and a property that the HeliPaD shares with the IcePaHC) is that the HeliPaD is lemmatized. The lemma is given after the word form and separated by a hyphen: thus, for the second person singular present indicative of the verb "to be" (wesan), what is found in the corpus is bist-wesan.
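
For processing outside CorpusSearch, a form-lemma pair such as bist-wesan can be split at the hyphen. The Python sketch below is illustrative only: the function name split_form_lemma is hypothetical, and it assumes that the lemma itself contains no hyphen, so that the last hyphen in the string is the separator.

```python
def split_form_lemma(leaf):
    """Split a 'form-lemma' string into (form, lemma).

    Assumption: the lemma contains no hyphen, so the LAST hyphen is taken
    as the separator. Strings without a hyphen are returned with a lemma
    of None.
    """
    form, separator, lemma = leaf.rpartition("-")
    if not separator:
        return leaf, None
    return form, lemma


if __name__ == "__main__":
    print(split_form_lemma("bist-wesan"))  # ('bist', 'wesan')
```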

Lemmas are based in form on Köbler's freely available Old Saxon dictionary, minus length markings. To search for a word when you don't know its lemma, the easiest way is to look it up in Köbler. (Note that my assignment of forms to lemmas is not always the same as Köbler's.)

Some words are, unfortunately, indistinguishable by lemma. A small minority of these are also indistinguishable by POS-tag: these include bord "edge" and bord "shield", and ger "year" and ger "spear".

In compounded words (joined by a plus sign, +), only the head of the compound is lemmatized. In practice, these are instances of prefixation with GE+ or NEG+ and can be identified morphologically.

Tokenization

A token is, broadly speaking, a main verb and everything that belongs with it. In many cases, it will be a "sentence", in pretheoretical terms. The main exception is when two independent clauses with finite verbs are conjoined, in which case these are treated as separate tokens.

The token is enclosed in brackets, and consists of a parse followed by a token ID, each of which is itself enclosed in brackets. The ID takes the form OSHeliandC.foo.bar, where foo is simply a sequential number starting at 1 and bar is the range of lines spanned by the token. For instance, OSHeliandC.265.502-503 is the ID for a token that starts on line 502 and ends on line 503 and is the 265th token in total.
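
The ID format described above can be unpacked programmatically, for example to map search hits back to line numbers. The following Python sketch is one way to do so; the names TokenId and parse_token_id are hypothetical. The source only shows an ID with a line range, so the handling of an ID with a single line number (no hyphen) is an assumption.

```python
import re
from typing import NamedTuple


class TokenId(NamedTuple):
    text: str        # text label, e.g. "OSHeliandC"
    number: int      # sequential token number, starting at 1
    first_line: int  # line on which the token starts
    last_line: int   # line on which the token ends


ID_PATTERN = re.compile(
    r"(?P<text>[^.\s]+)\.(?P<number>\d+)\.(?P<first>\d+)(?:-(?P<last>\d+))?"
)


def parse_token_id(token_id):
    """Parse an ID such as 'OSHeliandC.265.502-503' into its components."""
    match = ID_PATTERN.fullmatch(token_id)
    if match is None:
        raise ValueError(f"unrecognized token ID: {token_id!r}")
    first = int(match["first"])
    # Assumption: an ID without a hyphenated range covers a single line.
    last = int(match["last"]) if match["last"] else first
    return TokenId(match["text"], int(match["number"]), first, last)


if __name__ == "__main__":
    print(parse_token_id("OSHeliandC.265.502-503"))
```

For the example ID given above, this yields token number 265 with first line 502 and last line 503.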

HeliPaD vs. DDD

The corpus overlaps with the version of the Heliand produced as part of the Referenzkorpus Altdeutsch (DDD), but there are a number of differences. The DDD version is based on the Behaghel edition and contains only very shallow parsing. Unlike this corpus, however, it contains annotation for alliteration and indicates the stem class of nouns and verbs. The two resources are thus to some extent complementary.