Corpus of Historical Low German

HeliPaD and the Penn Historical Corpora


This page is for experienced users of other Penn historical corpora such as the YCOE, PPCME2, IcePaHC, ENHG Parsed Corpus, etc. If you're a novice user of such corpora, you'd do better to consult the full POS annotation manual and syntactic annotation manual. On the whole, the HeliPaD closely follows the conventions of the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE).

In the HeliPaD manual, where the HeliPaD follows other Penn historical corpora, the text will be marked like this.

Where the HeliPaD does its own thing, on the other hand, the text will be marked like this.

This page is a one-stop shop for the differences between the HeliPaD and the YCOE.

General differences

The corpus is in UTF-8, and thus, as in the IcePaHC but unlike in the YCOE, special characters such as barred b and d are not given any special annotation: they are simply present in the text. Like the IcePaHC, the HeliPaD is also lemmatized, with the lemma given after the word form and separated by a hyphen.
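The lemma convention can be handled in a few lines when reading leaves. The following is a minimal sketch, not part of the HeliPaD tooling: the function name `split_leaf` and the example tokens are hypothetical, and it assumes that the last hyphen in a leaf is the form/lemma separator.

```python
from typing import Tuple


def split_leaf(token: str) -> Tuple[str, str]:
    """Split a 'form-lemma' leaf into (form, lemma).

    Assumption: the LAST hyphen separates word form from lemma, so a
    hyphen inside the word form itself is left untouched.
    """
    form, sep, lemma = token.rpartition("-")
    if not sep:
        # No hyphen at all: treat the whole token as an unlemmatized form.
        return token, ""
    return form, lemma


# Hypothetical tokens, for illustration only.
print(split_leaf("form-lemma"))
print(split_leaf("com-pound-lemma"))
```

Because the file is plain UTF-8, no decoding of special characters is needed before or after the split.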

The corpus contains specific textual and metrical annotation, which, however, follows the general principles of the Penn corpora: these annotations are enclosed in angle brackets and receive the POS-tag CODE.
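For counts or searches over the running text, these annotations usually need to be skipped. A minimal sketch, assuming leaves have already been read into (POS tag, text) pairs; that representation, the function names and the placeholder annotation are all hypothetical:

```python
from typing import List, Tuple

Leaf = Tuple[str, str]  # (POS tag, leaf text); a hypothetical representation


def is_code_leaf(pos: str, text: str) -> bool:
    """True for textual/metrical annotation: tag CODE, text in angle brackets."""
    return pos == "CODE" and text.startswith("<") and text.endswith(">")


def strip_code(leaves: List[Leaf]) -> List[Leaf]:
    """Return the leaves with all CODE annotations removed."""
    return [(pos, text) for pos, text in leaves if not is_code_leaf(pos, text)]


# "<annotation>" is a placeholder, not an actual HeliPaD code.
sample = [("CODE", "<annotation>"), ("PRO^N^1^SG", "form-lemma")]
print(strip_code(sample))
```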

Morphological differences

Major differences

The most significant difference between the HeliPaD and the YCOE is that the HeliPaD makes much more extensive use of attributes. In the YCOE, case is the only attribute that is regularly annotated, and is indicated by means of a caret delimiter (e.g. N^D for a dative noun). The HeliPaD extends the logic of this approach and adds person and number. Where the three attributes co-occur, the order is case, then person, then number. All nominal elements are annotated for case and number; pronouns are also annotated for person, e.g. ik would be PRO^N^1^SG. All finite verbs are annotated for person and number. For full details see Additional attributes. The overall approach is "maximalist": any element that can receive attributes must receive them, even where there is formal ambiguity. Ambiguity mostly relates to case, and is resolved by the preference hierarchy N over A over D over G over I. Where there is number ambiguity, SG is preferred over PL, and PL over DU.
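The attribute scheme lends itself to mechanical parsing. The sketch below is illustrative only and is not the corpus's own tooling: the function names are hypothetical, the person values 2 and 3 are assumed by analogy with the 1 in PRO^N^1^SG, and the attribute kind is inferred from the value itself.

```python
from typing import Dict, Iterable, Optional, Sequence

# Listed in order of preference under formal ambiguity.
CASES = ("N", "A", "D", "G", "I")
NUMBERS = ("SG", "PL", "DU")
PERSONS = ("1", "2", "3")  # assumption: only "1" is attested in the text above
ORDER = ("case", "person", "number")


def parse_tag(tag: str) -> Dict[str, Optional[str]]:
    """Split a tag such as PRO^N^1^SG into its base and attributes."""
    base, *attrs = tag.split("^")
    parsed: Dict[str, Optional[str]] = {
        "base": base, "case": None, "person": None, "number": None,
    }
    last = -1
    for attr in attrs:
        if attr in CASES:
            kind = "case"
        elif attr in PERSONS:
            kind = "person"
        elif attr in NUMBERS:
            kind = "number"
        else:
            raise ValueError(f"unknown attribute {attr!r} in {tag!r}")
        position = ORDER.index(kind)
        if position <= last:
            # Enforce the order case, then person, then number.
            raise ValueError(f"attributes out of order in {tag!r}")
        parsed[kind] = attr
        last = position
    return parsed


def preferred(candidates: Iterable[str], hierarchy: Sequence[str]) -> str:
    """Pick the candidate that ranks highest in the given hierarchy."""
    return min(candidates, key=hierarchy.index)


print(parse_tag("PRO^N^1^SG"))
print(preferred({"D", "A"}, CASES))      # a form ambiguous between A and D
print(preferred({"PL", "SG"}, NUMBERS))  # a form ambiguous between SG and PL
```

The `preferred` helper mirrors the annotation policy for ambiguous forms; it says nothing about how a given form's candidate values are determined in the first place.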

In connection with the attributes, the HeliPaD has an extra tag for each type of participle when it is inflected: VNI alongside VN, VGI alongside VG, etc. These forms take attributes exactly as adjectives do. If it is possible to treat an adjective as a participial form, I have done so. Formally uninflected elements are treated as VN, VG etc.; only participles with agreement endings are tagged as VNI, VGI etc.

The Penn corpora treat many subordinators as prepositions, with corresponding levels of structure. There is no rationale for this in Old Saxon, where prepositions and subordinators are almost completely distinct classes. Those subordinators that are homophonous with adverbs are tagged as such, and form ADVPs in SpecCP. Other subordinators are treated as C elements. Three P elements (butan, newa and newan) can take a CP complement. Of these, newa is the only one that usually introduces clauses without that, and it is usually treated as C.

Elements that are morphologically wh-words, such as the wh-indefinite hwilik, are tagged using the W* tags even when they are not part of an extraction structure. Since Old Saxon is particularly flexible in using wh-words as indefinites, this is quite important. Syntactic role is disambiguated at phrasal level.

Minor differences

Forms of werthan are tagged RD*/RG*/RN*, as in the IcePaHC and ENHG Parsed Corpus.

Proper nouns are tagged as NPR (not NR as in the YCOE).

Adverbs do not bear extended tags ^T, ^L and ^D for temporal, locative and directional, as they do in the YCOE. This information is retrievable from the phrasal extended label and from the lemma.
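When comparing query results across the two corpora, the proper-noun and adverb differences can be neutralized by rewriting YCOE-style tags. This is only a sketch covering those two differences, not a full conversion: the function name is hypothetical, and it does not add the number attributes that HeliPaD nominals carry.

```python
ADVERB_SUBTYPES = {"T", "L", "D"}  # temporal, locative, directional


def normalize_ycoe_tag(tag: str) -> str:
    """Rewrite a YCOE-style tag to be comparable with HeliPaD tags.

    Handles only NR -> NPR and the removal of ^T/^L/^D from adverbs.
    """
    base, *attrs = tag.split("^")
    if base == "NR":
        base = "NPR"
    elif base == "ADV":
        # ^D is stripped only on adverbs; on nominals it marks dative case.
        attrs = [a for a in attrs if a not in ADVERB_SUBTYPES]
    return "^".join([base, *attrs])


print(normalize_ycoe_tag("NR^N"))
print(normalize_ycoe_tag("ADV^T"))
print(normalize_ycoe_tag("N^D"))
```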

The word ok is tagged ALSO (the cognate is ADV in the YCOE). ALSO does not head a phrase, may modify adjectives, and often co-occurs with conjunctions within a CONJP.

Inflected infinitives are not given special treatment, unlike in the YCOE. They can always be retrieved due to their co-occurrence with TO within an IP-INF.

The ambiguity tags VBP and VBD etc., for formally ambiguous indicative/subjunctive/imperative verbs, are not used. (Verb form classification follows Köbler.)

The tag AX*, for auxiliary verbs, is not used.

The FP (focus particle) and XX (problematic word) tags are not used in the HeliPaD, mainly since there is no call for them in the current material.

The tag RP is closed class, and used for the particles an, to, up and ut. It does not occur prefixed to verbs as in the YCOE.

The tag GE has a one-to-one mapping with the prefix gi-. It never occurs independently, but always prefixed/cliticized to a verbal form. Nominal gi- is not tagged in this way, except in the context of gihwilik and gihwe.

Where a weak adjective is used nominally (i.e. without a noun head), it normally retains its adjectival tag, unlike in the YCOE.

The adjectives mikil and luttil are tagged as adjectives, even when they are clearly quantifiers. Cognates in the YCOE and other Penn corpora are treated in the exact opposite way.

Some apparently quantificational elements such as wiht, eowiht etc. are treated as nouns in the HeliPaD rather than as quantifiers as the corresponding items are in the YCOE.

al may be treated as an adverb (ADV) in some cases, particularly when introducing al so-clauses.

Syntactic differences

Major, following Penn

In general, nominal extended labels for arguments in the HeliPaD work like they do in the PPCME2 and IcePaHC and not as they do in the YCOE. The nominal extended labels for arguments are -SBJ, -OB1, -OB2, and -PRD. These replace the YCOE's phrasal case labels, though the two types of object are not used in exactly the same way (see below). See Noun phrase extended labels and the following subsections for detail.
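A phrase label can be decomposed into its category and extended labels with a simple split. The sketch below is illustrative: the function names are hypothetical, and it assumes the usual Penn convention that a trailing numeric element is a coindexation index rather than an extended label.

```python
from typing import List, Optional, Tuple

ARGUMENT_LABELS = {"SBJ", "OB1", "OB2", "PRD"}


def split_label(label: str) -> Tuple[str, List[str], Optional[int]]:
    """Split e.g. 'NP-OB1-1' into ('NP', ['OB1'], 1)."""
    category, *rest = label.split("-")
    index = None
    if rest and rest[-1].isdigit():
        # Assumption: a trailing number is an index, not an extended label.
        index = int(rest.pop())
    return category, rest, index


def argument_function(label: str) -> Optional[str]:
    """Return the argument label of an NP, or None if it has none."""
    category, extended, _ = split_label(label)
    if category != "NP":
        return None
    for ext in extended:
        if ext in ARGUMENT_LABELS:
            return ext
    return None


print(split_label("NP-OB1-1"))
print(argument_function("NP-SBJ"))
print(argument_function("NP-ADT"))
```

Under this scheme a search for subjects only needs the label itself, whereas a YCOE-style search would test the phrasal case label instead.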

Within other phrases, excluding IPs, PTPs and CPs but including noun phrases, NPs are indicated as possessive (NP-POS) if this is their function (as it usually is within NPs themselves), and unmarked otherwise (as is always the case within PPs). This differs from the approach taken in the YCOE, which labels constituents for case. The default treatment of NPs is to attach them as high as possible in the structure: this means treating constituents as NP-ADT rather than arguments of a non-verbal element.

Major, different policy

In the YCOE, nominal appositive constituents are either contained within, or indexed to, the constituent to which they are in apposition. In the HeliPaD, they are instead treated as sisters to that constituent (or indexed to a sisterhood position). This enables a much less cluttered clausal representation, and the intended apposition relations are usually very clear semantically.

Unlike in the YCOE, traces are marked with all the same extended labels as the moved element itself. The extended labels of these traces are identical in every way to those of overt constituents.

The HeliPaD's approach to conjunction differs in three important ways from that of the YCOE and other Penn corpora. First, single-word conjuncts are treated in exactly the same way as other conjuncts. Secondly, any extended labels borne by the root node are inherited by the two conjuncts. Thirdly, shared pre- and post-head modifiers are simply included under the root node. For examples see the section on Conjunction.

The two types of object -OB1 and -OB2 are used, broadly speaking, for accusative objects and for dative objects respectively. With certain verbs, genitive objects can also be either -OB1 or -OB2, depending on the case of the other object. Consult the treatment of individual words for details. With one verb, lerian, both objects can be accusative objects, with the people being taught as -OB2.

When modifiers (for instance, floated quantifiers) are separated from a head with which they agree, these are traced to the head. Unlike in the YCOE, this is the case regardless of whether they are case-marked.

Minor differences

Arbitrary PRO in ECM infinitives is indicated by *arb*, as it is in the PPCME2 and IcePaHC (but not the YCOE).

In the HeliPaD, single-word modifiers do not project a phrase, even when they follow the head.

If an NP immediately dominates only a modified modifier, the modifier is treated as the head, and the extra level omitted: for instance, an NP headed by a quantifier that is itself modified by an adverb.

ADJPs can be headed only by adjectives, inflected participles, and possessive pronouns, and not by participle phrases or by quantifiers as in the YCOE.

In the YCOE, the first independent clause following a verb of saying is included in the parse as the complement of the verb of saying, whereas later independent clauses are treated as separate tokens. In the HeliPaD, these clauses are always treated as independent tokens.

In the HeliPaD, unlike in the YCOE, the label -LFD is used systematically with all clausal categories, including CP-ADV (e.g. in if ... then constructions).

Raising to subject is not usually explicitly represented in the HeliPaD.

For IPs, the extended label -SUB-CON is not used in the HeliPaD, as it is redundant and all instances of conjoined subordinate clauses can be retrieved in other ways. The label -ABS (for infinitival absolutes) is also not used.

For CPs, the HeliPaD does not include -CAR (clause-adjoined relatives), -CLF (clefts), -EXL (exclamatives), and -EOP (gapped infinitival relative/purpose clauses), as these structures are essentially not found in the Heliand.

In the HeliPaD, non-argumental that-clauses (e.g. with resultative meaning) are still labelled CP-THT and not CP-ADV. The HeliPaD is more liberal than the YCOE in its use of CP-THT, which is used broadly for any clause introduced by that, as well as some in the scope of negation introduced by ne, unless they are instances of CP-DEG. CP-THT is thus essentially a formal rather than functional label. CP-THT does not need to be the complement of a verb or adjective, and is usually unindexed at IP level.

Purpose clauses headed by a genitive demonstrative are not treated as CP-ADV but as CP-FRL-ADT.

Titles are never labelled as separate appositive phrases in the HeliPaD, whether modified or not.

NP-COM is not used, and nouns are assumed not to have nominal complements. NP-POS or clause-level adjuncts take over much of the work of this label in the HeliPaD.

True case attraction is never annotated at phrase level (since the HeliPaD does not have phrase-level case labels), but there may be a mismatch between (gap) grammatical function and word-level case if a treatment as apposition would lead to a relative clause with both an empty complementizer and an empty operator.

Adverbs are not treated as taking complements in the HeliPaD. Their apparent complements are parsed as clausal adjuncts or arguments.