AlvisNLP

corpus processing engine

TreeTagger

Synopsis

Runs tree-tagger.

Description

TreeTagger applies tree-tagger to the annotations in wordLayer by generating an appropriate input file. This file contains one line for each annotation. The first column, the token surface form, is the value of the formFeature feature. The second column, the token's predefined POS tag, is the value of the posFeature feature. The third column, the token's predefined lemma, is the value of the lemmaFeature feature. If posFeature or lemmaFeature is not defined, then the second and third columns are left blank.

The tree-tagger binary is specified by treeTaggerExecutable, and the language model to use is specified by parFile. Additionally, a lexicon file can be given through lexiconFile.

If sentenceLayer is defined, then TreeTagger treats the annotations in this layer as sentences. Sentence boundaries are reinforced by passing tree-tagger an additional end-of-sentence marker.

Once tree-tagger has processed the corpus, TreeTagger adds the predicted POS tag and lemma to the respective posFeature and lemmaFeature features of the corresponding annotations.

If recordDir and recordFeatures are both defined, then tree-tagger predictions are written to the recordDir directory, one file per section. recordFeatures is an array of the feature names to record. An additional feature, n, is recognized as the annotation ordinal in the section.

Snippet

<treetagger class="TreeTagger">
    <parFile></parFile>
    <treeTaggerExecutable></treeTaggerExecutable>
</treetagger>
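
As an illustration, the snippet above might be filled in as follows. This is only a sketch: both paths are hypothetical placeholders, not locations that AlvisNLP or tree-tagger require, and should be replaced with the paths of the actual installation.

```xml
<treetagger class="TreeTagger">
    <!-- Hypothetical path to a tree-tagger language model (parameter file) -->
    <parFile>/path/to/treetagger/lib/english.par</parFile>
    <!-- Hypothetical path to the tree-tagger binary -->
    <treeTaggerExecutable>/path/to/treetagger/bin/tree-tagger</treeTaggerExecutable>
</treetagger>
```

With only the two mandatory parameters set, every optional parameter keeps the default value listed below.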

Mandatory parameters

parFile

Mandatory
Type: InputFile

Path to the language model file.

treeTaggerExecutable

Mandatory

Path to the tree-tagger executable file.

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

lexiconFile

Optional

Path to a tree-tagger lexicon file. If set, the lexicon is applied to the corpus before tree-tagger processes it.

recordDir

Optional

Path to the directory in which the tree-tagger result files are written (one file per section).

recordFeatures

Optional
Type: String[]

List of the features to write to the result files.
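
A possible configuration that records predictions is sketched below. Both recordDir and recordFeatures must be set for anything to be written. The paths are hypothetical placeholders, and writing the array as a comma-separated list is an assumption about how String[] values are given in a plan, so check it against the AlvisNLP version in use.

```xml
<treetagger class="TreeTagger">
    <parFile>/path/to/treetagger/lib/english.par</parFile>
    <treeTaggerExecutable>/path/to/treetagger/bin/tree-tagger</treeTaggerExecutable>
    <!-- Hypothetical output directory; one result file is written per section -->
    <recordDir>/path/to/treetagger-records</recordDir>
    <!-- n is the annotation ordinal in the section; the comma-separated form is an assumption -->
    <recordFeatures>n,form,pos,lemma</recordFeatures>
</treetagger>
```

The feature names form, pos and lemma are the defaults of formFeature, posFeature and lemmaFeature; if those parameters are changed, the names recorded here would presumably need to change with them.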

documentFilter

Default value: `true`
Type: Expression

Process only documents that satisfy this expression.

formFeature

Default value: `form`
Type: String

Name of the feature denoting the token surface form.

inputCharset

Default value: `ISO-8859-1`
Type: String

Tree-tagger input corpus character set.

lemmaFeature

Default value: `lemma`
Type: String

Name of the feature to set with the lemma.

noUnknownLemma

Default value: `false`
Type: Boolean

Whether to replace unknown lemmas with the surface form.

outputCharset

Default value: `ISO-8859-1`
Type: String

Tree-tagger output character set.
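
Both charset parameters default to ISO-8859-1. If the language model was trained on UTF-8 text, it is probably necessary to override both; the sketch below assumes such a model, and its paths are hypothetical placeholders.

```xml
<treetagger class="TreeTagger">
    <!-- Hypothetical path to a parameter file assumed to be trained on UTF-8 text -->
    <parFile>/path/to/treetagger/lib/english-utf8.par</parFile>
    <treeTaggerExecutable>/path/to/treetagger/bin/tree-tagger</treeTaggerExecutable>
    <!-- Encoding of the input file generated for tree-tagger -->
    <inputCharset>UTF-8</inputCharset>
    <!-- Encoding used to read what tree-tagger writes back -->
    <outputCharset>UTF-8</outputCharset>
</treetagger>
```

Keeping the two values identical is the usual choice, since tree-tagger is not expected to re-encode the text it processes.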

posFeature

Default value: `pos`
Type: String

Name of the feature to set with the POS tag.

recordCharset

Default value: `UTF-8`
Type: String

Character encoding of the result files.

sectionFilter

Default value: `true`
Type: Expression

Process only sections that satisfy this expression.

sentenceLayer

Default value: `sentences`
Type: String

Name of the layer containing the sentence annotations; sentence boundaries are reinforced.

wordLayer

Default value: `words`
Type: String

Name of the layer containing the word annotations.

Deprecated parameters

sentenceLayerName

Deprecated
Type: String

Deprecated alias for sentenceLayer.

wordLayerName

Deprecated
Type: String

Deprecated alias for wordLayer.
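
A plan that still uses the deprecated names can be updated by renaming the elements, as sketched below. The layer names shown are the documented defaults, and the paths are hypothetical placeholders.

```xml
<treetagger class="TreeTagger">
    <parFile>/path/to/treetagger/lib/english.par</parFile>
    <treeTaggerExecutable>/path/to/treetagger/bin/tree-tagger</treeTaggerExecutable>
    <!-- Replaces the deprecated sentenceLayerName -->
    <sentenceLayer>sentences</sentenceLayer>
    <!-- Replaces the deprecated wordLayerName -->
    <wordLayer>words</wordLayer>
</treetagger>
```

Because both values are the defaults, the two elements could also be omitted entirely; they are written out here only to show the current parameter names.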