TreeTagger
Synopsis
Runs tree-tagger.
Description
TreeTagger runs tree-tagger on the annotations in wordLayer by generating an appropriate input file. This file contains one line per annotation. The first column, the token surface form, is the value of the formFeature feature. The second column, the token's predefined POS tag, is the value of the posFeature feature. The third column, the token's predefined lemma, is the value of the lemmaFeature feature. If posFeature or lemmaFeature is not defined, the second and third columns are left blank.
The tree-tagger binary is specified by treeTaggerExecutable, and the language model is specified by parFile. Additionally, a lexicon file can be given through lexiconFile.
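As an illustration of the three-column layout described above, here is a minimal Python sketch. It is not the module's implementation: the function name `build_input_lines`, the dictionary representation of annotations, and the tab separator are all assumptions made for the example. It also follows a literal reading of the rule on blank columns: if either feature name is missing, both columns are left empty.

```python
def build_input_lines(annotations, form_feature, pos_feature=None, lemma_feature=None):
    """Return one line per annotation: form, predefined POS, predefined lemma.

    Hypothetical helper: annotations are plain dicts mapping feature names
    to values, and columns are tab-separated (an assumption).
    """
    # Literal reading of the description: both extra columns stay blank
    # unless both feature names are defined.
    use_predefined = pos_feature is not None and lemma_feature is not None
    lines = []
    for ann in annotations:
        form = ann.get(form_feature, "")
        pos = ann.get(pos_feature, "") if use_predefined else ""
        lemma = ann.get(lemma_feature, "") if use_predefined else ""
        lines.append("\t".join([form, pos, lemma]))
    return lines
```

For instance, a token with the features form=dogs, pos=NNS, lemma=dog yields the line `dogs<TAB>NNS<TAB>dog`, while a token carrying only a surface form yields the form followed by two empty columns.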
If sentenceLayer is defined, TreeTagger treats the annotations in this layer as sentences. Sentence boundaries are reinforced by passing tree-tagger an additional end-of-sentence marker.
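The idea of reinforcing sentence boundaries can be sketched as follows. This is only an illustration under assumptions: the function name `lines_with_sentence_markers` is hypothetical, the marker is assumed to be placed after the last token of each sentence, and the actual marker string used by the module is not stated here, so the caller has to supply one.

```python
def lines_with_sentence_markers(sentences, eos_marker):
    """Flatten sentences into input lines, adding a marker after each one.

    `sentences` is a list of sentences, each a list of already formatted
    token lines. `eos_marker` is caller-supplied because the real marker
    string is not documented in this page (assumption).
    """
    lines = []
    for sentence in sentences:
        lines.extend(sentence)    # token lines of this sentence, in order
        lines.append(eos_marker)  # explicit boundary for the tagger
    return lines
```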
Once tree-tagger has processed the corpus, TreeTagger adds the predicted POS tag and lemma to the respective posFeature and lemmaFeature features of the corresponding annotations.
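The write-back step, together with the noUnknownLemma option listed under the optional parameters, might look like the sketch below. It assumes that the output contains one tab-separated line per token (form, POS tag, lemma) in the same order as the input, and that annotations are plain dictionaries; the function name `apply_predictions` is hypothetical. The `<unknown>` string is the lemma that tree-tagger prints for words it cannot lemmatize.

```python
UNKNOWN_LEMMA = "<unknown>"  # lemma printed by tree-tagger for unknown words


def apply_predictions(annotations, output_lines, form_feature, pos_feature,
                      lemma_feature, no_unknown_lemma=False):
    """Copy predicted POS tags and lemmas onto the matching annotations.

    Assumes one output line per annotation, in the same order, with three
    tab-separated columns: form, POS tag, lemma.
    """
    for ann, line in zip(annotations, output_lines):
        _form, pos, lemma = line.rstrip("\n").split("\t")
        if no_unknown_lemma and lemma == UNKNOWN_LEMMA:
            # Fall back to the surface form instead of keeping "<unknown>".
            lemma = ann[form_feature]
        ann[pos_feature] = pos
        ann[lemma_feature] = lemma
    return annotations
```

With `no_unknown_lemma=True`, a token whose predicted lemma is `<unknown>` receives its own surface form as lemma; otherwise the `<unknown>` string is stored unchanged.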
If recordDir and recordFeatures are both defined, the tree-tagger predictions are written to the recordDir directory, one file per section. recordFeatures is an array of feature names to record. An additional feature, n, is recognized as the ordinal of the annotation within the section.
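To make the role of recordFeatures and the special feature n concrete, the sketch below builds the rows of one section's record file. The function name `record_rows`, the tab-separated layout, and the `first_ordinal` parameter are assumptions for the example; in particular, whether the ordinal starts at 0 or 1 is not specified here.

```python
def record_rows(annotations, record_features, first_ordinal=0):
    """Build one row per annotation for a single section's record file.

    Each name in `record_features` becomes a column. The special name "n"
    is replaced by the annotation's ordinal within the section. The
    starting ordinal and the tab separator are assumptions.
    """
    rows = []
    for n, ann in enumerate(annotations, start=first_ordinal):
        cells = [str(n) if name == "n" else ann.get(name, "")
                 for name in record_features]
        rows.append("\t".join(cells))
    return rows
```

Calling it with the feature list `["n", "form", "pos"]` produces three columns per row: the ordinal, the surface form, and the POS tag.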
Snippet
<treetagger class="TreeTagger">
<parFile></parFile>
<treeTaggerExecutable></treeTaggerExecutable>
</treetagger>
Mandatory parameters
parFile
Path to the language model file.
treeTaggerExecutable
Path to the tree-tagger executable file.
Optional parameters
constantAnnotationFeatures
Constant features to add to each annotation created by this module.
lexiconFile
Path to a tree-tagger lexicon file. If set, the lexicon is applied to the corpus before tree-tagger processes it.
recordDir
Path to the directory in which to write tree-tagger result files (one file per section).
recordFeatures
List of feature names to write to the result files.
documentFilter
Only process documents that satisfy this expression.
formFeature
Name of the feature denoting the token surface form.
inputCharset
Tree-tagger input corpus character set.
lemmaFeature
Name of the feature to set with the lemma.
noUnknownLemma
Whether to replace unknown lemmas with the surface form.
outputCharset
Tree-tagger output character set.
posFeature
Name of the feature to set with the POS tag.
recordCharset
Character encoding of the result files.
sectionFilter
Process only sections that satisfy this expression.
sentenceLayer
Name of the layer containing sentence annotations. If set, sentence boundaries are reinforced.
wordLayer
Name of the layer containing the word annotations.
Deprecated parameters
sentenceLayerName
Deprecated alias for sentenceLayer.
wordLayerName
Deprecated alias for wordLayer.