AlvisNLP

corpus processing engine

TreeTaggerReader

Synopsis

Reads files in tree-tagger output format and creates a document for each file read.

Description

Each document contains a single section named sectionName; its contents are constructed by concatenating the first column of each token, separated by a space character.

TreeTaggerReader keeps the tree-tagger tokenization as annotations added to the layer wordLayer. The POS tag and lemma are recorded in the annotation’s posFeature and lemmaFeature features, respectively.

The document identifier is the path of the corresponding file.
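
As a rough illustration of the construction described above, the following Python sketch builds the section contents and the word annotations from tree-tagger output. It is not AlvisNLP code: the function `read_treetagger`, the class `WordAnnotation`, and the feature names `pos` and `lemma` are hypothetical. It assumes each line holds three tab-separated columns (token, POS tag, lemma) and skips lines with fewer columns. Leaving out a feature when no feature name is given is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class WordAnnotation:
    """Hypothetical stand-in for an annotation in the word layer."""
    start: int  # offset of the first character in the section contents
    end: int    # offset just past the last character
    features: dict = field(default_factory=dict)


def read_treetagger(text, pos_feature=None, lemma_feature=None):
    """Return (section contents, word annotations) for tree-tagger output.

    Assumption: columns are token, POS tag, lemma, separated by tabs.
    """
    forms = []
    words = []
    offset = 0
    for line in text.splitlines():
        columns = line.split("\t")
        if len(columns) < 3:
            continue  # assumption: ignore lines without the three columns
        token, pos, lemma = columns[:3]
        if forms:
            offset += 1  # the single space that separates consecutive tokens
        start = offset
        offset += len(token)
        features = {}
        if pos_feature is not None:
            features[pos_feature] = pos
        if lemma_feature is not None:
            features[lemma_feature] = lemma
        forms.append(token)
        words.append(WordAnnotation(start, offset, features))
    # Section contents: first column of each token, joined with spaces.
    return " ".join(forms), words


sample = "The\tDT\tthe\ncat\tNN\tcat\nsleeps\tVBZ\tsleep\n.\tSENT\t.\n"
contents, words = read_treetagger(sample, pos_feature="pos", lemma_feature="lemma")
print(contents)
for word in words:
    print(word.start, word.end, contents[word.start:word.end], word.features)
```

Because tokens are joined with exactly one space, each annotation's offsets can be computed while reading, and slicing the contents with those offsets gives back the original token.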

Snippet

<treetaggerreader class="TreeTaggerReader">
    <source></source>
</treetaggerreader>

Mandatory parameters

source

Mandatory

Path to the source directory or source file.

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

constantDocumentFeatures

Optional
Type: Mapping

Constant features to add to each document created by this module.

constantSectionFeatures

Optional
Type: Mapping

Constant features to add to each section created by this module.

lemmaFeature

Optional
Type: String

Name of the feature in which to store word lemmas.

posFeature

Optional
Type: String

Name of the feature in which to store word POS tags.

charset

Default value: `UTF-8`
Type: String

Character set of input files.

sectionName

Default value: `text`
Type: String

Name of the section of each document.

sentenceLayer

Default value: `sentences`
Type: String

Name of the layer in which to store sentence annotations.

wordLayer

Default value: `words`
Type: String

Name of the layer in which to store word annotations.

Deprecated parameters

lemmaFeatureKey

Deprecated
Type: String

Deprecated alias for lemmaFeature.

posFeatureKey

Deprecated
Type: String

Deprecated alias for posFeature.

sentenceLayerName

Deprecated
Type: String

Deprecated alias for sentenceLayer.

sourcePath

Deprecated

Alias for source. Use source instead.

wordLayerName

Deprecated
Type: String

Deprecated alias for wordLayer.