AlvisNLP

corpus processing engine

TokenizedReader

Synopsis

Reads a tokenized corpus: one token per line, with an empty line separating sentences.

Description

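Reads a tokenized corpus in which each line holds one token and an empty line marks the end of a sentence. The tokenized text is placed in the section named by `section`, with tokens annotated in the `tokenLayer` layer and sentences in the `sentenceLayer` layer.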
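To make the expected input format concrete, here is a minimal Python sketch of how such a file breaks down into sentences. It is an illustration only, not AlvisNLP's implementation: the function name `read_tokenized` and the sample text are hypothetical, and the sketch assumes that each non-empty line holds only the token's surface form and that consecutive empty lines count as a single sentence break.

```python
from typing import Iterable, List


def read_tokenized(lines: Iterable[str]) -> List[List[str]]:
    """Group one-token-per-line input into sentences split on empty lines."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        token = raw.strip()
        if token:
            current.append(token)
        elif current:
            # An empty line closes the sentence in progress.
            # Assumption: repeated empty lines do not create empty sentences.
            sentences.append(current)
            current = []
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(current)
    return sentences


# Hypothetical sample input: two sentences separated by one empty line.
sample = "The\ncat\nsleeps\n.\n\nIt\npurrs\n.\n"
sentences = read_tokenized(sample.splitlines())
print(sentences)
```

In this sketch each inner list corresponds to one annotation in the sentence layer, and each string to one annotation in the token layer.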

Snippet

<tokenizedreader class="TokenizedReader">
    <source></source>
</tokenizedreader>

Mandatory parameters

source

Mandatory

Path to the file or directory containing the tokenized text.

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

constantDocumentFeatures

Optional
Type: Mapping

Constant features to add to each document created by this module.

constantSectionFeatures

Optional
Type: Mapping

Constant features to add to each section created by this module.

section

Default value: `text`
Type: String

Name of the section containing the tokenized text.

sentenceLayer

Default value: `sentences`
Type: String

Name of the sentence layer.

tokenLayer

Default value: `words`
Type: String

Name of the token layer.

Deprecated parameters

sectionName

Deprecated
Type: String

Deprecated alias for `section`.

sentenceLayerName

Deprecated
Type: String

Deprecated alias for `sentenceLayer`.

tokenLayerName

Deprecated
Type: String

Deprecated alias for `tokenLayer`.