AlvisNLP

corpus processing engine

TikaReader

Synopsis

Reads PDF or DOC files and adds a document to the corpus for each file.

Description

Snippet

<tikareader class="TikaReader">
    <source></source>
</tikareader>

Mandatory parameters

source

Mandatory

Path to the source directory or source file.
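
As an illustration, the snippet above might be filled in as follows. This is a minimal sketch: the path is a hypothetical placeholder, not a location defined by AlvisNLP.

```xml
<tikareader class="TikaReader">
    <!-- Hypothetical path: point this at a single PDF/DOC file
         or at a directory containing such files. -->
    <source>/path/to/corpus</source>
</tikareader>
```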

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

constantDocumentFeatures

Optional
Type: Mapping

Constant features to add to each document created by this module.

constantSectionFeatures

Optional
Type: Mapping

Constant features to add to each section created by this module.
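
A sketch of how a constant feature could be attached to every document read by the module. It assumes that a Mapping parameter accepts a `key=value` text form; the feature name `collection`, its value `reports`, and the source path are all hypothetical.

```xml
<tikareader class="TikaReader">
    <!-- Hypothetical path -->
    <source>/path/to/corpus</source>
    <!-- Assumption: Mapping given as key=value.
         Every created document would carry the feature collection=reports. -->
    <constantDocumentFeatures>collection=reports</constantDocumentFeatures>
</tikareader>
```

The same form would apply to constantSectionFeatures and constantAnnotationFeatures, under the same assumption.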

baseNameId

Default value: `false`
Type: Boolean

Use the base name of the file as the document identifier, instead of the full absolute path.
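
For example, the following sketch turns the option on. The source path and the file name mentioned in the comment are hypothetical.

```xml
<tikareader class="TikaReader">
    <!-- Hypothetical path -->
    <source>/path/to/corpus</source>
    <!-- With true, a file such as /path/to/corpus/report.pdf would be
         identified by its base name rather than by its absolute path. -->
    <baseNameId>true</baseNameId>
</tikareader>
```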

htmlLayer

Default value: `html`
Type: String

section

Default value: `text`
Type: String

Name of the single section containing the whole contents of a file.

tagFeature

Default value: `tag`
Type: String

Deprecated parameters

htmlLayerName

Deprecated
Type: String

Deprecated alias for htmlLayer.

sectionName

Deprecated
Type: String

Deprecated alias for section.

tagFeatureName

Deprecated
Type: String

Deprecated alias for tagFeature.
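
A plan that still uses the deprecated names can be updated by renaming the elements. The sketch below shows the current names only, with the documented default values; the source path is a hypothetical placeholder.

```xml
<tikareader class="TikaReader">
    <!-- Hypothetical path -->
    <source>/path/to/corpus</source>
    <!-- Replaces the deprecated htmlLayerName -->
    <htmlLayer>html</htmlLayer>
    <!-- Replaces the deprecated sectionName -->
    <section>text</section>
    <!-- Replaces the deprecated tagFeatureName -->
    <tagFeature>tag</tagFeature>
</tikareader>
```

Since these values are the defaults, the three elements could also be left out entirely.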