TikaReader
Synopsis
Reads PDF or DOC files and adds one document to the corpus for each file.
Description
Snippet
<tikareader class="TikaReader">
<source></source>
</tikareader>
Mandatory parameters
source
Mandatory
Type: SourceStream
Path to the source directory or source file.
Optional parameters
constantAnnotationFeatures
Optional
Type: Mapping
Constant features to add to each annotation created by this module.
constantDocumentFeatures
Optional
Type: Mapping
Constant features to add to each document created by this module.
constantSectionFeatures
Optional
Type: Mapping
Constant features to add to each section created by this module.
baseNameId
Default value: `false`
Type: Boolean
Use the filename basename as document identifier, instead of the full absolute path.
htmlLayer
Default value: `html`
Type: String
section
Default value: `text`
Type: String
Name of the single section containing the whole contents of a file.
tagFeature
Default value: `tag`
Type: String
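The optional parameters above are set as child elements of the module, alongside source. The following is a minimal sketch, not a reference configuration: the path, the section name `body`, and the feature `origin=tika` are hypothetical values chosen for illustration, and the `key=value` form used for the Mapping parameter is an assumption about how mappings are written in a plan.

```xml
<tikareader class="TikaReader">
  <!-- Hypothetical directory containing the PDF or DOC files to read -->
  <source>/path/to/documents</source>
  <!-- Use the file basename as document identifier instead of the absolute path -->
  <baseNameId>true</baseNameId>
  <!-- Store the whole contents of each file in a section named "body" (default: text) -->
  <section>body</section>
  <!-- Assumed key=value syntax for a Mapping parameter; the feature name and value are illustrative -->
  <constantDocumentFeatures>origin=tika</constantDocumentFeatures>
</tikareader>
```

With baseNameId set to `true`, two files with the same name in different directories would presumably receive the same identifier, so the default (`false`) may be safer when source is read recursively.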
Deprecated parameters
htmlLayerName
Deprecated
Type: String
Deprecated alias for htmlLayer.
sectionName
Deprecated
Type: String
Deprecated alias for section.
tagFeatureName
Deprecated
Type: String
Deprecated alias for tagFeature.
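Since each deprecated parameter is an alias, migrating an existing plan should only require renaming the element. A sketch of the renamed form, assuming a plan that previously set sectionName, htmlLayerName and tagFeatureName (the path and the values shown are hypothetical, the latter two simply restating the documented defaults):

```xml
<tikareader class="TikaReader">
  <!-- Hypothetical source path -->
  <source>/path/to/documents</source>
  <!-- Formerly <sectionName>body</sectionName> -->
  <section>body</section>
  <!-- Formerly <htmlLayerName>html</htmlLayerName> -->
  <htmlLayer>html</htmlLayer>
  <!-- Formerly <tagFeatureName>tag</tagFeatureName> -->
  <tagFeature>tag</tagFeature>
</tikareader>
```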