AlvisNLP

corpus processing engine

TextFileReader

Synopsis

Reads files and adds a document in the corpus for each file.

Description

TextFileReader reads file(s) from source and creates a document in the corpus for each file. The identifier of the created document is the absolute path of the corresponding file, or its base name if baseNameId is set to true. Each created document has a single section, named after the value of section (text by default), whose content is the content of the corresponding file.

If source is a path to a file, then TextFileReader will read this file. If source is a path to a directory, then TextFileReader will read the files in this directory.

If linesLimit is set, then TextFileReader splits each file into several documents of at most linesLimit lines each. For instance, if linesLimit is set to 10 and a file contains 25 lines, then 3 documents are created: two containing 10 lines and one containing the last 5 lines.
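The splitting example above could be configured as in the following sketch. It assumes that optional parameters are written as child elements named after the parameter, in the same way as source in the Snippet section; the path corpus/ is hypothetical.

```xml
<!-- Sketch only: split every input file into documents of at most 10 lines. -->
<textfilereader class="TextFileReader">
    <!-- hypothetical directory containing the text files -->
    <source>corpus/</source>
    <!-- a 25-line file would yield documents of 10, 10 and 5 lines -->
    <linesLimit>10</linesLimit>
</textfilereader>
```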

All files are read using the same character encoding, specified by charset.
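As an illustration, the sketch below reads files that are not encoded in UTF-8. It assumes that charset accepts a standard charset name such as ISO-8859-1 and that the parameter is written as a child element; the path corpus/ is hypothetical.

```xml
<!-- Sketch only: read all files in the directory as ISO-8859-1 instead of the UTF-8 default. -->
<textfilereader class="TextFileReader">
    <!-- hypothetical directory containing the text files -->
    <source>corpus/</source>
    <!-- the same encoding is applied to every file read by this module -->
    <charset>ISO-8859-1</charset>
</textfilereader>
```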

The created documents will all have the features defined in constantDocumentFeatures. The single section of each document will have the features defined in constantSectionFeatures.

Snippet

<textfilereader class="TextFileReader">
    <source></source>
</textfilereader>

Mandatory parameters

source

Mandatory

Path to the source directory or source file.

Optional parameters

constantDocumentFeatures

Optional
Type: Mapping

Constant features to add to each document created by this module.

constantSectionFeatures

Optional
Type: Mapping

Constant features to add to each section created by this module.

linesLimit

Optional
Type: Integer

Maximum number of lines per document.

sizeLimit

Optional
Type: Integer

Maximum number of characters per document. No limit if not set.

baseNameId

Default value: `false`
Type: Boolean

Use the base name of the file instead of its full path as the document identifier.
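A minimal sketch of enabling this option, assuming that the boolean value is written as the text true inside a child element named after the parameter; the path corpus/ is hypothetical.

```xml
<!-- Sketch only: identify documents by file base name rather than by absolute path. -->
<textfilereader class="TextFileReader">
    <!-- hypothetical directory containing the text files -->
    <source>corpus/</source>
    <!-- the default is false, which keeps the absolute path as identifier -->
    <baseNameId>true</baseNameId>
</textfilereader>
```

Base names are only unique within one directory, so this setting is safest when source does not mix files that share a name.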

charset

Default value: `UTF-8`
Type: String

Character set of the input files.

section

Default value: `text`
Type: String

Name of the single section containing the whole contents of a file.

Deprecated parameters

sectionName

Deprecated
Type: String

Deprecated alias for section. Use section instead.

sourcePath

Deprecated

Deprecated alias for source. Use source instead.