AlvisNLP

corpus processing engine

XLSProjector

Synopsis

Projects the rows of XLS or XLSX files onto sections.

This module is experimental.

Description

XLSProjector reads xlsFile, a file in Microsoft Excel XLS or XLSX format, and searches the sections for the entries found in its rows.

The parameters allowJoined, allUpperCaseInsensitive, caseInsensitive, ignoreDiacritics, joinDash, matchStartCaseInsensitive, skipConsecutiveWhitespaces, skipWhitespace and wordStartCaseInsensitive control the matching between the section and the entry keys.
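As a sketch of how these options might be combined, the fragment below relaxes case and diacritic matching and leaves the other options at their defaults. The module identifier `relaxed-lexicon`, the file path, the layer name and the feature name are hypothetical, and the `true` literal follows the wording of the parameter descriptions on this page rather than a verified syntax.

```xml
<relaxed-lexicon class="XLSProjector">
    <!-- hypothetical path, layer name and feature name -->
    <xlsFile>lexicon.xlsx</xlsFile>
    <targetLayer>terms</targetLayer>
    <valueFeatures>form</valueFeatures>
    <!-- case folding allowed on all characters -->
    <caseInsensitive>true</caseInsensitive>
    <!-- "acide amine" in the text would match the key "acide aminé" -->
    <ignoreDiacritics>true</ignoreDiacritics>
</relaxed-lexicon>
```

Each relaxed option widens what counts as a match, so it is probably worth enabling them one at a time and checking the resulting annotations.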

The subject parameter specifies which text of the section should be matched. There are two options:

XLSProjector creates an annotation for each matched row and adds these annotations to the layer named targetLayer. The created annotations will have features whose keys correspond to valueFeatures and whose values are the data associated with the matched entry (columns in the XLS file). For instance, if valueFeatures is [a,b,c], then each annotation will have three features named a, b and c with the respective values of the entry’s second, third and fourth columns. A feature name left blank in valueFeatures will not create a feature. Thus, in order not to keep the entry in the a feature, valueFeatures should be [,b,c]. In addition, the created annotations will have the feature keys and values defined in constantAnnotationFeatures.
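To make the blank-name convention concrete, here is a minimal sketch. It assumes that a String[] value is written as a comma-separated list inside the element, which this page does not confirm. The module identifier `lexicon`, the file path, the layer name and the feature names `category` and `identifier` are hypothetical.

```xml
<lexicon class="XLSProjector">
    <!-- hypothetical path and layer name -->
    <xlsFile>lexicon.xlsx</xlsFile>
    <targetLayer>terms</targetLayer>
    <!-- the first name is left blank, so no feature is created for that position;
         the next two positions are stored in "category" and "identifier" -->
    <valueFeatures>,category,identifier</valueFeatures>
</lexicon>
```

With this setting each annotation in `terms` would carry two features instead of three, which is the [,b,c] case described above.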

If trieSource is specified, XLSProjector assumes that it contains a compiled version of the dictionary, and xlsFile is not read. If trieSink is specified, XLSProjector writes a compiled version of the dictionary to it. Using a compiled dictionary may speed up processing for large dictionaries.
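One way to use the two parameters together is a compile-once, reuse-later setup, sketched below as two separate module declarations that would live in different plans or runs. The module identifiers, the file paths and the layer and feature names are hypothetical.

```xml
<!-- first run: read the spreadsheet and write the compiled dictionary -->
<compile-lexicon class="XLSProjector">
    <xlsFile>lexicon.xlsx</xlsFile>
    <targetLayer>terms</targetLayer>
    <valueFeatures>form,category</valueFeatures>
    <trieSink>lexicon.trie</trieSink>
</compile-lexicon>

<!-- later runs: read the compiled dictionary instead of the spreadsheet -->
<use-lexicon class="XLSProjector">
    <!-- xlsFile is kept because it is listed as mandatory; it is not read when trieSource is set -->
    <xlsFile>lexicon.xlsx</xlsFile>
    <targetLayer>terms</targetLayer>
    <valueFeatures>form,category</valueFeatures>
    <trieSource>lexicon.trie</trieSource>
</use-lexicon>
```

Since xlsFile is not read in the second case, the compiled file presumably has to be regenerated whenever the spreadsheet changes.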

Snippet

<xlsprojector class="XLSProjector">
    <targetLayer></targetLayer>
    <valueFeatures></valueFeatures>
    <xlsFile></xlsFile>
</xlsprojector>

Mandatory parameters

targetLayer

Mandatory
Type: String

Name of the layer that contains the match annotations.

valueFeatures

Mandatory
Type: String[]

Target features in match annotations. The values are the columns in the matched entry line.

xlsFile

Mandatory

Path to the source XLS files.

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.
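A sketch of a possible setting follows. It assumes that a Mapping value is written as `key=value` pairs inside the element, which this page does not state. The feature key `source`, its value, and the other names are hypothetical.

```xml
<lexicon class="XLSProjector">
    <!-- hypothetical path, layer name and feature names -->
    <xlsFile>lexicon.xlsx</xlsFile>
    <targetLayer>terms</targetLayer>
    <valueFeatures>form,category</valueFeatures>
    <!-- assumed key=value syntax: every created annotation gets source = "lexicon" -->
    <constantAnnotationFeatures>source=lexicon</constantAnnotationFeatures>
</lexicon>
```

A constant feature of this kind can help tell apart annotations produced by several projectors that write to the same layer.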

trieSink

Optional
Type: OutputFile

If set, then XLSProjector writes the compiled dictionary to the specified file.

trieSource

Optional
Type: InputFile

If set, then XLSProjector reads the compiled dictionary from the specified file. Reading a compiled dictionary is usually faster when the dictionary is large.

allUpperCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on all characters in words that are all upper case.

allowJoined

Default value: `false`
Type: Boolean

If set to true, then allow arbitrary suppression of whitespace characters in the subject. For instance, the contents aminoacid matches the key amino acid.

caseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on all characters.

documentFilter

Default value: `true`
Type: Expression

Only process documents that satisfy this expression.

headerRow

Default value: `false`
Type: Boolean

Whether to skip the first row of each sheet.

ignoreDiacritics

Default value: `false`
Type: Boolean

If set to true, then allow diacritic removal on all characters. For instance, the contents acide amine matches the key acide aminé.

joinDash

Default value: `false`
Type: Boolean

If set to true, then treat dash characters (-) as whitespace characters with regard to allowJoined. For instance, the contents aminoacid matches the entry amino-acid.

keyIndex

Default value: `0`
Type: Integer[]

Specifies the key column index (starting at 0).

matchStartCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on the first character of the entry key.

multipleEntryBehaviour

Default value: `all`

Specifies the behavior if the lexicon contains several entries with the same key.

sectionFilter

Default value: `true`
Type: Expression

Process only sections that satisfy this expression.

sheets

Default value: `0`
Type: Integer[]

Index of the sheets to apply (starting at 0).
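The spreadsheet-layout parameters headerRow, keyIndex and sheets might be combined as in the sketch below, for a hypothetical workbook whose first two sheets each start with a title row and hold the key in the first column. The comma-separated form of the Integer[] value is an assumption, as are the module identifier, the file path and the layer and feature names.

```xml
<lexicon class="XLSProjector">
    <!-- hypothetical path, layer name and feature names -->
    <xlsFile>lexicon.xlsx</xlsFile>
    <targetLayer>terms</targetLayer>
    <valueFeatures>form,category</valueFeatures>
    <!-- skip the first row of each sheet -->
    <headerRow>true</headerRow>
    <!-- read the first and second sheets (indexes start at 0) -->
    <sheets>0,1</sheets>
    <!-- the key is in the first column (indexes start at 0) -->
    <keyIndex>0</keyIndex>
</lexicon>
```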

skipConsecutiveWhitespaces

Default value: `false`
Type: Boolean

If set to true, then allow the insertion of consecutive whitespace characters in the subject. For instance, the contents amino acid written with two or more consecutive spaces between the words matches the entry amino acid.

skipWhitespace

Default value: `false`
Type: Boolean

If set to true, then allow arbitrary insertion of whitespace characters in the subject. For instance, the contents amino acid matches the key aminoacid.

subject

Default value: `WORD`
Type: Subject

Specifies the contents to match.

substituteWhitespace

Default value: `false`
Type: Boolean

If set to true, then all whitespace characters match each other (including ‘\n’, ‘\r’, ‘\t’, and non-breaking spaces).

wordStartCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on the first character of each word.

Deprecated parameters

targetLayerName

Deprecated
Type: String

Deprecated alias for targetLayer.