AlvisNLP

corpus processing engine

TabularProjector

Synopsis

Searches the contents of sections for entries specified in a tabular text file.

Description

TabularProjector reads a list of entries from dictFile and searches for each entry key in the contents of sections. The dictionary contains one entry per line. Each line is split into columns, separated by tab characters unless separator says otherwise. The column specified by keyIndex is the entry key to search for, and the other columns are data associated with the entry.

The parameters skipBlank, skipEmpty, strictColumnNumber, trimColumns, separator and multipleEntryBehaviour control how TabularProjector reads the dictionary file.
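As an illustration, a module that reads a semicolon-separated dictionary while tolerating empty lines and padded columns might be configured as in the sketch below. The file name, layer name and feature names are hypothetical. Writing Boolean, Character and String[] values as plain element text (with commas between array items) is an assumption about the plan syntax, not something this page states.

```xml
<!-- Sketch only: "genes.csv", "genes", "form" and "id" are hypothetical names. -->
<tabularprojector class="TabularProjector">
    <dictFile>genes.csv</dictFile>
    <targetLayer>genes</targetLayer>
    <!-- Columns are separated by ';' instead of the default separator. -->
    <separator>;</separator>
    <!-- Skip empty lines and lines that contain only whitespace. -->
    <skipEmpty>true</skipEmpty>
    <skipBlank>true</skipBlank>
    <!-- Trim leading and trailing whitespace from each column value. -->
    <trimColumns>true</trimColumns>
    <!-- Two feature names, so strictColumnNumber expects two columns per line. -->
    <valueFeatures>form,id</valueFeatures>
</tabularprojector>
```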

The parameters allowJoined, allUpperCaseInsensitive, caseInsensitive, ignoreDiacritics, joinDash, matchStartCaseInsensitive, skipConsecutiveWhitespaces, skipWhitespace and wordStartCaseInsensitive control how the keys can match the contents of sections.
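A lenient matching setup could combine several of these options, as in the following sketch. The file and layer names are hypothetical, and giving Boolean values as the element text `true` is an assumption about the plan syntax. The matching examples in the comments are the ones given in the parameter descriptions on this page.

```xml
<!-- Sketch only: "terms.txt" and "terms" are hypothetical names. -->
<tabularprojector class="TabularProjector">
    <dictFile>terms.txt</dictFile>
    <targetLayer>terms</targetLayer>
    <!-- Allow case folding on all characters. -->
    <caseInsensitive>true</caseInsensitive>
    <!-- The contents "acide amine" can match the key "acide aminé". -->
    <ignoreDiacritics>true</ignoreDiacritics>
    <!-- The contents "aminoacid" can match the key "amino acid". -->
    <allowJoined>true</allowJoined>
    <!-- With joinDash, "aminoacid" can also match the entry "amino-acid". -->
    <joinDash>true</joinDash>
</tabularprojector>
```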

The subject parameter specifies which text of the section should be matched; two alternatives are available.

TabularProjector creates an annotation for each matched key and adds these annotations to the layer specified by targetLayer. The created annotations have features that correspond to the entry columns. Feature names are specified by valueFeatures. For instance, if valueFeatures is [a,b,c], then each annotation will have three features named a, b and c, holding the values of the entry’s first, second and third columns respectively. A feature name left blank in valueFeatures does not create a feature. Thus, to drop the first column of the entry, valueFeatures should be [,b,c]. In addition, the created annotations carry the constant features specified in constantAnnotationFeatures.
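The mapping from columns to features can be sketched as follows for a hypothetical three-column dictionary (surface form, identifier, species). All file, layer and feature names are made up for the example. The comma-separated form of valueFeatures and the key=value form of constantAnnotationFeatures are assumptions about how the plan syntax encodes String[] and Mapping values.

```xml
<!-- Sketch only: "genes.txt" is a hypothetical dictionary with three columns. -->
<tabularprojector class="TabularProjector">
    <dictFile>genes.txt</dictFile>
    <targetLayer>genes</targetLayer>
    <!-- The first column (index 0) is the key to search for. -->
    <keyIndex>0</keyIndex>
    <!-- Blank first name: the first column creates no feature;
         the second and third columns become "id" and "species". -->
    <valueFeatures>,id,species</valueFeatures>
    <!-- Every created annotation also receives the constant feature type=gene. -->
    <constantAnnotationFeatures>type=gene</constantAnnotationFeatures>
</tabularprojector>
```

With this configuration, each annotation in the genes layer would be expected to carry the features id, species and type.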

If trieSource is specified, then TabularProjector assumes that this file contains a compiled version of the dictionary. In this case dictFile is not read.

If trieSink is specified, TabularProjector writes a compiled version of the dictionary to that file. Using a compiled dictionary may speed up processing when the dictionary is large.
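One plausible workflow is to compile the dictionary once and reuse the compiled file in later runs. The two fragments below are alternatives for separate runs, not two modules of the same plan, and the file names (including the .trie extension) are hypothetical.

```xml
<!-- First run: read the text dictionary and also write a compiled copy. -->
<tabularprojector class="TabularProjector">
    <dictFile>genes.txt</dictFile>
    <targetLayer>genes</targetLayer>
    <trieSink>genes.trie</trieSink>
</tabularprojector>

<!-- Later runs: load the compiled copy. dictFile is still given because it
     is listed as a mandatory parameter, but it is not read. -->
<tabularprojector class="TabularProjector">
    <dictFile>genes.txt</dictFile>
    <targetLayer>genes</targetLayer>
    <trieSource>genes.trie</trieSource>
</tabularprojector>
```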

Snippet

<tabularprojector class="TabularProjector">
    <dictFile></dictFile>
    <targetLayer></targetLayer>
</tabularprojector>

Mandatory parameters

dictFile

Mandatory

The dictionary.

targetLayer

Mandatory
Type: String

Name of the layer that contains the match annotations.

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

trieSink

Optional
Type: OutputFile

If set, then TabularProjector writes the compiled dictionary to the specified file.

trieSource

Optional
Type: InputFile

If set, read the compiled dictionary from the specified file. Compiled dictionaries are usually faster for large dictionaries.

valueFeatures

Optional
Type: String[]

Target features in match annotations. The values are the columns in the entry. Ignored if headerLine is set (unless trieSource is set).

allUpperCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on all characters in words that are all upper case.

allowJoined

Default value: `false`
Type: Boolean

If set to true, then allow arbitrary suppression of whitespace characters in the subject. For instance, the contents aminoacid matches the key amino acid.

caseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on all characters.

documentFilter

Default value: `true`
Type: Expression

Only process documents that satisfy this expression.

headerLine

Default value: `false`
Type: Boolean

Assume that the first line of the dictionary is a header; the feature names are taken from the header line. Ignored if trieSource is set.

ignoreDiacritics

Default value: `false`
Type: Boolean

If set to true, then allow diacritic removal on all characters. For instance, the contents acide amine matches the key acide aminé.

joinDash

Default value: `false`
Type: Boolean

If set to true, then treat dash characters (-) as whitespace characters with regard to allowJoined. For instance, the contents aminoacid matches the entry amino-acid.

keyIndex

Default value: `0`
Type: Integer[]

Specifies the index of the column that contains the entry key (0 is the first).

matchStartCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on the first character of the entry key.

multipleEntryBehaviour

Default value: `all`

Specifies the behaviour if the dictionary contains several entries with the same key.

sectionFilter

Default value: `true`
Type: Expression

Process only sections that satisfy this expression.

separator

Default value: `\t` (tab character)
Type: Character

Specifies the character that separates columns in dictFile.

skipBlank

Default value: `false`
Type: Boolean

In dictFile, skip lines that contain only whitespace characters.

skipConsecutiveWhitespaces

Default value: `false`
Type: Boolean

If set to true, then allow the insertion of consecutive whitespace characters in the subject. For instance, the contents amino  acid (with two consecutive spaces) matches the entry amino acid.

skipEmpty

Default value: `false`
Type: Boolean

In dictFile, skip empty lines.

skipWhitespace

Default value: `false`
Type: Boolean

If set to true, then allow arbitrary insertion of whitespace characters in the subject. For instance, the contents amino acid matches the key aminoacid.

strictColumnNumber

Default value: `true`
Type: Boolean

If set to true, then check that every line in dictFile has the same number of columns as the number of features specified in valueFeatures.

subject

Default value: `WORD`
Type: Subject

Specifies the contents to match.

substituteWhitespace

Default value: `false`
Type: Boolean

If set to true, then all whitespace characters match each other (including ‘\n’, ‘\r’, ‘\t’, and non-breaking spaces).

trimColumns

Default value: `false`
Type: Boolean

If set to true, then trim leading and trailing whitespace characters from column values in dictFile.

wordStartCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on the first character of each word.

Deprecated parameters

targetLayerName

Deprecated
Type: String

Deprecated alias for targetLayer.