TabularProjector
Synopsis
Searches section contents for entries listed in a tabular text file.
Description
TabularProjector reads a list of entries from dictFile and searches for each entry key in the section contents. The dictionary contains one entry per line. Each line is split into columns separated by tab characters (the separator parameter sets a different character). The column specified by keyIndex is the entry key to search for; the other columns hold data associated with the entry.
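The column splitting described above can be pictured with a minimal Python sketch. It is not the module's code: the function name parse_entry and the sample identifiers are hypothetical, and only the roles of keyIndex and separator come from this page.

```python
def parse_entry(line, key_index=0, separator="\t"):
    """Split one dictionary line into columns and pick out the entry key.

    key_index and separator mirror the keyIndex and separator parameters;
    the function itself is a hypothetical illustration.
    """
    columns = line.rstrip("\r\n").split(separator)
    return columns[key_index], columns


# Made-up sample line: the key is in the first column (index 0).
key, columns = parse_entry("amino acid\tE001\tchemical\n")
```

With key_index=1, the same function would use the second column as the key and keep the other columns as associated data.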
The parameters skipBlank, skipEmpty, strictColumnNumber, trimColumns, separator and multipleEntryBehaviour control how TabularProjector reads the dictionary file.
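As a rough illustration of how these reading options might combine, here is a Python sketch based only on the parameter descriptions on this page. The function read_dictionary is hypothetical. Two behaviours are assumptions: a blank line is taken to be a non-empty line made only of whitespace, and a column-count mismatch raises an error (this page only says the count is checked).

```python
def read_dictionary(lines, n_features=None, separator="\t",
                    skip_blank=False, skip_empty=False,
                    trim_columns=False, strict_column_number=False):
    """Return the list of column lists read from dictionary lines."""
    entries = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # skipEmpty: drop lines with no characters at all.
        if skip_empty and line == "":
            continue
        # skipBlank: drop lines made only of whitespace
        # (assumption: empty lines are left to skipEmpty).
        if skip_blank and line != "" and line.strip() == "":
            continue
        columns = line.split(separator)
        # trimColumns: strip leading and trailing whitespace in each column.
        if trim_columns:
            columns = [c.strip() for c in columns]
        # strictColumnNumber: compare with the number of value features
        # (raising here is an assumption about how a mismatch is reported).
        if (strict_column_number and n_features is not None
                and len(columns) != n_features):
            raise ValueError(
                f"line {line_no}: expected {n_features} columns, "
                f"got {len(columns)}")
        entries.append(columns)
    return entries
```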
The parameters allowJoined, allUpperCaseInsensitive, caseInsensitive, ignoreDiacritics, joinDash, matchStartCaseInsensitive, skipConsecutiveWhitespaces, skipWhitespace and wordStartCaseInsensitive control how the keys can match the section contents.
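To make a few of these relaxations concrete, the following Python sketch reproduces the examples given in the parameter descriptions (aminoacid matching amino acid, acide amine matching acide aminé). It uses regular expressions purely for illustration and says nothing about how the module is implemented. The functions strip_diacritics, key_pattern and matches are hypothetical names.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_pattern(key, allow_joined=False, join_dash=False,
                case_insensitive=False, ignore_diacritics=False):
    """Build a regular expression that accepts relaxed forms of a key."""
    if ignore_diacritics:
        key = strip_diacritics(key)
    parts = []
    for ch in key:
        if allow_joined and (ch.isspace() or (join_dash and ch == "-")):
            # allowJoined: the subject may omit this whitespace;
            # joinDash extends the same tolerance to dashes.
            parts.append(re.escape(ch) + "?")
        else:
            parts.append(re.escape(ch))
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile("".join(parts), flags)


def matches(key, span, **options):
    """Tell whether a candidate span of the subject matches the key."""
    if options.get("ignore_diacritics"):
        span = strip_diacritics(span)
    return key_pattern(key, **options).fullmatch(span) is not None
```

For example, matches("amino-acid", "aminoacid", allow_joined=True, join_dash=True) holds, whereas the same call without join_dash does not.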
The subject parameter specifies which text of the section should be matched. There are two alternatives:
- the entries are matched against the contents of the section (the default); in this case subject can also control whether match boundaries must coincide with word delimiters;
- the entries are matched against the values of a specified feature of the annotations in a given layer, joined by a whitespace character; in this way entries can be searched against word lemmas, for instance.
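The second alternative can be sketched in Python as follows. The annotations are represented as plain dictionaries and the feature names form and lemma are made up for the example; the only thing taken from the description above is that the feature values are joined with a whitespace character before matching.

```python
def feature_subject(annotations, feature):
    """Join the values of one feature, in annotation order, with spaces."""
    return " ".join(annotation[feature] for annotation in annotations)


# Hypothetical word annotations carrying a surface form and a lemma.
tokens = [
    {"form": "Proteins", "lemma": "protein"},
    {"form": "were", "lemma": "be"},
    {"form": "purified", "lemma": "purify"},
]

# The dictionary keys would be searched in this string, not in the raw text.
subject = feature_subject(tokens, "lemma")
```

In this setting a key such as protein can match even though the section text contains the plural form.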
TabularProjector creates an annotation for each matched key and adds these annotations to the layer specified by targetLayer. The created annotations have features that correspond to the entry columns. Feature names are specified by valueFeatures. For instance, if valueFeatures is [a,b,c], then each annotation will have three features named a, b and c, whose values are the entry's first, second and third columns respectively. A name left blank in valueFeatures creates no feature. Thus, in order to drop the first column of the entry, valueFeatures should be [,b,c]. In addition, the created annotations have the constant features specified in constantAnnotationFeatures.
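The mapping from columns to features can be sketched in a few lines of Python. The function entry_features is hypothetical, and the order in which constant features are applied relative to column features is an assumption; the example avoids any name clash so the result does not depend on it.

```python
def entry_features(columns, value_features, constant_features=None):
    """Map entry columns to feature names, skipping blank names."""
    features = {
        name: value
        for name, value in zip(value_features, columns)
        if name  # a blank name in valueFeatures drops that column
    }
    # constantAnnotationFeatures are added to every annotation
    # (assumption: applied after the column features).
    features.update(constant_features or {})
    return features


entry = ["x", "y", "z"]
all_columns = entry_features(entry, ["a", "b", "c"])
without_first = entry_features(entry, ["", "b", "c"], {"source": "dict"})
```

Here all_columns carries the three features a, b and c, while without_first carries only b and c plus the constant feature source.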
If trieSource is specified, then TabularProjector assumes that this file contains a compiled version of the dictionary. In this case dictFile is not read.
If trieSink is specified, TabularProjector writes a compiled version of the dictionary to that file. Using a compiled dictionary may speed up processing for large dictionaries.
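The parameter names suggest that the compiled form is a trie. Assuming so, the following Python sketch shows the general idea of a character trie built from the entry keys and used to find every key occurrence in a text. It is an in-memory illustration only: the on-disk format read by trieSource and written by trieSink is not described here, and build_trie and find_matches are hypothetical names.

```python
def build_trie(keys):
    """Build a nested-dictionary trie; None marks the end of a key."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = key
    return root


def find_matches(trie, text):
    """Return (start, end, key) for every key occurrence in text."""
    found = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no key continues with this character
            if None in node:
                found.append((start, end + 1, node[None]))
    return found


trie = build_trie(["amino", "amino acid"])
hits = find_matches(trie, "an amino acid")
```

Because keys that share a prefix share a path in the trie, all of them are tested in a single pass from each start position, which is the usual reason such a structure pays off for large dictionaries.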
Snippet
<tabularprojector class="TabularProjector">
<dictFile></dictFile>
<targetLayer></targetLayer>
</tabularprojector>
Mandatory parameters
dictFile
The dictionary.
targetLayer
Name of the layer that contains the match annotations.
Optional parameters
constantAnnotationFeatures
Constant features to add to each annotation created by this module.
trieSink
If set, then TabularProjector writes the compiled dictionary to the specified file.
trieSource
If set, read the compiled dictionary from the specified file. Compiled dictionaries are usually faster for large dictionaries.
valueFeatures
Target features in match annotations. The values are the columns in the entry. Ignored if headerLine is set (unless trieSource is set).
allUpperCaseInsensitive
If set to true, then allow case folding on all characters in words that are all upper case.
allowJoined
If set to true, then allow arbitrary suppression of whitespace characters in the subject. For instance, the contents aminoacid matches the key amino acid.
caseInsensitive
If set to true, then allow case folding on all characters.
documentFilter
Process only documents that satisfy this expression.
headerLine
Assume the first line of the dictionary is a header; the feature names are taken from the header line. Ignored if trieSource is set.
ignoreDiacritics
If set to true, then allow diacritic removal on all characters. For instance, the contents acide amine matches the key acide aminé.
joinDash
If set to true, then treat dash characters (-) as whitespace characters with regard to allowJoined. For instance, the contents aminoacid matches the entry amino-acid.
keyIndex
Specifies the index of the column that contains the entry key (0 is the first column).
matchStartCaseInsensitive
If set to true, then allow case folding on the first character of the entry key.
multipleEntryBehaviour
Specifies the behavior when the dictionary contains several entries with the same key.
sectionFilter
Process only sections that satisfy this expression.
separator
Specifies the character that separates columns in dictFile.
skipBlank
In dictFile, skip lines that contain only whitespace characters.
skipConsecutiveWhitespaces
If set to true, then allow the insertion of consecutive whitespace characters in the subject. For instance, the contents amino acid written with several consecutive spaces between the two words matches the entry amino acid.
skipEmpty
In dictFile, skip empty lines.
skipWhitespace
If set to true, then allow arbitrary insertion of whitespace characters in the subject. For instance, the contents amino acid matches the key aminoacid.
strictColumnNumber
If set to true, then check that every line in dictFile has as many columns as there are features specified in valueFeatures.
subject
Specifies the contents to match.
substituteWhitespace
If set to true, then all whitespace characters match each other (including ‘\n’, ‘\r’, ‘\t’, and non-breaking spaces).
trimColumns
If set to true, then trim leading and trailing whitespace characters from column values in dictFile.
wordStartCaseInsensitive
If set to true, then allow case folding on the first character of each word.
Deprecated parameters
targetLayerName
Deprecated alias for targetLayer.