YateaTermsProjector
Synopsis
Searches section contents for terms extracted by YaTeA (see YateaExtractor).
Description
YateaTermsProjector reads terms from a YaTeA XML output file produced by YateaExtractor and searches for them in section contents, or in whatever text is specified by subject.
The parameters allowJoined, allUpperCaseInsensitive, caseInsensitive, ignoreDiacritics, joinDash, matchStartCaseInsensitive, skipConsecutiveWhitespaces, skipWhitespace and wordStartCaseInsensitive control how the keys can match the section contents.
The subject parameter specifies which text of the section should be matched. There are two alternatives:
- the entries are matched on the contents of the section (the default); subject can also control whether match boundaries must coincide with word delimiters;
- the entries are matched on the value of a specified feature of the annotations in a given layer, separated by whitespace; in this way entries can be searched against word lemmas, for instance.
YateaTermsProjector creates an annotation for each matched key and adds these annotations to the layer specified by targetLayer. Term structure information can be recorded in the features specified by termIdFeature, headFeature, monoHeadIdFeature, modifierFeature, and termPosFeature. In addition, the created annotations will have the constant features specified in constantAnnotationFeatures.
trieSource and trieSink are not supported by YateaTermsProjector.
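As an illustration of the feature-recording parameters described above, the following sketch fills in the mandatory parameters and asks the module to record term structure. The layer name, the file path, and every feature name are hypothetical values chosen for this example; only the parameter names come from this page.

```xml
<!-- Sketch only: all element values below are hypothetical. -->
<yateatermsprojector class="YateaTermsProjector">
  <!-- layer that receives one annotation per matched term -->
  <targetLayer>yatea-terms</targetLayer>
  <!-- YaTeA XML output, as produced by YateaExtractor (path is an assumption) -->
  <yateaFile>yatea/candidates.xml</yateaFile>
  <!-- features in which term structure information is recorded -->
  <termIdFeature>term-id</termIdFeature>
  <headFeature>head</headFeature>
  <monoHeadIdFeature>mono-head</monoHeadIdFeature>
  <modifierFeature>modifier</modifierFeature>
  <termPosFeature>pos</termPosFeature>
</yateatermsprojector>
```

With such a configuration, each annotation in the yatea-terms layer would be expected to carry the term identifier, its head and modifier identifiers, and the POS tags of its components under the chosen feature names.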
Snippet
<yateatermsprojector class="YateaTermsProjector">
<targetLayer></targetLayer>
<yateaFile></yateaFile>
</yateatermsprojector>
Mandatory parameters
targetLayer
Name of the layer that contains the match annotations.
yateaFile
YaTeA output XML file, as produced by YateaExtractor.
Optional parameters
constantAnnotationFeatures
Constant features to add to each annotation created by this module.
trieSink
If set, then YateaTermsProjector writes the compiled dictionary to the specified file.
trieSource
If set, read the compiled dictionary from the specified file. Compiled dictionaries are usually faster for large dictionaries.
allUpperCaseInsensitive
If set to true, then allow case folding on all characters in words that are all upper case.
allowJoined
If set to true, then allow arbitrary suppression of whitespace characters in the subject. For instance, the contents aminoacid matches the key amino acid.
caseInsensitive
If set to true, then allow case folding on all characters.
documentFilter
Process only documents that satisfy this expression.
headFeature
Feature where to record the matched term’s head identifier.
ignoreDiacritics
If set to true, then allow diacritic removal on all characters. For instance, the contents acide amine matches the key acide aminé.
joinDash
If set to true, then treat dash characters (-) as whitespace characters with regard to allowJoined. For instance, the contents aminoacid matches the entry amino-acid.
matchStartCaseInsensitive
If set to true, then allow case folding on the first character of the entry key.
mnpOnly
If true, then YateaTermsProjector searches only for MNP (maximal noun phrase) terms.
modifierFeature
Feature where to record the matched term’s modifier identifier.
monoHeadIdFeature
Feature where to record the matched term’s mono-head (or superhead, or single-token head) identifier.
multipleEntryBehaviour
Specifies the behavior if the lexicon contains several entries with the same key.
projectLemmas
If true, then this module searches for term lemmas instead of surface forms.
sectionFilter
Process only sections that satisfy this expression.
skipConsecutiveWhitespaces
If set to true, then allow the insertion of consecutive whitespace characters in the subject. For instance, the contents amino  acid (with two consecutive spaces) matches the entry amino acid.
skipWhitespace
If set to true, then allow arbitrary insertion of whitespace characters in the subject. For instance, the contents amino acid matches the key aminoacid.
subject
Specifies the contents to match.
substituteWhitespace
If set to true, then all whitespace characters match each other (including ‘\n’, ‘\r’, ‘\t’, and non-breaking spaces).
termIdFeature
Feature where to record the matched term’s identifier.
termLemmaFeature
Feature where to record the matched term’s lemma string.
termPosFeature
Feature where to record the POS tags of the matched term’s components.
wordStartCaseInsensitive
If set to true, then allow case folding on the first character of each word.
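Several of the matching parameters described above can be combined in one module declaration. The sketch below is only an illustration: the layer name and file path are hypothetical, and writing boolean values as the text true is an assumption about the plan syntax rather than something stated on this page.

```xml
<!-- Sketch only: values are hypothetical; boolean syntax is assumed. -->
<yateatermsprojector class="YateaTermsProjector">
  <targetLayer>yatea-terms</targetLayer>
  <yateaFile>yatea/candidates.xml</yateaFile>
  <!-- let "acide amine" in the text match the key "acide aminé" -->
  <ignoreDiacritics>true</ignoreDiacritics>
  <!-- let "aminoacid" in the text match the key "amino acid" -->
  <allowJoined>true</allowJoined>
  <!-- also let "aminoacid" match the entry "amino-acid" -->
  <joinDash>true</joinDash>
  <!-- allow case folding on the first character of each word -->
  <wordStartCaseInsensitive>true</wordStartCaseInsensitive>
</yateatermsprojector>
```

Looser matching generally increases the number of matches, so it may be worth enabling these options one at a time and inspecting the resulting annotations.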
Deprecated parameters
head
Deprecated alias for headFeature.
modifier
Deprecated alias for modifierFeature.
monoHeadId
Deprecated alias for monoHeadIdFeature.
targetLayerName
Deprecated alias for targetLayer.
termId
Deprecated alias for termIdFeature.
termLemma
Deprecated alias for termLemmaFeature.
termPOS
Deprecated alias for termPosFeature.