AlvisNLP

corpus processing engine

YateaTermsProjector

Synopsis

Searches section contents for terms extracted by YaTeA (see YateaExtractor).

Description

YateaTermsProjector reads terms from a YaTeA XML output file produced by YateaExtractor and searches for these terms in section contents, or in whatever text is specified by subject.

The parameters allowJoined, allUpperCaseInsensitive, caseInsensitive, ignoreDiacritics, joinDash, matchStartCaseInsensitive, skipConsecutiveWhitespaces, skipWhitespace and wordStartCaseInsensitive control how the keys can match the section contents.
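As an illustration, matching options are written as child elements of the module element, like the mandatory parameters. The sketch below enables case-insensitive and diacritic-insensitive matching. The layer name and file path are hypothetical, and writing Boolean values as true is an assumption about the accepted syntax.

```xml
<yateatermsprojector class="YateaTermsProjector">
    <!-- hypothetical layer name and file path -->
    <targetLayer>yateaTerms</targetLayer>
    <yateaFile>yatea/candidates.xml</yateaFile>
    <!-- relaxed matching: fold case and ignore diacritics -->
    <caseInsensitive>true</caseInsensitive>
    <ignoreDiacritics>true</ignoreDiacritics>
</yateatermsprojector>
```

With these two options, the key acide aminé would be expected to match contents such as Acide Amine.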

The subject parameter specifies which text of the section should be matched. There are two alternatives:

YateaTermsProjector creates an annotation for each matched key and adds these annotations to the layer specified by targetLayer. Term structure information can be recorded in the features specified by termIdFeature, headFeature, monoHeadIdFeature, modifierFeature, and termPosFeature. In addition, the created annotations will have the constant features specified in constantAnnotationFeatures.
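The feature names can be overridden when the defaults would clash with features set by other modules. The following is a minimal sketch, assuming hypothetical feature names prefixed with yatea- and a hypothetical layer name and file path.

```xml
<yateatermsprojector class="YateaTermsProjector">
    <!-- hypothetical layer name and file path -->
    <targetLayer>yateaTerms</targetLayer>
    <yateaFile>yatea/candidates.xml</yateaFile>
    <!-- hypothetical feature names replacing the defaults
         (term-id, head, mono-head, modifier, pos) -->
    <termIdFeature>yatea-term-id</termIdFeature>
    <headFeature>yatea-head</headFeature>
    <monoHeadIdFeature>yatea-mono-head</monoHeadIdFeature>
    <modifierFeature>yatea-modifier</modifierFeature>
    <termPosFeature>yatea-pos</termPosFeature>
</yateatermsprojector>
```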

trieSource and trieSink are not supported by YateaTermsProjector.

Snippet

<yateatermsprojector class="YateaTermsProjector">
    <targetLayer></targetLayer>
    <yateaFile></yateaFile>
</yateatermsprojector>

Mandatory parameters

targetLayer

Mandatory
Type: String

Name of the layer that contains the match annotations.

yateaFile

Mandatory

YaTeA output XML file, as produced by YateaExtractor.

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

trieSink

Optional
Type: OutputFile

If set, then YateaTermsProjector writes the compiled dictionary to the specified file.

trieSource

Optional
Type: InputFile

If set, read the compiled dictionary from the specified file. Compiled dictionaries are usually faster for large dictionaries.

allUpperCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on all characters in words that are all upper case.

allowJoined

Default value: `false`
Type: Boolean

If set to true, then allow arbitrary suppression of whitespace characters in the subject. For instance, the contents aminoacid matches the key amino acid.

caseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on all characters.

documentFilter

Default value: `true`
Type: Expression

Only process documents that satisfy this expression.

headFeature

Default value: `head`
Type: String

Feature where to record the matched term’s head identifier.

ignoreDiacritics

Default value: `false`
Type: Boolean

If set to true, then allow diacritic removal on all characters. For instance, the contents acide amine matches the key acide aminé.

joinDash

Default value: `false`
Type: Boolean

If set to true, then treat dash characters (-) as whitespace characters with regard to allowJoined. For instance, the contents aminoacid matches the entry amino-acid.
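Since joinDash is defined with regard to allowJoined, the two options are presumably meant to be set together. The following is a minimal sketch, assuming a hypothetical layer name and file path and assuming that Boolean values are written as true.

```xml
<yateatermsprojector class="YateaTermsProjector">
    <!-- hypothetical layer name and file path -->
    <targetLayer>yateaTerms</targetLayer>
    <yateaFile>yatea/candidates.xml</yateaFile>
    <!-- allow whitespace in a key to be dropped in the contents -->
    <allowJoined>true</allowJoined>
    <!-- treat dashes in a key like whitespace for allowJoined -->
    <joinDash>true</joinDash>
</yateatermsprojector>
```

Under this configuration, the contents aminoacid would be expected to match both the key amino acid and the key amino-acid.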

matchStartCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on the first character of the entry key.

mnpOnly

Default value: `false`
Type: Boolean

If true, then YateaTermsProjector only searches for MNP terms.

modifierFeature

Default value: `modifier`
Type: String

Feature where to record the matched term’s modifier identifier.

monoHeadIdFeature

Default value: `mono-head`
Type: String

Feature where to record the matched term’s mono-head (or superhead, or single-token head) identifier.

multipleEntryBehaviour

Default value: `all`

Specifies the behavior if the lexicon contains several entries with the same key.

projectLemmas

Default value: `false`
Type: Boolean

If true, then YateaTermsProjector searches for term lemmas instead of surface forms.

sectionFilter

Default value: `true`
Type: Expression

Process only sections that satisfy this expression.

skipConsecutiveWhitespaces

Default value: `false`
Type: Boolean

If set to true, then allow the insertion of consecutive whitespace characters in the subject. For instance, the contents amino acid written with several consecutive spaces between the two words matches the entry amino acid.

skipWhitespace

Default value: `false`
Type: Boolean

If set to true, then allow arbitrary insertion of whitespace characters in the subject. For instance, the contents amino acid matches the key aminoacid.

subject

Default value: `WORD`
Type: Subject

Specifies the contents to match.

substituteWhitespace

Default value: `false`
Type: Boolean

If set to true, then all whitespace characters match each other (including ‘\n’, ‘\r’, ‘\t’, and non-breaking spaces).

termIdFeature

Default value: `term-id`
Type: String

Feature where to record the matched term’s identifier.

termLemmaFeature

Default value: `lemma`
Type: String

Feature where to record the matched term’s lemma string.

termPosFeature

Default value: `pos`
Type: String

Feature where to record the POS tags of the matched term’s components.

wordStartCaseInsensitive

Default value: `false`
Type: Boolean

If set to true, then allow case folding on the first character of each word.

Deprecated parameters

Deprecated
Type: String

Deprecated alias for headFeature.

modifier

Deprecated
Type: String

Deprecated alias for modifierFeature.

monoHeadId

Deprecated
Type: String

Deprecated alias for monoHeadIdFeature.

targetLayerName

Deprecated
Type: String

Deprecated alias for targetLayer.

termId

Deprecated
Type: String

Deprecated alias for termIdFeature.

termLemma

Deprecated
Type: String

Deprecated alias for termLemmaFeature.

termPOS

Deprecated
Type: String

Deprecated alias for termPosFeature.