AlvisNLP

corpus processing engine

WoSMig

Synopsis

Performs word segmentation on section contents.

Description

WoSMig searches for word boundaries in the section contents, creates an annotation for each word, and adds it to the layer targetLayerName. The following are considered word boundaries:

If fixedFormLayerName is defined, then non-overlapping annotations in this layer are added as is to targetLayerName. The start and end positions of these annotations are treated as word boundaries, and no word boundary is searched for inside them.
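The handling of fixed forms can be illustrated with a minimal Python sketch. This is not the WoSMig implementation: the function `segment`, its arguments, and the boundary rules (whitespace and the default `punctuations` characters) are assumptions made for illustration only, and balanced punctuation is ignored. The type labels reuse the default values of wordType, punctuationType and fixedType.

```python
import re

def segment(text, fixed_spans=(), punctuations="?.!;,:-"):
    """Return (start, end, type) triples for a toy word segmentation.

    fixed_spans is assumed to be a sorted list of non-overlapping
    (start, end) offsets that must not be split (hypothetical input).
    """
    punct = re.escape(punctuations)
    # Either a single punctuation character, or a run of characters
    # that are neither whitespace nor punctuation (assumed boundary rules).
    token = re.compile(f"[{punct}]|[^\\s{punct}]+")
    out = []

    def scan(lo, hi):
        # Look for words only between fixed forms, never inside them.
        for m in token.finditer(text, lo, hi):
            is_punct = len(m.group()) == 1 and m.group() in punctuations
            out.append((m.start(), m.end(), "punctuation" if is_punct else "word"))

    pos = 0
    for start, end in fixed_spans:
        scan(pos, start)
        out.append((start, end, "fixed"))  # copied as is
        pos = end
    scan(pos, len(text))
    return out

print(segment("New York is big.", [(0, 8)]))
# [(0, 8, 'fixed'), (9, 11, 'word'), (12, 15, 'word'), (15, 16, 'punctuation')]
```

In this sketch "New York" stays a single annotation because its span comes from the fixed form list, while the rest of the text is segmented normally.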

The created annotations have the feature annotationTypeFeature with a value corresponding to the word type:

The eosStatusFeature feature contains the end-of-sentence status of the word:

Snippet

<wosmig class="WoSMig">
</wosmig>

Mandatory parameters

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

fixedFormLayerName

Optional
Type: String

Name of the layer containing annotations that should not be split into several words.

annotationComparator

Default value: `length`

Comparator to use when removing overlapping fixed form annotations.

annotationTypeFeature

Default value: `wordType`
Type: String

Name of the feature in which to store the word type (word, punctuation, etc.).

balancedPunctuations

Default value: `()[]{}""`
Type: String

Balanced punctuation characters, given as pairs: each opening character must be immediately followed by its corresponding closing character. If the parameter value has an odd length, a warning is issued and the last character is ignored.
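As a rough illustration of how such a value can be read, the following Python sketch splits the string into opening/closing pairs and drops a trailing unpaired character with a warning. The function name `parse_balanced_pairs` is hypothetical and the code is not taken from WoSMig; it only mirrors the behaviour described above.

```python
import warnings

def parse_balanced_pairs(spec: str) -> dict[str, str]:
    """Map each opening character to the closing character that follows it."""
    if len(spec) % 2 == 1:
        # Odd length: warn and ignore the last character, as described above.
        warnings.warn(f"odd-length value {spec!r}: ignoring {spec[-1]!r}")
        spec = spec[:-1]
    return {spec[i]: spec[i + 1] for i in range(0, len(spec), 2)}

print(parse_balanced_pairs('()[]{}""'))
# {'(': ')', '[': ']', '{': '}', '"': '"'}
```

With the default value, the double quote acts as both the opening and the closing character of its pair.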

documentFilter

Default value: `true`
Type: Expression

Process only documents that satisfy this expression.

fixedType

Default value: `fixed`
Type: String

Value of the type feature for annotations copied from fixed forms.

punctuationType

Default value: `punctuation`
Type: String

Value of the type feature for punctuation annotations.

punctuations

Default value: `?.!;,:-`
Type: String

List of punctuation characters, whether weak or strong.

sectionFilter

Default value: `true`
Type: Expression

Process only sections that satisfy this expression.

targetLayerName

Default value: `words`
Type: String

Name of the layer in which to store word annotations.

wordType

Default value: `word`
Type: String

Value of the type feature for regular word annotations.