WoSMig
Synopsis
Performs word segmentation on section contents.
Description
WoSMig searches for word boundaries in the section contents, creates an annotation for each word, and adds it to the layer targetLayer. The following are considered word boundaries:
- runs of consecutive whitespace characters: space, newline, carriage return, and horizontal tab;
- the positions before and after each punctuation character defined in punctuations and balancedPunctuations. A punctuation character therefore always forms a single-character word. A balanced punctuation character breaks a word only if its matching punctuation character is found.
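The two boundary rules above can be illustrated with a minimal Python sketch. It is not the WoSMig implementation: the function name segment and the explicit punctuations argument are hypothetical, and balanced punctuations and fixed forms are deliberately left out.

```python
def segment(text, punctuations):
    """Return (start, end, type) triples for the words found in text.

    Illustrative only: handles whitespace and plain punctuation,
    not balanced punctuation or fixed forms.
    """
    whitespace = " \n\r\t"
    tokens = []
    start = None  # start offset of the word currently being read, if any
    for i, ch in enumerate(text):
        if ch in whitespace or ch in punctuations:
            # Any whitespace or punctuation character closes the current word.
            if start is not None:
                tokens.append((start, i, "word"))
                start = None
            # A punctuation character is itself a single-character word.
            if ch in punctuations:
                tokens.append((i, i + 1, "punctuation"))
        elif start is None:
            start = i
    if start is not None:
        tokens.append((start, len(text), "word"))
    return tokens


print(segment("Hello, world!\n", ",!"))
```

Each triple gives the start offset, the end offset, and the word type, which mirrors the annotation and type feature that the module creates.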
If fixedFormLayer is defined, then non-overlapping annotations in this layer are added as is to targetLayer. The start and end positions of these annotations are treated as word boundaries, and no word boundary is searched for inside them.
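As a rough sketch of the fixed-form behaviour, the Python function below copies each fixed span unchanged and segments only the text between spans. The name segment_with_fixed_forms, the default punctuation list, and the type value "fixed" are assumptions made for illustration. The spans are assumed to be already non-overlapping, so the overlap removal controlled by annotationComparator is not modelled, and balanced punctuations are ignored.

```python
import re


def segment_with_fixed_forms(text, fixed_spans, punctuations=",.;:!?"):
    """Return (start, end, type) triples, keeping fixed spans whole.

    fixed_spans: non-overlapping (start, end) offsets (assumption).
    punctuations: must be non-empty in this sketch.
    """
    escaped = re.escape(punctuations)
    # Either one punctuation character, or a run of characters that are
    # neither whitespace nor punctuation.
    token_re = re.compile("[" + escaped + "]|[^ \n\r\t" + escaped + "]+")
    tokens = []
    pos = 0
    # A zero-length sentinel span makes the loop also process the tail.
    for start, end in sorted(fixed_spans) + [(len(text), len(text))]:
        # Segment the gap before the fixed form; endpos=start guarantees
        # that no word runs across the fixed-form boundary.
        for m in token_re.finditer(text, pos, start):
            is_punct = len(m.group()) == 1 and m.group() in punctuations
            tokens.append((m.start(), m.end(), "punctuation" if is_punct else "word"))
        if start < end:
            tokens.append((start, end, "fixed"))  # copied as is, never split
        pos = end
    return tokens


print(segment_with_fixed_forms("in vitro, cells grow", [(0, 8)]))
```

Here the span (0, 8) covering "in vitro" is kept as a single annotation, although the whitespace inside it would otherwise be a word boundary.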
The created annotations have the feature annotationTypeFeature with a value corresponding to the word type:
- punctuation: if the word is a single-character punctuation;
- word: if the word is a plain non-punctuation word.
Snippet
<wosmig class="WoSMig">
</wosmig>
Mandatory parameters
Optional parameters
constantAnnotationFeatures
Constant features to add to each annotation created by this module.
fixedFormLayer
Name of the layer containing annotations that should not be split into several words.
annotationComparator
Comparator to use when removing overlapping fixed form annotations.
annotationTypeFeature
Name of the feature in which to store the word type (word, punctuation, etc.).
balancedPunctuations
Balanced punctuation characters. In the parameter value, each opening punctuation character must be immediately followed by its corresponding closing punctuation character. If the value has an odd length, a warning is issued and the last character is ignored.
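The pairing rule can be sketched in Python as follows. The function name parse_balanced and the example value are hypothetical; the sketch only shows how a value laid out as opening/closing pairs, with an odd trailing character dropped, could be read.

```python
import warnings


def parse_balanced(value):
    """Map each opening character to the closing character that follows it."""
    if len(value) % 2 == 1:
        # Odd length: warn and ignore the last, unpaired character.
        warnings.warn("odd-length value, ignoring last character %r" % value[-1])
        value = value[:-1]
    return {value[i]: value[i + 1] for i in range(0, len(value), 2)}


print(parse_balanced("()[]"))
```

With the value "()[]", the opening characters ( and [ are paired with ) and ] respectively.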
documentFilter
Process only documents that satisfy this expression.
fixedType
Value of the type feature for annotations copied from fixed forms.
punctuationType
Value of the type feature for punctuation annotations.
punctuations
List of punctuation characters, whether weak or strong.
sectionFilter
Process only sections that satisfy this expression.
targetLayer
Layer in which to store word annotations.
wordType
Value of the type feature for regular word annotations.
Deprecated parameters
fixedFormLayerName
Deprecated alias for fixedFormLayer.
targetLayerName
Deprecated alias for targetLayer.