AlvisNLP

corpus processing engine

MultiRegExp

Synopsis

Searches for several regular expressions in section contents.

This module is experimental.

Description

MultiRegExp attempts to match regular expression patterns read from patternsFile against section contents. The patterns file is a CSV file in which one column contains the patterns. The patterns must follow the Java Pattern syntax.

MultiRegExp creates an annotation in targetLayer for each match. Additionally, MultiRegExp adds to the annotation one feature for each column of the row that holds the matched pattern.

Matches of the same pattern do not overlap; however, matches of different patterns may overlap.
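The overlap behavior can be illustrated with plain `java.util.regex`. This is only a sketch: it assumes each pattern is scanned independently with a standard `Matcher.find()` loop, which this page does not confirm. The class name `OverlapSketch` and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapSketch {
    // Collects the [start, end) spans of successive matches of one pattern.
    // Matcher.find() resumes after the previous match, so spans of the
    // same pattern never overlap.
    static List<int[]> spans(String regex, String text) {
        List<int[]> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aaaa"; // hypothetical section contents
        // Each pattern is scanned on its own, so spans of different
        // patterns are free to overlap each other.
        for (String regex : new String[] { "aa", "aaa" }) {
            for (int[] s : spans(regex, text)) {
                System.out.println(regex + ": " + s[0] + "-" + s[1]);
            }
        }
    }
}
```

Here the pattern `aa` yields the spans 0-2 and 2-4 (the span 1-3 is never reported), while `aaa` yields 0-3, which overlaps both spans of the first pattern.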

Snippet

<multiregexp class="MultiRegExp">
    <patternsFile></patternsFile>
    <targetLayer></targetLayer>
    <valueFeatures></valueFeatures>
</multiregexp>

Mandatory parameters

patternsFile

Mandatory

CSV file containing patterns.

targetLayer

Mandatory
Type: String

Layer in which to place the annotations.

valueFeatures

Mandatory
Type: String[]

Names of the features created for each annotation. The names correspond, in order, to the columns of patternsFile, including the patterns column.
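The correspondence between valueFeatures and CSV columns can be sketched as a positional pairing. This is an illustration, not the module's code: the class `ValueFeaturesSketch`, the feature names, and the sample row are hypothetical, and how the module treats rows with more or fewer columns than names is not stated here (the sketch simply stops at the shorter of the two).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ValueFeaturesSketch {
    // Pairs the i-th feature name with the i-th column value of the row
    // that holds the matched pattern.
    static Map<String, String> features(List<String> valueFeatures, List<String> row) {
        Map<String, String> result = new LinkedHashMap<>();
        // Assumption: extra names or extra columns are ignored.
        int n = Math.min(valueFeatures.size(), row.size());
        for (int i = 0; i < n; i++) {
            result.put(valueFeatures.get(i), row.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical row: column 0 is the pattern, column 1 a label.
        List<String> names = List.of("pattern", "type");
        List<String> row = List.of("kinases?", "protein");
        System.out.println(features(names, row));
    }
}
```

Under this reading, every annotation produced by the pattern `kinases?` would carry a `pattern` feature and a `type` feature taken from the same CSV row.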

Optional parameters

constantAnnotationFeatures

Optional
Type: Mapping

Constant features to add to each annotation created by this module.

delimiter

Optional
Type: Character

Column delimiter of the CSV file.

escape

Optional
Type: Character

Character used to escape characters in column values.

headerLine

Optional
Type: Boolean

Whether to skip the first row.

quote

Optional
Type: Character

Character used to quote the column values.

trimValues

Optional
Type: Boolean

Whether to trim leading and trailing whitespace from column values.

baseFormat

Default value: `Delimiter=<,> QuoteChar=<"> RecordSeparator=< > EmptyLines:ignored SkipHeaderRecord:false`
Type: CSVFormat

Base format of the CSV file. Must be one of: default, excel, mysql, rfc4180, oracle, postgresql_csv, postgresql_text, tdf, tab.

caseInsensitive

Default value: `false`
Type: Boolean

Whether matching is case-insensitive.
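In Java Pattern terms, case-insensitive matching is usually obtained with the `Pattern.CASE_INSENSITIVE` flag. The sketch below assumes caseInsensitive maps to that flag, which this page does not state; the class name `CaseSketch` and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

public class CaseSketch {
    // Returns true if the pattern is found anywhere in the text.
    static boolean found(String regex, String text, boolean caseInsensitive) {
        // Assumption: the option corresponds to Pattern.CASE_INSENSITIVE.
        int flags = caseInsensitive ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Bacillus subtilis"; // hypothetical section contents
        System.out.println(found("bacillus", text, false));
        System.out.println(found("bacillus", text, true));
    }
}
```

Note that `Pattern.CASE_INSENSITIVE` on its own only folds US-ASCII characters; Java needs `Pattern.UNICODE_CASE` as well for Unicode-aware folding. Whether the module sets that second flag is not documented here.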

documentFilter

Default value: `true`
Type: Expression

Only process documents that satisfy this expression.

keyColumn

Default value: `0`
Type: Integer

Index of the column that contains the patterns. The first column is 0.

matchWordBoundaries

Default value: `false`
Type: Boolean

Only create annotations for matches that fit exactly between word boundaries.
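One way to approximate this behavior with Java Pattern syntax is to wrap each pattern between `\b` anchors. This is a sketch under that assumption only: the module may instead filter matches after the fact, which can give different results in edge cases. The class name `WordBoundarySketch` and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundarySketch {
    // Returns the start offsets of all matches of the pattern in the text.
    static List<Integer> starts(String regex, String text, boolean wordBoundaries) {
        // Assumption: word-boundary matching is emulated by anchoring the
        // pattern with \b on both sides; the non-capturing group keeps
        // alternations inside the pattern intact.
        String effective = wordBoundaries ? "\\b(?:" + regex + ")\\b" : regex;
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(effective).matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "cat concatenate cat"; // hypothetical section contents
        System.out.println(starts("cat", text, false));
        System.out.println(starts("cat", text, true));
    }
}
```

Without the anchors, `cat` is also found inside `concatenate` (offset 7); with them, only the two standalone words at offsets 0 and 16 remain.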

sectionFilter

Default value: `true`
Type: Expression

Process only sections that satisfy this expression.

Deprecated parameters