AlvisNLP

corpus processing engine

MultiRegExp

Synopsis

Search for several regular expressions in section contents.

This module is experimental.

Description

MultiRegExp attempts to match regular expression patterns read from patternsFile against section contents. The patterns file is a CSV file in which one column contains the patterns. The patterns must follow the Java Pattern syntax.

MultiRegExp creates an annotation in targetLayer for each match. It also adds to each annotation one feature per column of the patterns file row that holds the matched pattern.

Matches of the same pattern do not overlap each other; matches of different patterns may overlap.
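The overlap rule can be illustrated with plain java.util.regex, where Matcher.find() resumes scanning after the end of the previous match. This is a minimal sketch, assuming MultiRegExp scans each pattern independently in a comparable way; the class OverlapSketch and the helper spans are hypothetical names for illustration, not part of AlvisNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not AlvisNLP code.
public class OverlapSketch {
    // Returns the "start-end" offsets of every match of one pattern.
    // Matcher.find() resumes after the previous match, so the spans
    // returned for a single pattern never overlap each other.
    public static List<String> spans(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aaaa";
        System.out.println(spans("aa", text));   // prints [0-2, 2-4]
        System.out.println(spans("aaa", text));  // prints [0-3]
    }
}
```

Each pattern yields non-overlapping spans on its own, yet the span 0-3 of the second pattern overlaps both spans of the first.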

Snippet

<multiregexp class="MultiRegExp">
    <patternsFile></patternsFile>
    <targetLayer></targetLayer>
    <valueFeatures></valueFeatures>
</multiregexp>

Mandatory parameters

patternsFile

Mandatory

Type: SourceStream

CSV file containing patterns.

targetLayer

Mandatory

Type: String

Layer in which the annotations are created.

valueFeatures

Mandatory

Names of the features created for each annotation, one per column of patternsFile, including the patterns column.
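The correspondence between valueFeatures and the columns can be pictured as pairing names and values by index. The following is a minimal sketch under that assumption; ValueFeaturesSketch, the feature names, and the sample row are all hypothetical, and how the module treats rows with more or fewer columns than names is not stated here (the sketch simply ignores the extras).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not AlvisNLP code.
public class ValueFeaturesSketch {
    // Pairs each column value of a CSV row with the feature name at the same index.
    // Assumption: extra names or extra columns are ignored.
    public static Map<String, String> features(String[] valueFeatures, String[] row) {
        Map<String, String> result = new LinkedHashMap<>();
        int n = Math.min(valueFeatures.length, row.length);
        for (int i = 0; i < n; i++) {
            result.put(valueFeatures[i], row[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"pattern", "type"};   // hypothetical valueFeatures
        String[] row = {"ab+", "example"};      // hypothetical row, pattern in column 0
        System.out.println(features(names, row));
    }
}
```

With these hypothetical values, an annotation created for a match of `ab+` would carry both a `pattern` feature and a `type` feature.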

Optional parameters

constantAnnotationFeatures

Optional

Type: Mapping

Constant features to add to each annotation created by this module.

delimiter

Optional

Type: Character

Column delimiter of CSV file.

escape

Optional

Type: Character

Character used to escape characters in column values.

headerLine

Optional

Type: Boolean

Whether to skip the first row.

quote

Optional

Type: Character

Character used to quote the column values.

trimValues

Optional

Type: Boolean

Whether to trim leading and trailing whitespace from column values.

baseFormat

Default value: `Delimiter=<,> QuoteChar=<"> RecordSeparator=< > EmptyLines:ignored SkipHeaderRecord:false`

Type: CSVFormat

Base format of the CSV file. Must be one of: default, excel, mysql, rfc4180, oracle, postgresql_csv, postgresql_text, tdf, tab.

caseInsensitive

Default value: `false`

Type: Boolean

Whether matching is case-insensitive.
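In java.util.regex, case-insensitive matching is requested with the Pattern.CASE_INSENSITIVE flag. The sketch below shows the effect; it assumes caseInsensitive maps to that flag, which this page does not confirm, and CaseSketch and found are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not AlvisNLP code.
public class CaseSketch {
    // Reports whether the pattern occurs anywhere in the text,
    // optionally ignoring case (assumed equivalent of caseInsensitive).
    public static boolean found(String regex, String text, boolean caseInsensitive) {
        int flags = caseInsensitive ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Protein Kinase C";
        System.out.println(found("kinase", text, false)); // prints false
        System.out.println(found("kinase", text, true));  // prints true
    }
}
```

Note that Pattern.CASE_INSENSITIVE on its own only folds US-ASCII characters; Java needs Pattern.UNICODE_CASE in addition for other scripts.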

documentFilter

Default value: `true`

Type: Expression

Only process documents that satisfy this expression.

keyColumn

Default value: `0`

Type: Integer

Index of the column that contains the patterns. The first column is 0.

matchWordBoundaries

Default value: `false`

Type: Boolean

Only create annotations for matches that fit exactly between word boundaries.
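One way to read this option is as a filter applied after matching: a match is kept only when both of its ends fall on a word boundary. The sketch below implements that reading, assuming a word character is a letter or a digit; the module's actual definition of a word boundary may differ, and WordBoundarySketch, isBoundary, and boundedMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not AlvisNLP code.
public class WordBoundarySketch {
    // True when exactly one side of position pos is a letter or digit,
    // treating the start and end of the text as non-word.
    static boolean isBoundary(String text, int pos) {
        boolean before = pos > 0 && Character.isLetterOrDigit(text.charAt(pos - 1));
        boolean after = pos < text.length() && Character.isLetterOrDigit(text.charAt(pos));
        return before != after;
    }

    // Keeps only the matches whose start and end both sit on a word boundary.
    public static List<String> boundedMatches(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            if (isBoundary(text, m.start()) && isBoundary(text, m.end())) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "cat" also occurs inside "concat" and "cats", but only the
        // standalone word passes the filter.
        System.out.println(boundedMatches("cat", "cat concat cats")); // prints [cat]
    }
}
```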

sectionFilter

Default value: `true`

Type: Expression

Process only sections that satisfy this expression.

Deprecated parameters