AlvisNLP

corpus processing engine

How to write a plan

The basics

The plan file is an XML file that specifies the sequence of modules and their parameters.

The top-level tag of a plan file is alvisnlp-plan:

<alvisnlp-plan id="foo">
...
</alvisnlp-plan>

The id attribute is mandatory; its value is the identifier of the plan. Choose a value that is meaningful to you.

Then, the alvisnlp-plan contains several module tags:

<alvisnlp-plan id="foo">
<read class="TextFileReader">
...
</read>

<words class="WoSMig">
...
</words>

...
</alvisnlp-plan>

Each tag specifies an AlvisNLP module that will process the corpus; the tag name (read, words) is the identifier of the module. The class attribute is mandatory: it specifies the class of the module, that is, what the module does to the corpus. The value must be a supported module class.

Finally, each module tag contains tags that specify the parameter values for the module:

<alvisnlp-plan id="foo">
<read class="TextFileReader">
<sourcePath>...</sourcePath>
<sectionName>contents</sectionName>
</read>

<words class="WoSMig"/>

...
</alvisnlp-plan>

In this example, we set two parameters of the read module, sourcePath and sectionName, and none in words. The documentation of each module class specifies which parameters are supported, which are mandatory, and what their default values are.

Parameter value conversion

The type of each parameter, and how the contents of the tag are converted into that type, are documented in the module class description. This section reviews the most commonly used parameter types:

String

The contents of the tag must be a character string. Leading and trailing whitespace is trimmed.

Integer

The contents of the tag must be an integer written in base 10, with an optional leading sign (+ or -). Leading and trailing whitespace is trimmed.

Boolean

The contents of the tag must be one of the following strings:

true values    false values
true           false
on             off
yes            no

Leading and trailing whitespace is trimmed.
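
As an illustration of the String, Integer and Boolean conversions, here is a minimal sketch of a module tag. The module identifier (example) and the parameter names (label, maxLength, verbose) are hypothetical and are not taken from any actual module class; check the module class documentation for the real parameter names and types.

```xml
<!-- Hypothetical module and parameter names, for illustration only -->
<example class="...">
  <!-- String: surrounding whitespace is trimmed, the value is "my corpus" -->
  <label>  my corpus  </label>

  <!-- Integer: base 10 with an optional leading sign -->
  <maxLength>+250</maxLength>

  <!-- Boolean: one of true, on, yes (true) or false, off, no (false) -->
  <verbose>yes</verbose>
</example>
```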

SourceStream and TargetStream

SourceStream and TargetStream values represent a file, a directory or an internet resource. A SourceStream is an input resource, while a TargetStream is an output. The contents of the tag are either a path in the local filesystem or a URL to a remote resource. The conversion supports the main internet protocols (http, https, ftp) as well as the standard streams (stdin, stdout, stderr). It also supports concatenating several resources, selecting all files in a directory with filename filters, and compression schemes.
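
The two basic forms of a stream value, a local path and a URL, might look like the sketch below. It assumes that the sourcePath parameter of TextFileReader is of type SourceStream; the path and the URL are placeholders. The syntax for concatenation, filename filters and compression is not shown here; refer to the converter reference for it.

```xml
<!-- Assumption: sourcePath is a SourceStream parameter; both values are placeholders -->

<!-- Input read from a path in the local filesystem -->
<read class="TextFileReader">
  <sourcePath>/path/to/corpus/document.txt</sourcePath>
</read>

<!-- Same module, input read from a remote resource over https -->
<read class="TextFileReader">
  <sourcePath>https://example.org/corpus/document.txt</sourcePath>
</read>
```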

File, InputFile, InputDirectory, OutputDirectory, OutputFile, ExecutableFile

These types represent resources in the local filesystem; their value cannot be a remote URL. AlvisNLP checks each value according to its type:

parameter type    file exists   file type                  permissions
InputDirectory    yes           directory                  rx
InputFile         yes           regular                    r
OutputDirectory   yes (1)       directory (1)              rwx (1)
OutputFile        yes (1)       directory or regular (2)   rwx (2)
ExecutableFile    yes           regular                    rx
File              no            regular (3)

(1) some ancestor must exist and be a writable directory
(2) if the file exists then it must be regular and writable, the innermost existing ancestor must be a writable directory
(3) if the file exists then it must be regular

Expression

Expression parameters are evaluated only while the module is processing. Expressions let you set values that depend on the state of the corpus. The value of the tag must follow the element expression syntax. See [[Element expression examples]] and [[Element expression reference]].

Arrays

Array parameter types can be recognized by the pair of brackets ([]) at the end of the type name. An array value is a sequence of values of the same type. You can either specify the elements separated by commas (,), or place each element inside a nested tag (of arbitrary name).

The parameter tag may specify an alternate separator character with the separator attribute.
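
The three ways of writing an array value can be sketched as follows. The parameter name (names) is hypothetical and stands for any parameter of type String[]; the nested tag name (item) is arbitrary. The three tags are alternatives that should yield the same three elements; a real plan would use only one of them.

```xml
<!-- "names" is a hypothetical String[] parameter -->

<!-- Elements separated by commas -->
<names>alpha,beta,gamma</names>

<!-- One nested tag per element; the tag name is arbitrary -->
<names>
  <item>alpha</item>
  <item>beta</item>
  <item>gamma</item>
</names>

<!-- Alternate separator character -->
<names separator=";">alpha;beta;gamma</names>
```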

Mappings

Mapping parameter types map strings to values, where all values of the mapping are of the same type. The mapping can be specified by separating the entries with commas (,) and separating each key from its value with an equal sign (=). Alternatively, each entry can be specified with a tag whose name is the entry key and whose contents are converted into the value.

The parameter tag may specify an alternate separator character with the separator attribute, and an alternate key-value separator with the qualifier attribute.
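
The equivalent notations for a mapping can be sketched as follows. The parameter name (weights) is hypothetical and stands for any mapping from strings to integers, and the keys and values are made up. The three tags are alternatives; a real plan would use only one of them.

```xml
<!-- "weights" is a hypothetical mapping parameter with Integer values -->

<!-- Entries separated by commas, key and value separated by = -->
<weights>noun=3,verb=2</weights>

<!-- One tag per entry: the tag name is the key, the contents are the value -->
<weights>
  <noun>3</noun>
  <verb>2</verb>
</weights>

<!-- Alternate entry separator and key-value separator -->
<weights separator=";" qualifier=":">noun:3;verb:2</weights>
```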

Complex types

Some modules accept parameters with more complex and composite types. Refer to the documentation in the converter reference.

Sequences

Sequences are sub-parts of the plan that contain modules (or other sequences):

<alvisnlp-plan id="foo">
<read class="TextFileReader">
<sourcePath>...</sourcePath>
<sectionName>contents</sectionName>
</read>

<segmentation>
<words class="WoSMig"/>
<sentences class="SeSMig"/>
</segmentation>

...
</alvisnlp-plan>

Sequences do not alter the order of processing; their purpose is to organize modules into logical bundles. Sequences also affect the logging, which may make the AlvisNLP log easier to read.

Plan import

Plans can be reused inside other plans:

<alvisnlp-plan id="foo">
<read class="TextFileReader">
<sourcePath>...</sourcePath>
<sectionName>contents</sectionName>
</read>

<import file="/path/to/another/plan.xml"/>
...
</alvisnlp-plan>

In this example, all the modules specified in /path/to/another/plan.xml will process the corpus as if the plan file had been included.

Plan-level parameters

You can define parameters for the whole plan, so that they can be set when the plan is imported.

A plan-level parameter looks like this in project_species.xml:

<param name="speciesFile">
<alias module="project.species" param="dictFile" />
</param>

<project>
<species class="SimpleContentsProjector">
<!-- ... snip ... -->
</species>
</project>

An import of this file could look like:

<import file="project_species.xml">
<speciesFile value="/bibdev/resources/..."/>
</import>

This imports the plan specified in project_species.xml and sets the parameter speciesFile. Since speciesFile is defined as an alias of the dictFile parameter in project.species, dictFile receives this value.

A plan-level parameter can be an alias for several parameters in different modules. When the plan is imported and the parameter is set, all aliased parameters receive the same value. You have no excuses left: make modular plans!
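
A plan-level parameter with several aliases might be declared as in the sketch below. It assumes that a single param tag accepts more than one alias tag, which the text above implies but does not show. The second module (normalize.species) and its dictFile parameter are hypothetical.

```xml
<!-- Assumption: one param tag may contain several alias tags.
     The module normalize.species is hypothetical. -->
<param name="speciesFile">
  <alias module="project.species" param="dictFile"/>
  <alias module="normalize.species" param="dictFile"/>
</param>
```

Under this assumption, setting speciesFile in an import would give the same value to dictFile in both modules.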

More on parameters

The parameter tag may have attributes that change the conversion:

Option                Effect
inhibitCheck="true"   prevents AlvisNLP from checking this parameter value; for instance, it will not check that the file of an InputFile parameter exists.
separator="C"         sets the separator character between array elements or mapping entries (default: ,).
qualifier="C"         sets the separator character between the key and the value of a mapping entry (default: =).
trim="false"          prevents AlvisNLP from trimming leading and trailing whitespace off the parameter value.
load="..."            loads the specified file. This file must be an XML file; AlvisNLP sets the parameter value as if the parameter tag were the root element of this file. This attribute is useful for complex parameter values.

More attributes may be supported for the conversion to specific types.
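
A minimal sketch of these attributes in use, built on the read module from the earlier examples. It assumes that sourcePath is checked for existence by default; the whitespace around the sectionName value is only there to show the effect of trim.

```xml
<read class="TextFileReader">
  <!-- Skip the checks on this value, e.g. when the resource is not available yet -->
  <sourcePath inhibitCheck="true">...</sourcePath>

  <!-- Keep the leading and trailing spaces: the value is " contents " -->
  <sectionName trim="false"> contents </sectionName>
</read>
```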

Command-line control

The plan, especially parameter values, can be controlled from the command line.

-param

alvisnlp -param MODULE PARAM VALUE

The -param option sets the value of parameter PARAM in MODULE. MODULE is the identifier of a module specified in the plan. If the module is inside a sequence, then its identifier is in the form SEQUENCE.MODULE.
VALUE is a string; it is converted as if it were the contents of the parameter tag.

-xparam

alvisnlp -xparam MODULE PARAM XVALUE

The -xparam option behaves the same way as -param but it expects an XML tag instead of a string value. This option is useful if you want to set parameters with conversion options.

-feat

alvisnlp -feat KEY VALUE

The -feat option adds a feature (a key-value pair) to the corpus before processing starts. Expression parameters can read the value of this feature to alter the behavior of modules.

-entity

alvisnlp -entity NAME REPLACEMENT

This option defines an XML entity that is used when the plan file is parsed.
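
A plan can then reference the entity with the usual XML entity syntax. In the sketch below, the entity name (corpusDir) and the file name are hypothetical; the entity would be defined by passing -entity corpusDir /path/to/corpus on the command line. Since the entity is not declared in the file itself, the plan cannot be parsed unless the entity is supplied this way.

```xml
<alvisnlp-plan id="foo">
  <read class="TextFileReader">
    <!-- &corpusDir; is replaced with the REPLACEMENT given on the command line -->
    <sourcePath>&corpusDir;/document.txt</sourcePath>
    <sectionName>contents</sectionName>
  </read>
</alvisnlp-plan>
```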

-environmentEntities

alvisnlp -environmentEntities

This option defines an XML entity for each environment variable.