AlvisNLP

corpus processing engine

OpenNLPDocumentCategorizerTrain

Synopsis

Train a document categorizer using the OpenNLP library.

This module is experimental.

Description

OpenNLPDocumentCategorizerTrain trains a document categorizer using the OpenNLP library. The training documents are selected by `documents`, and the class of each document is read from the feature named by `categoryFeature`. The classifier algorithm uses the document content specified by `tokens` and `form`.

By default the features are bag-of-words (single-word) features; these can be deactivated by setting `bagOfWords` to false. Additionally, `nGrams` can be set to add n-gram features.

The classifier is stored in the file specified by `model`. This file can be used by `OpenNLPDocumentCategorizer`.

Snippet

<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
    <categoryFeature></categoryFeature>
    <language></language>
    <model></model>
</opennlpdocumentcategorizertrain>
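
As a purely illustrative sketch, the snippet above could be filled in as follows. The feature name `category` and the file name `categorizer.bin` are assumptions chosen for the example, not values defined by the module; `en` is the ISO 639-1 code for English.

```xml
<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
    <!-- hypothetical feature holding the class label of each document -->
    <categoryFeature>category</categoryFeature>
    <!-- ISO 639-1 code; "en" assumes the documents are in English -->
    <language>en</language>
    <!-- hypothetical output file for the trained classifier -->
    <model>categorizer.bin</model>
</opennlpdocumentcategorizertrain>
```

All optional parameters are left unset here, so the documented defaults would apply.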

Mandatory parameters

categoryFeature

Mandatory
Type: String

Feature from which the category is read.

language

Mandatory
Type: String

Language of the documents (ISO 639-1 two-letter code).

model

Mandatory

File in which to store the classifier.

Optional parameters

classWeights

Optional

Weight of the samples of each class. This parameter is useful to compensate for unbalanced training sets. The default weight is 1.

nGrams

Optional
Type: Integer

Maximum size of n-gram features (minimum is 2). If not set, n-gram features are not used.

algorithm

Default value: `PERCEPTRON`

Categorization algorithm. Must be one of:

bagOfWords

Default value: `true`
Type: Boolean

Whether to generate single-word features.
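
To illustrate how `bagOfWords` and `nGrams` combine, the fragment below, meant to sit inside the module element, sketches a configuration that switches off single-word features and relies on n-gram features only. The maximum size of 3 is an arbitrary value picked for the example.

```xml
<!-- disable the default single-word (bag-of-words) features -->
<bagOfWords>false</bagOfWords>
<!-- maximum n-gram size; 3 is an illustrative value, the documented minimum is 2 -->
<nGrams>3</nGrams>
```

Leaving `bagOfWords` at its default of `true` while setting `nGrams` would instead add n-gram features on top of the single-word ones.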

documents

Default value: `documents`
Type: Expression

Elements to classify. This expression is evaluated from the corpus.

form

Default value: `@form`
Type: Expression

Form of the token. This expression is evaluated as a string from the token.

iterations

Default value: `100`
Type: Integer

Number of learning iterations.
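
As a small sketch, the learning settings could be written explicitly inside the module element as shown below. `PERCEPTRON` is the documented default algorithm; the iteration count of 200 is an arbitrary value chosen for illustration, not a recommendation.

```xml
<!-- documented default algorithm, written out explicitly -->
<algorithm>PERCEPTRON</algorithm>
<!-- illustrative value; the default is 100 -->
<iterations>200</iterations>
```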

tokens

Default value: `sections.layer:words`
Type: Expression

Tokens of the elements to classify. This expression is evaluated as a list of elements from the element to classify.
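
The three expression parameters work together: `documents` is evaluated from the corpus, `tokens` from each selected document, and `form` from each token. The fragment below simply writes out the documented default values, as a sketch of where each expression goes inside the module element.

```xml
<!-- elements to classify, evaluated from the corpus -->
<documents>documents</documents>
<!-- tokens of each element: annotations in the "words" layer of its sections -->
<tokens>sections.layer:words</tokens>
<!-- string used for each token: its "form" feature -->
<form>@form</form>
```

Replacing `@form` with another token feature (for example a hypothetical `@lemma`, assuming an earlier module sets such a feature) would train the classifier on that value instead of the surface form.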

Deprecated parameters