OpenNLPDocumentCategorizerTrain

Synopsis

Train a document categorizer using the OpenNLP library.

This module is experimental.

Description

OpenNLPDocumentCategorizerTrain trains a document categorizer using the OpenNLP library. The training documents are selected by documents, and the class of each document is read from the feature named by categoryFeature. The classifier algorithm uses the document content specified by tokens and form.

By default, the features are bag-of-words (single-word) features; these can be deactivated with bagOfWords. Additionally, nGrams can be set to add n-gram features.

The classifier is stored in model. This file can be used by OpenNLPDocumentCategorizer.

Snippet

<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
    <categoryFeature></categoryFeature>
    <language></language>
    <model></model>
</opennlpdocumentcategorizertrain>
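
For illustration, the snippet above could be filled in as follows. This is a minimal sketch: the feature name category and the file name categorizer.bin are hypothetical values chosen for the example, not defaults of the module.

```xml
<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
    <!-- Hypothetical feature name: each training document is assumed
         to carry its class label in a feature called "category". -->
    <categoryFeature>category</categoryFeature>
    <!-- ISO 639-1 two-letter code; "en" is English. -->
    <language>en</language>
    <!-- Hypothetical output path for the trained classifier. -->
    <model>categorizer.bin</model>
</opennlpdocumentcategorizertrain>
```

With only the mandatory parameters set, all optional parameters keep the default values listed below.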

Mandatory parameters

categoryFeature

Mandatory

Type: String

Name of the feature from which the category is read.

language

Mandatory

Type: String

Language of the documents (ISO 639-1 two-letter code).

model

Mandatory

Type: TargetStream

File in which the trained classifier is stored.

Optional parameters

classWeights

Optional

Type: IntegerMapping

Weight of the samples of each class. This parameter is useful to compensate for unbalanced training sets. The default weight is 1.

nGrams

Optional

Type: Integer

Maximum size of n-gram features (minimum is 2). If not set, n-gram features are not used.

algorithm

Default value: `PERCEPTRON`

Type: OpenNLPAlgorithm

Categorization algorithm. Must be one of:

naive-bayes, nb
generalized-iterative-scaling, gis
perceptron
quasi-newton, qn, l-bfgs, lbfgs, bfgs

bagOfWords

Default value: `true`

Type: Boolean

Whether to generate single-word (bag-of-words) features.

documents

Default value: `documents`

Type: Expression

Elements to classify. This expression is evaluated from the corpus.

form

Default value: `@form`

Type: Expression

Form of the token. This expression is evaluated as a string from the token.

iterations

Default value: `100`

Type: Integer

Number of learning iterations.

tokens

Default value: `sections.layer:words`

Type: Expression

Tokens of the elements to classify. This expression is evaluated as a list of elements from the element to classify.
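
The optional parameters can be combined with the mandatory ones in the same module element. The following is a sketch under stated assumptions: the values of categoryFeature and model are hypothetical, the values of nGrams, algorithm and iterations are arbitrary choices within the documented ranges, and documents, tokens and form are written out with their documented defaults only to make them visible.

```xml
<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
    <!-- Mandatory parameters (hypothetical feature name and file name). -->
    <categoryFeature>category</categoryFeature>
    <language>en</language>
    <model>categorizer.bin</model>

    <!-- Keep single-word features (the default) and add n-grams up to size 2. -->
    <bagOfWords>true</bagOfWords>
    <nGrams>2</nGrams>

    <!-- One of the documented algorithm names; the default is PERCEPTRON. -->
    <algorithm>naive-bayes</algorithm>
    <!-- Illustrative value; the default is 100. -->
    <iterations>200</iterations>

    <!-- Documented defaults, written explicitly. -->
    <documents>documents</documents>
    <tokens>sections.layer:words</tokens>
    <form>@form</form>
</opennlpdocumentcategorizertrain>
```

Setting bagOfWords to false while leaving nGrams unset would leave the classifier without any of the features described above, so at least one of the two is normally kept active.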
