OpenNLPDocumentCategorizerTrain
Synopsis
Train a document categorizer using the OpenNLP library.
This module is experimental.
Description
OpenNLPDocumentCategorizerTrain trains a document categorizer using the OpenNLP library. The documents and their classes are specified by documents and categoryFeature. The classifier uses the document content specified by tokens and form.
By default, the features are bag-of-words (single-word) features; these can be disabled with bagOfWords. Additionally, nGrams can be set to add n-gram features.
The classifier is stored in the file specified by model. This file can be used by OpenNLPDocumentCategorizer.
Snippet
<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
<categoryFeature></categoryFeature>
<language></language>
<model></model>
</opennlpdocumentcategorizertrain>
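As an illustration, the snippet above could be filled in as follows. This is only a sketch: the feature name category and the file name categorizer.bin are hypothetical placeholders, and en is assumed to be the language of the training documents.

```xml
<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
  <!-- hypothetical feature name: each document is assumed to carry its class in "category" -->
  <categoryFeature>category</categoryFeature>
  <!-- ISO 639-1 code; "en" assumes English documents -->
  <language>en</language>
  <!-- hypothetical output file for the trained classifier -->
  <model>categorizer.bin</model>
</opennlpdocumentcategorizertrain>
```

The file written to model is the one that OpenNLPDocumentCategorizer can later read.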
Mandatory parameters
categoryFeature
Feature from which the category is read.
language
Language of the documents (ISO 639-1 two-letter code).
model
File in which the classifier is stored.
Optional parameters
classWeights
Weight of the samples of each class. This parameter is useful to compensate for unbalanced training sets. The default weight is 1.
nGrams
Maximum size of n-gram features (minimum is 2). If not set, n-gram features are not used.
algorithm
Categorization algorithm. Must be one of:
- naive-bayes, nb
- generalized-iterative-scaling, gis
- perceptron
- quasi-newton, qn, l-bfgs, lbfgs, bfgs
bagOfWords
Whether to generate single-word (bag-of-words) features.
documents
Elements to classify. This expression is evaluated from the corpus.
form
Form of the token. This expression is evaluated as a string from the token.
iterations
Number of learning iterations.
tokens
Tokens of the elements to classify. This expression is evaluated as a list of elements from the element to classify.
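The optional parameters can be combined with the mandatory ones, for instance to train on n-grams only. The following is a sketch under assumptions: the values for categoryFeature and model are hypothetical placeholders, the iteration count is arbitrary, and the boolean and integer values are assumed to be written as plain text inside the parameter elements.

```xml
<opennlpdocumentcategorizertrain class="OpenNLPDocumentCategorizerTrain">
  <!-- mandatory parameters; feature name and file name are hypothetical -->
  <categoryFeature>category</categoryFeature>
  <language>en</language>
  <model>categorizer.bin</model>
  <!-- one of the algorithm names listed above -->
  <algorithm>naive-bayes</algorithm>
  <!-- assumed boolean syntax: disable single-word features -->
  <bagOfWords>false</bagOfWords>
  <!-- add n-gram features up to size 3 (minimum allowed is 2) -->
  <nGrams>3</nGrams>
  <!-- arbitrary illustrative value -->
  <iterations>100</iterations>
</opennlpdocumentcategorizertrain>
```

The documents, tokens, form and classWeights parameters are left out here, so whatever defaults the module defines for them would apply.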