AlvisNLP

corpus processing engine

FasttextClassifierTrain

Synopsis

FasttextClassifierTrain trains a document classifier using FastText.

This module is experimental.

Description

FasttextClassifierTrain evaluates documents as a list of elements and trains FastText to classify them. The category of each document is read from the feature named by classFeature. The attributes used to discriminate between classes are specified by attributes.

modelFile specifies where to write the result: the classification model receives the .bin extension, and the word vectors receive the .vec extension.

Snippet

<fasttextclassifiertrain class="FasttextClassifierTrain">
    <attributes></attributes>
    <classFeature></classFeature>
    <documents></documents>
    <fasttextExecutable></fasttextExecutable>
    <modelFile></modelFile>
</fasttextclassifiertrain>
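
As an illustration, the snippet could be filled in as follows. This is a minimal sketch, not a tested plan: the feature name category, both paths, and the documents expression are assumptions chosen for the example. The content of attributes is left empty because its syntax is not described on this page.

```xml
<fasttextclassifiertrain class="FasttextClassifierTrain">
    <!-- left empty here: use the same attribute set as with FasttextClassifierLabel -->
    <attributes></attributes>
    <!-- hypothetical feature holding the category of each document -->
    <classFeature>category</classFeature>
    <!-- assumed expression selecting every document of the corpus -->
    <documents>documents</documents>
    <!-- hypothetical location of the FastText binary -->
    <fasttextExecutable>/usr/local/bin/fasttext</fasttextExecutable>
    <!-- prefix: the model would be written with .bin, the word vectors with .vec -->
    <modelFile>models/my-classifier</modelFile>
</fasttextclassifiertrain>
```

Because modelFile is a prefix, it is given without an extension.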

Mandatory parameters

attributes

Mandatory

Attributes of each document. The set of attributes must be identical in training with FasttextClassifierTrain and in labeling with FasttextClassifierLabel.

classFeature

Mandatory
Type: String

Feature that contains the category of the document.

documents

Mandatory
Type: Expression

Documents to classify. This expression is evaluated as a list of elements from the corpus.

fasttextExecutable

Mandatory

Path to the FastText executable (see the GitHub page for installation instructions).

modelFile

Mandatory
Type: OutputFile

Prefix for the classifier model and the word vector files.

Optional parameters

autotuneMetric

Optional
Type: String

UNDOCUMENTED

buckets

Optional
Type: Integer

Number of buckets [2000000].

classWeights

Optional

Weight to apply to documents of each category. The mapping keys are the different categories, the values are weights. The default weight is 1.

commandlineOptions

Optional
Type: String[]

Additional command-line options passed to FastText.

epochs

Optional
Type: Integer

Number of epochs [5].

learningRate

Optional
Type: Double

Learning rate [0.1].

lossFunction

Optional

Loss function [softmax].

maxCharGrams

Optional
Type: Integer

Max length of char ngram [0].

minCharGrams

Optional
Type: Integer

Min length of char ngram [0].

pretrainedVectors

Optional
Type: InputFile

Pre-trained word vectors. Pre-trained vectors are publicly available on the FastText site.

validationAttributes

Optional

Attributes of validation documents. Defaults to the value of attributes.

validationDocuments

Optional
Type: Expression

Validation documents used for autotuning.

wordGrams

Optional
Type: Integer

Max length of word ngram [1].

wordVectorSize

Optional
Type: Integer

Size of word vectors [100].

autotune

Default value: `false`
Type: Boolean

Whether to autotune hyperparameters that are not set. If true, validationDocuments must be set.

autotuneDuration

Default value: `300`
Type: Integer

Duration of autotune in seconds.
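
A possible autotuning setup is sketched below under stated assumptions. Documents are presumed to carry a hypothetical set feature with the values train and dev, and the filter syntax used in the two expressions should be checked against the AlvisNLP expression reference. The duration of 600 seconds is arbitrary.

```xml
<fasttextclassifiertrain class="FasttextClassifierTrain">
    <!-- other mandatory parameters omitted for brevity -->
    <!-- assumed filter on a hypothetical "set" feature -->
    <documents>documents[@set == "train"]</documents>
    <validationDocuments>documents[@set == "dev"]</validationDocuments>
    <!-- autotune requires validationDocuments to be set -->
    <autotune>true</autotune>
    <!-- seconds; the default is 300 -->
    <autotuneDuration>600</autotuneDuration>
</fasttextclassifiertrain>
```

Since autotuning only applies to hyperparameters that are not set, any hyperparameter given explicitly in the plan is expected to keep its value.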

learningRateUpdateRate

Default value: `100`
Type: Integer

Change the rate of updates for the learning rate [100].

minCount

Default value: `1`
Type: Integer

Minimal number of word occurrences [1].

minCountLabel

Default value: `0`
Type: Integer

Minimal number of label occurrences [0].

negativeSampling

Default value: `5`
Type: Integer

Number of negatives sampled [5].

samplingThreshold

Default value: `1.0E-4`
Type: Double

Sampling threshold [0.0001].

threads

Default value: `12`
Type: Integer

Number of threads.

windowSize

Default value: `5`
Type: Integer

Size of the context window [5].
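
Optional parameters are presumably set as child elements in the same way as the mandatory ones in the snippet. The fragment below is a sketch with arbitrary values, shown only to illustrate the form. These values are not recommendations, and the mandatory parameters are omitted.

```xml
<fasttextclassifiertrain class="FasttextClassifierTrain">
    <!-- mandatory parameters omitted for brevity -->
    <!-- illustrative values only -->
    <epochs>25</epochs>
    <learningRate>0.5</learningRate>
    <wordGrams>2</wordGrams>
    <wordVectorSize>50</wordVectorSize>
    <threads>4</threads>
</fasttextclassifiertrain>
```

Parameters left out of the plan keep the defaults listed above.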

Deprecated parameters