AlvisNLP

corpus processing engine

FasttextClassifierTrain

Synopsis

FasttextClassifierTrain trains a document classifier using FastText.

This module is experimental.

Description

FasttextClassifierTrain evaluates documents as a list of elements and trains FastText to classify them. The category of each document is specified by classFeature. The attributes used to discriminate classes are specified by attributes.

modelFile specifies where to write the result: the classification model receives the .bin extension, and the word vectors receive the .vec extension.

Snippet

<fasttextclassifiertrain class="FasttextClassifierTrain">
    <attributes></attributes>
    <classFeature></classFeature>
    <documents></documents>
    <fasttextExecutable></fasttextExecutable>
    <modelFile></modelFile>
</fasttextclassifiertrain>
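
As an illustration of how the mandatory parameters fit together, the following is a minimal sketch of a filled-in module. The feature name `category`, the executable path, and the model prefix are hypothetical values chosen for this example. The attributes element is left empty, as in the snippet above, because this page does not describe the FasttextAttribute syntax.

```xml
<fasttextclassifiertrain class="FasttextClassifierTrain">
    <!-- Left empty on purpose: the FasttextAttribute syntax is not described on this page -->
    <attributes></attributes>
    <!-- Hypothetical feature holding the category of each document -->
    <classFeature>category</classFeature>
    <!-- Expression evaluated as a list of elements; here, every document of the corpus -->
    <documents>documents</documents>
    <!-- Hypothetical installation path of the FastText executable -->
    <fasttextExecutable>/opt/fastText/fasttext</fasttextExecutable>
    <!-- Hypothetical prefix for the output files -->
    <modelFile>models/doc-classifier</modelFile>
</fasttextclassifiertrain>
```

With this prefix, and following the description above, the model would be written to models/doc-classifier.bin and the word vectors to models/doc-classifier.vec.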

Mandatory parameters

attributes

Mandatory

Type: FasttextAttribute[]

Attributes of each document. The set of attributes must be identical in training with FasttextClassifierTrain and in labeling with FasttextClassifierLabel.

classFeature

Mandatory

Type: String

Feature that contains the category of the document.

documents

Mandatory

Type: Expression

Documents used to train the classifier. This expression is evaluated as a list of elements from the corpus.

fasttextExecutable

Mandatory

Type: ExecutableFile

Path to the FastText executable (see the GitHub page for installation instructions).

modelFile

Mandatory

Type: OutputFile

Prefix for the classifier model and the word vector files.

Optional parameters

autotuneMetric

Optional

Type: String

Metric to optimize during autotuning.

buckets

Optional

Type: Integer

Number of buckets [2000000].

classWeights

Optional

Type: IntegerMapping

Weight to apply to documents of each category. The mapping keys are the different categories, the values are weights. The default weight is 1.

commandlineOptions

Optional

Additional command-line options passed to FastText.

epochs

Optional

Type: Integer

Number of epochs [5].

learningRate

Optional

Type: Double

Learning rate [0.1].

lossFunction

Optional

Type: FasttextLossFunction

Loss function [softmax].

maxCharGrams

Optional

Type: Integer

Max length of char ngram [0].

minCharGrams

Optional

Type: Integer

Min length of char ngram [0].

pretrainedVectors

Optional

Type: InputFile

Pre-trained word vectors. Pre-trained vectors are publicly available on the FastText site.

validationAttributes

Optional

Type: FasttextAttribute[]

Attributes of validation documents. Defaults to the value of attributes.

validationDocuments

Optional

Type: Expression

Validation documents used for autotuning.

wordGrams

Optional

Type: Integer

Max length of word ngram [1].

wordVectorSize

Optional

Type: Integer

Size of word vectors [100].
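
To show how the optional hyperparameters are set alongside the mandatory ones, here is a hedged sketch. The numeric values are illustrative rather than recommended settings, and the feature name and paths are hypothetical.

```xml
<fasttextclassifiertrain class="FasttextClassifierTrain">
    <!-- Left empty on purpose: the FasttextAttribute syntax is not described on this page -->
    <attributes></attributes>
    <classFeature>category</classFeature>
    <documents>documents</documents>
    <fasttextExecutable>/opt/fastText/fasttext</fasttextExecutable>
    <modelFile>models/doc-classifier</modelFile>
    <!-- Optional hyperparameters; any that are omitted keep the defaults shown in brackets -->
    <epochs>25</epochs>
    <learningRate>0.5</learningRate>
    <wordGrams>2</wordGrams>
    <wordVectorSize>100</wordVectorSize>
</fasttextclassifiertrain>
```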

autotune

Default value: `false`

Type: Boolean

Whether to autotune hyperparameters that are not set. If true, validationDocuments must be set.

autotuneDuration

Default value: `300`

Type: Integer

Duration of autotune in seconds.
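
The autotune parameters can be combined as in the sketch below. It assumes that documents carry a hypothetical feature named `set` that separates training documents from validation documents; the feature name, paths, and duration are illustrative.

```xml
<fasttextclassifiertrain class="FasttextClassifierTrain">
    <!-- Left empty on purpose: the FasttextAttribute syntax is not described on this page -->
    <attributes></attributes>
    <classFeature>category</classFeature>
    <!-- Hypothetical split: train on documents whose "set" feature is "train" -->
    <documents>documents[@set == "train"]</documents>
    <fasttextExecutable>/opt/fastText/fasttext</fasttextExecutable>
    <modelFile>models/doc-classifier</modelFile>
    <!-- Autotune hyperparameters that are not set explicitly -->
    <autotune>true</autotune>
    <!-- Ten minutes instead of the default 300 seconds -->
    <autotuneDuration>600</autotuneDuration>
    <!-- Required when autotune is true -->
    <validationDocuments>documents[@set == "dev"]</validationDocuments>
</fasttextclassifiertrain>
```

Since validationAttributes is not set here, it takes the same value as attributes.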

learningRateUpdateRate

Default value: `100`

Type: Integer

Change the rate of updates for the learning rate [100].

minCount

Default value: `1`

Type: Integer

Minimal number of word occurrences [1].

minCountLabel

Default value: `0`

Type: Integer

Minimal number of label occurrences [0].

negativeSampling

Default value: `5`

Type: Integer

Number of negatives sampled [5].

samplingThreshold

Default value: `1.0E-4`

Type: Double

Sampling threshold [0.0001].

threads

Default value: `12`

Type: Integer

Number of threads.

windowSize

Default value: `5`

Type: Integer

Size of the context window [5].

Deprecated parameters