FasttextClassifierTrain
Synopsis
FasttextClassifierTrain trains a document classifier using FastText.
This module is experimental.
Description
FasttextClassifierTrain evaluates documents as a list of elements and trains FastText to classify them. The category of each document is specified by classFeature. The attributes used to discriminate classes are specified by attributes.
modelFile specifies where to write the result: the classification model receives the .bin extension, and the word vectors .vec.
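For orientation, fastText's supervised mode reads one training example per line, with the category marked by a `__label__` prefix. How FasttextClassifierTrain serializes documents and attributes internally is not described on this page, so the helper below (`to_fasttext_line`, a hypothetical name) is only a sketch of that input format, assuming a single text attribute per document.

```python
def to_fasttext_line(category: str, text: str, label_prefix: str = "__label__") -> str:
    """Format one document as a line of fastText supervised training input.

    Assumption: one text attribute per document; the real module may
    combine several attributes in a different way.
    """
    # fastText reads one example per line, so collapse all whitespace
    # (including newlines) into single spaces.
    flat_text = " ".join(text.split())
    # A label is a single token, so whitespace inside it is replaced.
    flat_label = "_".join(category.split())
    return f"{label_prefix}{flat_label} {flat_text}"


if __name__ == "__main__":
    print(to_fasttext_line("sports", "The match\nwas great"))
```

In this sketch, the value of classFeature would play the role of `category`, and the attribute value the role of `text`.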
Snippet
<fasttextclassifiertrain class="FasttextClassifierTrain">
    <attributes></attributes>
    <classFeature></classFeature>
    <documents></documents>
    <fasttextExecutable></fasttextExecutable>
    <modelFile></modelFile>
</fasttextclassifiertrain>
Mandatory parameters
attributes
Attributes of each document. The same set of attributes must be used for training with FasttextClassifierTrain and for labeling with FasttextClassifierLabel.
classFeature
Feature that contains the category of the document.
documents
Documents to classify. This expression is evaluated as a list of elements from the corpus.
fasttextExecutable
Path to the FastText executable (see the GitHub page for installation instructions).
modelFile
Prefix for the classifier model and the word vector files.
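Because modelFile is a prefix rather than a complete file name, a downstream step has to append the extension itself. The following is a minimal sketch of that derivation; `model_outputs` is a hypothetical helper, not part of the module.

```python
from pathlib import Path
from typing import Tuple


def model_outputs(model_file: str) -> Tuple[Path, Path]:
    """Return the (.bin model, .vec word vectors) paths for a modelFile prefix."""
    # The extension is appended to the prefix as a whole, so a prefix that
    # already contains a dot (e.g. "topics.v2") keeps it.
    return Path(model_file + ".bin"), Path(model_file + ".vec")


if __name__ == "__main__":
    classifier_path, vectors_path = model_outputs("models/topics")
    print(classifier_path, vectors_path)
```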
Optional parameters
autotuneMetric
UNDOCUMENTED
buckets
Number of buckets [2000000].
classWeights
Weight to apply to documents of each category. The mapping keys are the different categories, the values are weights. The default weight is 1.
commandlineOptions
Additional command-line options passed to FastText.
epochs
Number of epochs [5].
learningRate
Learning rate [0.1].
lossFunction
Loss function [softmax].
maxCharGrams
Max length of char ngram [0].
minCharGrams
Min length of char ngram [0].
pretrainedVectors
Pre-trained word vectors. Pre-trained vectors are publicly available on the FastText site.
validationAttributes
Attributes of validation documents. Defaults to the value of attributes.
validationDocuments
Validation documents used for autotuning.
wordGrams
Max length of word ngram [1].
wordVectorSize
Size of word vectors [100].
autotune
Whether to autotune hyperparameters that are not set. If true, validationDocuments must be set.
autotuneDuration
Duration of autotune in seconds.
learningRateUpdateRate
Change the rate of updates for the learning rate [100].
minCount
UNDOCUMENTED
minCountLabel
Minimal number of label occurrences [1].
negativeSampling
Number of negatives sampled [5].
samplingThreshold
Sampling threshold [0.0001].
threads
Number of threads.
windowSize
Size of the context window [5].
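Most of the optional parameters above have a counterpart among the options of the `fasttext supervised` command. The sketch below assembles such a command line. The flags themselves are fastText's own, but the correspondence with the module parameters (the `PARAM_TO_FLAG` table) and the `build_command` helper are assumptions made for illustration, not a description of how the module invokes FastText.

```python
from typing import Dict, List, Optional

# Assumed correspondence between module parameters and fastText flags.
PARAM_TO_FLAG: Dict[str, str] = {
    "buckets": "-bucket",
    "epochs": "-epoch",
    "learningRate": "-lr",
    "learningRateUpdateRate": "-lrUpdateRate",
    "lossFunction": "-loss",
    "maxCharGrams": "-maxn",
    "minCharGrams": "-minn",
    "minCount": "-minCount",
    "minCountLabel": "-minCountLabel",
    "negativeSampling": "-neg",
    "pretrainedVectors": "-pretrainedVectors",
    "samplingThreshold": "-t",
    "threads": "-thread",
    "windowSize": "-ws",
    "wordGrams": "-wordNgrams",
    "wordVectorSize": "-dim",
    "autotuneMetric": "-autotune-metric",
    "autotuneDuration": "-autotune-duration",
}


def build_command(
    executable: str,
    train_file: str,
    model_file: str,
    params: Dict[str, object],
    validation_file: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for `fasttext supervised`."""
    cmd = [executable, "supervised", "-input", train_file, "-output", model_file]
    # Only parameters that were explicitly set are passed on, so that
    # fastText applies its own defaults to the rest.
    for name, value in params.items():
        cmd += [PARAM_TO_FLAG[name], str(value)]
    # Autotuning is switched on in fastText by supplying a validation file.
    if validation_file is not None:
        cmd += ["-autotune-validation", validation_file]
    return cmd


if __name__ == "__main__":
    print(build_command("fasttext", "train.txt", "model", {"epochs": 5, "learningRate": 0.1}))
```

A parameter name missing from the table raises `KeyError`, which keeps a typo from being silently dropped. The resulting list can be passed to `subprocess.run` without shell quoting.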