AlvisNLP

corpus processing engine

Word2Vec

Synopsis

Computes word embeddings using the CONTES/Gensim implementation.

This module is experimental.

Description

Computes word embeddings using the CONTES/Gensim implementation.

Snippet

<word2vec class="Word2Vec">
    <contesDir></contesDir>
    <python3Executable></python3Executable>
    <workers></workers>
</word2vec>

Mandatory parameters

contesDir

Mandatory

Root directory of CONTES.

python3Executable

Mandatory

Path to the Python 3 executable.

workers

Mandatory
Type: Integer

Number of worker threads used to train the model. More threads speed up training on multicore machines.
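
As an illustration, the three mandatory parameters could be filled in as sketched below. The directory, the executable path, and the thread count are hypothetical values chosen for the example, not defaults of the module; adjust them to the local installation.

```xml
<word2vec class="Word2Vec">
    <!-- hypothetical location of a local CONTES checkout -->
    <contesDir>/opt/contes</contesDir>
    <!-- hypothetical path to the Python 3 interpreter -->
    <python3Executable>/usr/bin/python3</python3Executable>
    <!-- example value; pick a count that suits the number of cores -->
    <workers>4</workers>
</word2vec>
```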

Optional parameters

additionalArguments

Optional
Type: String[]

UNDOCUMENTED

jsonFile

Optional
Type: OutputFile

File to which embeddings are written as a JSON object.

modelFile

Optional
Type: OutputFile

UNDOCUMENTED

txtFile

Optional
Type: OutputFile

File to which embeddings are written as a table.

vectorFeature

Optional
Type: String

Name of the feature in which to store the embedding of each token. If this parameter is not set, embeddings are not stored in any feature.
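
A minimal sketch of how the output parameters jsonFile, txtFile, and vectorFeature might be combined in one module declaration. The file names and the feature name `embedding` are hypothetical, as are the values of the mandatory parameters; any subset of the three output parameters can presumably be set, since each is optional.

```xml
<word2vec class="Word2Vec">
    <!-- mandatory parameters, with hypothetical values -->
    <contesDir>/opt/contes</contesDir>
    <python3Executable>/usr/bin/python3</python3Executable>
    <workers>4</workers>
    <!-- hypothetical output file for the JSON object -->
    <jsonFile>embeddings.json</jsonFile>
    <!-- hypothetical output file for the table -->
    <txtFile>embeddings.txt</txtFile>
    <!-- hypothetical feature name; each token receives its embedding here -->
    <vectorFeature>embedding</vectorFeature>
</word2vec>
```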

documentFilter

Default value: `true`
Type: Expression

Process only documents that satisfy this expression.

formFeature

Default value: `form`
Type: String

Feature to use as word form.

minCount

Default value: `0`
Type: Integer

UNDOCUMENTED

sectionFilter

Default value: `true and layer:sentences and layer:words`
Type: Expression

Process only sections that satisfy this expression.

sentenceLayer

Default value: `sentences`
Type: String

Name of the layer containing sentence annotations.

tokenLayer

Default value: `words`
Type: String

Name of the layer containing token annotations.

vectorSize

Default value: `200`
Type: Integer

The dimensionality of the feature vectors. Often effective between 100 and 300.

windowSize

Default value: `2`
Type: Integer

The maximum distance between the current and predicted word within a sentence.
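
The parameters with default values only need to appear in a plan when a default is overridden. The sketch below overrides three of them; the chosen values are illustrative, and the `lemma` feature is an assumption about what an earlier module in the plan has set on the tokens.

```xml
<word2vec class="Word2Vec">
    <!-- mandatory parameters, with hypothetical values -->
    <contesDir>/opt/contes</contesDir>
    <python3Executable>/usr/bin/python3</python3Executable>
    <workers>4</workers>
    <!-- assumes tokens carry a "lemma" feature; the default is "form" -->
    <formFeature>lemma</formFeature>
    <!-- smaller vectors than the default of 200 -->
    <vectorSize>100</vectorSize>
    <!-- wider context than the default of 2 -->
    <windowSize>5</windowSize>
</word2vec>
```

Parameters left out of the declaration, such as sentenceLayer and tokenLayer here, keep the defaults listed above.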

Deprecated parameters

formFeatureName

Deprecated
Type: String

Deprecated alias for formFeature.

sentenceLayerName

Deprecated
Type: String

Deprecated alias for sentenceLayer.

tokenLayerName

Deprecated
Type: String

Deprecated alias for tokenLayer.

vectorFeatureName

Deprecated
Type: String

Deprecated alias for vectorFeature.
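
A plan that still uses the deprecated names can be updated by renaming each element to its current counterpart. The sketch below shows the current names only; the layer and form values are the documented defaults, while the feature name `embedding` and the mandatory parameter values are hypothetical.

```xml
<word2vec class="Word2Vec">
    <!-- mandatory parameters, with hypothetical values -->
    <contesDir>/opt/contes</contesDir>
    <python3Executable>/usr/bin/python3</python3Executable>
    <workers>4</workers>
    <!-- replaces formFeatureName -->
    <formFeature>form</formFeature>
    <!-- replaces sentenceLayerName -->
    <sentenceLayer>sentences</sentenceLayer>
    <!-- replaces tokenLayerName -->
    <tokenLayer>words</tokenLayer>
    <!-- replaces vectorFeatureName; "embedding" is a hypothetical feature name -->
    <vectorFeature>embedding</vectorFeature>
</word2vec>
```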