AlvisNLP

corpus processing engine

LinguaLID

Synopsis

Identifies the language of a text using Lingua.

This module is experimental.

Description

LinguaLID evaluates target as a list of elements, then evaluates form for each one as a string. The language of evaluated content is predicted using the Lingua library.

The predicted language is stored in the feature specified by languageFeature as an ISO 639-1 two-letter code. Optionally, the confidence score is stored in the feature specified by languageConfidenceFeature.

More than one prediction may be stored if languageCandidates is set to a number above 1. The last language value has the highest confidence. Low-confidence predictions can be excluded by setting confidenceThreshold.

The set of predicted languages can be restricted with includeLanguages.
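
As an illustration of how these parameters combine, the sketch below keeps up to two candidate languages among three and discards predictions whose confidence is below 0.5. The language list, the threshold, the feature name `language-confidence`, and the comma-separated form of the array value are assumptions chosen for this example; they are not taken from the module reference.

```xml
<lingualid class="LinguaLID">
  <!-- Assumed array syntax: comma-separated ISO 639-1 codes -->
  <includeLanguages>en,fr,de</includeLanguages>
  <!-- Store up to two predictions per element -->
  <languageCandidates>2</languageCandidates>
  <!-- Example threshold: exclude predictions with confidence below 0.5 -->
  <confidenceThreshold>0.5</confidenceThreshold>
  <!-- Hypothetical feature name for the confidence score -->
  <languageConfidenceFeature>language-confidence</languageConfidenceFeature>
</lingualid>
```

With this configuration, languageFeature keeps its default value, so predictions would be stored in the `language` feature of each section.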

Snippet

<lingualid class="LinguaLID">
</lingualid>

Mandatory parameters

Optional parameters

includeLanguages

Optional
Type: Language[]

Languages to consider in the prediction. Languages can be specified using ISO 639-1 two-letter codes, ISO 639-3 three-letter codes, or full language names.

languageConfidenceFeature

Optional
Type: String

Feature in which to store the prediction confidence score.

confidenceThreshold

Default value: `0.0`
Type: Double

Minimum confidence score. Predictions with a lower confidence are excluded.

form

Default value: `contents`
Type: Expression

String content of the target (section contents by default).

languageCandidates

Default value: `1`
Type: Integer

Number of languages to predict.

languageFeature

Default value: `language`
Type: String

Feature where to store the predicted language.

target

Default value: `documents.sections`
Type: Expression

Elements whose language is predicted (document sections by default).
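
The following sketch shows one way target might be overridden so that only some sections are processed. It assumes the corpus contains sections named `title` (a hypothetical name) and that the expression language accepts the `sections:NAME` selector; the feature name `title-language` is also made up for this example. The form parameter is left at its default, so the section contents are used.

```xml
<lingualid class="LinguaLID">
  <!-- Hypothetical: restrict prediction to sections named "title" -->
  <target>documents.sections:title</target>
  <!-- Hypothetical feature name for the predicted language -->
  <languageFeature>title-language</languageFeature>
</lingualid>
```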

Deprecated parameters