Module reference

List of all modules

Alphabetical list of all available modules

Readers

Reader modules read files or streams as documents and sections.

Module class	Formats	Source parameters	Comments
AlvisAEReader	AlvisAE Database	`url`, `schema`, `username`, `password`, `campaignId`…	Also creates annotations and tuples
BioNLPSTReader	BioNLP-ST challenge	`textDir`, `a1Dir`, `a2Dir`	Also creates annotations and tuples
GeniaJSONReader	GENIA JSON	`source`	Also creates annotations
I2B2Reader	I2B2 challenge	`textDir`, `conceptsDir`, `assertionsDir`, `relationsDir`	Also creates annotations
LLLReader	LLL challenge	`source`	Also creates annotations
OBOReader	OBO	`oboFiles`	Each term as a document, name and synonyms as sections
PESVReader	PESV export	`docStream`, `entitiesStream`	Also creates annotations
PubAnnotationReader	PubAnnotation JSON format	`source`	Also creates annotations and tuples
PubTatorReader	PubTator	`sourcePath`	Also creates annotations
SQLImport	Relational database (RDBMS)	`url`, `schema`, `username`, `password`, `query`
TabularReader	Tab-separated text	`source`
TagTogReader	TagTog anndoc	`zipFile`
TextFileReader	Text	`sourcePath`
TikaReader	DOC, DOCX, PDF	`source`	Uses Apache Tika
TokenizedReader	one line per token	`source`
TreeTaggerReader	tree-tagger	`sourcePath`	Also creates words, POS-tags and lemmas
WebOfKnowledgeReader	Web of Knowledge	`source`
XMLReader	XML, HTML	`sourcePath`	Requires an XSLT stylesheet

Stylesheets for XMLReader

The AlvisNLP distribution contains pre-defined stylesheets.

Stylesheet location	Schema
`res://XMLReader/endnote2alvisnlp.xslt`	EndNote
`res://XMLReader/gene-train2alvisnlp.xslt`	gene-train
`res://XMLReader/html2alvisnlp.xslt`	HTML
`res://XMLReader/pmc2alvisnlp.xslt`	PubMed Central OA
`res://XMLReader/prodINRA2alvisnlp.xslt`	ProdINRA
`res://XMLReader/pubmed2alvisnlp.xslt`	PubMed
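
As an illustration, a reader step that uses one of these stylesheets might look like the sketch below. The `sourcePath` parameter comes from the table of readers. The `xslTransform` parameter name, the `class` attribute form, and the input path are assumptions made for this example and should be checked against the XMLReader documentation.

```xml
<!-- Sketch only: xslTransform is an assumed parameter name,
     corpus/pubmed.xml is a hypothetical input file -->
<read class="XMLReader">
  <sourcePath>corpus/pubmed.xml</sourcePath>
  <xslTransform>res://XMLReader/pubmed2alvisnlp.xslt</xslTransform>
</read>
```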

Multi-purpose reader

The AlvisNLP distribution ships with a plan that can read documents in various formats:

<read href="res://reader.plan">
  <select>...</select>
  <source>...</source>
</read>

The source parameter is the location of the document(s); its type and conversion rules are the same as for SourceStream.

The select parameter specifies the format of the documents. It may take one of the following values:

`select`	Format	Equivalent Module class
lll	LLL challenge	LLLReader
pubtator	PubTator	PubTatorReader
text	Text	TextFileReader
pdf	PDF	TikaReader
doc	DOC, DOCX	TikaReader
tree-tagger	tree-tagger	TreeTaggerReader
wok	Web of Knowledge	WebOfKnowledgeReader
endnote	EndNote	XMLReader
html	HTML	XMLReader
prod-inra	ProdINRA	XMLReader
pubmed	PubMed	XMLReader
pmc	PubMed Central OA	XMLReader
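
For instance, reading a PubMed XML export with the multi-purpose reader could be written as follows. The file path is hypothetical; `pubmed` is one of the `select` values listed above.

```xml
<read href="res://reader.plan">
  <!-- one of the select values from the table above -->
  <select>pubmed</select>
  <!-- hypothetical location of the input file -->
  <source>corpus/pubmed-export.xml</source>
</read>
```

According to the table, this selection is handled by XMLReader.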

Export

Export modules translate the contents of the data structure and write them to a file or a set of files.

Module class	Output parameter	Format	Comments
AggregateValues	`outFile`	Tab-separated text
AlvisAEWriter	`outDir`	AlvisAE JSON	Uses the json-simple library
AlvisIRIndexer	`indexDir`	AlvisIR index	Uses the Lucene and alvisir-core libraries
CompareElements	`outFile`	Text
CompareFeatures	`outFile`	Text
JsonExport	`outDir`, `fileName`	JSON	Uses the json-simple library
LayerComparator	`outFile`	Text
PubAnnotationExport	`outFile`	PubAnnotation JSON	Uses the json-simple library
QuickHTML	`outDir`	HTML
RDFExport	`outDir`, `fileName`	RDF	Uses the Jena library
RelpWriter	`outFile`	Relp
TabularExport	`outDir`, `fileName`	tab-separated text
WhatsWrongExport	`outFile`	WhatsWrongWithMyNLP
XMLWriter	`outDir`, `fileName`	XML	Requires an XSLT stylesheet

Projectors

Projector modules match entries from a lexicon on the section contents or annotations of a layer. Each projector class accepts a different format for the lexicon.

Module class	Lexicon parameter	Lexicon format	Comments
ElementProjector	`entries`	AlvisNLP data structure
OBOProjector	`oboFiles`	OBO	Uses the OBO library
RDFProjector	`source`	RDF (OWL, SKOS)	Uses the Jena library
TabularProjector	`dictFile`	tab-separated text
TomapProjector	`yateaFile`, `tomapClassifier`	YaTeA and ToMap
TreeTaggerTermsProjector	`termsFile`	TreeTagger
TyDIExportProjector	`lemmaFile`, `synonymsFile`, `quasiSynonymsFile`, `acronymsFile`, `mergeFile`, `typographicVariationsFile`	TyDI
XLSProjector	`xlsFile`	Excel XLS or XLSX	Uses the POI library
YateaTermsProjector	`yateaFile`	YaTeA

Pattern-matching

Pattern-matching modules match user-defined patterns on section contents or annotations from a layer.

Module class	Pattern type	Output
Action	Expressions	Action expressions
CartesianProductTuples	Expressions	Tuples
PatternMatcher	Hearst-like patterns	Annotations and tuples
RegExp	Regular expressions	Annotations
MultiRegExp	Regular expressions	Annotations

Mappers

Mapper modules attach data from a dictionary to elements. Each mapper class accepts a different format for the dictionary.

Module class	Dictionary parameter	Format
ElementMapper	`entries`	AlvisNLP data structure
FileMapper	`mappingFile`	tab-separated text
OBOMapper	`oboFiles`	OBO

Machine-learning

This section presents the module classes that can be used to train models and to classify elements.

Training class	Prediction class	Target	Algorithm	Comments
TEESTrain	TEESClassify	Tuples	SVM
TomapTrain	TomapProjector	Annotations	ToMap
WapitiTrain	WapitiLabel	Annotations	CRF
WekaTrain	WekaPredict	Any	Various	Uses the Weka library
NA	REBERTPredict	Tuples	BERT	Uses re-bert
FasttextClassifierTrain	FasttextClassifierLabel	Any	Word vectors	Uses Fasttext
OpenNLPDocumentCategorizerTrain	OpenNLPDocumentCategorizer	Any	ME	Uses the OpenNLP library
ContesTrain	ContesPredict	Annotations	Word Embedding, LR	Uses CONTES

Additionally, the module class WekaSelectAttributes uses the Weka library for attribute selection.

ContesTrain and ContesPredict require word embeddings that can be generated with Word2Vec.

NER

Named entity recognition modules.

Module class	NE types
Chemspot	Chemical
GeniaTagger	Gene, Protein
Species	Taxon
StanfordNER	Person, Location, Organization
Stanza	Person, Location, Organization, Number, Currency
StanfordCoreNLP	24 types

Segmentation

Module class	Segments
SeSMig	Sentences
WoSMig	Words
Stanza	Tokens, sentences
StanfordCoreNLP	Tokens, sentences

Word and sentence splitting

The AlvisNLP distribution ships with a ready-made complete word and sentence splitter plan that can be imported like this:

<seg href="res://segmentation.plan"/>

This plan combines several modules that correctly handle Latin abbreviations, line-break hyphens, numbers, and dates. If you want to force entities to be kept as tokens, the plan expects them to be annotations in the layer named rigid-entities.
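
A minimal sketch of how this import might sit in a plan, after a reader step. The enclosing `alvisnlp-plan` element, the plan identifier, the `class` attribute form, and the input directory are assumptions for illustration; only the `seg` import line is taken from the text above.

```xml
<!-- Sketch only: the plan id and corpus/ directory are hypothetical -->
<alvisnlp-plan id="read-and-segment">
  <!-- read plain text files; sourcePath is the TextFileReader source parameter -->
  <read class="TextFileReader">
    <sourcePath>corpus/</sourcePath>
  </read>

  <!-- any step that fills the rigid-entities layer would go here,
       before segmentation runs -->

  <!-- import the ready-made word and sentence splitter -->
  <seg href="res://segmentation.plan"/>
</alvisnlp-plan>
```

Placing the import after the reader, and after any step that creates rigid-entities annotations, means those annotations already exist when the splitter runs.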

Linguistic processing

Module class	Function
Ab3P	Abbreviation recognition
CCGParser	Dependency parsing
CCGPosTagger	POS-tagging
EnjuParser	Dependency parsing
StanfordParser	Dependency parsing
GeniaTagger	POS-tagging, lemmatization
LinguaLID	Language identification
PorterStemmer	Stemming
Stanza	Tokenization, POS-tagging, lemmatization, dependency parsing
StanfordCoreNLP	Tokenization, POS-tagging, lemmatization, dependency parsing
TreeTagger	POS-tagging, lemmatization
YateaExtractor	Term extraction

Miscellaneous

Module class	Function
Assert	Check assertions on selected elements
ClearLayers	Empty layers of all annotations
HttpServer	Halt processing and let the user browse the data structure
InsertContents	Clone sections and insert contents
KeywordsSelector	Select keywords using the specified metric
MergeLayers	Copy annotations from several layers to one target layer
MergeSections	Merge several sections of each document into a single one
NGrams	Create n-grams of annotations
PythonScript	Run a Python script
RemoveContents	Clone sections and crop contents
RemoveEquivalent	Deduplicate elements using custom equality
RemoveOverlaps	Remove overlapping annotations in a layer
SetFeature	Set a feature on selected elements
Shell	Enter interactive mode
SplitOverlaps	Split overlapping annotations
SplitSections	Split sections