AlvisNLP

corpus processing engine

Module reference

List of all modules

Alphabetical list of all available modules

Readers

Reader modules read files or streams as documents and sections.

| Module class | Formats | Source parameters | Comments |
| --- | --- | --- | --- |
| AlvisAEReader | AlvisAE Database | url, schema, username, password, campaignId | Also creates annotations and tuples |
| BioNLPSTReader | BioNLP-ST challenge | textDir, a1Dir, a2Dir | Also creates annotations and tuples |
| GeniaJSONReader | GENIA JSON | source | Also creates annotations |
| I2B2Reader | I2B2 challenge | textDir, conceptsDir, assertionsDir, relationsDir | Also creates annotations |
| LLLReader | LLL challenge | source | Also creates annotations |
| OBOReader | OBO | oboFiles | Each term as a document, name and synonyms as sections |
| PESVReader | PESV export | docStream, entitiesStream | Also creates annotations |
| PubAnnotationReader | PubAnnotation JSON format | source | Also creates annotations and tuples |
| PubTatorReader | PubTator | sourcePath | Also creates annotations |
| SQLImport | Relational database (RDBMS) | url, schema, username, password, query | |
| TabularReader | Tab-separated text | source | |
| TagTogReader | TagTog anndoc | zipFile | |
| TextFileReader | Text | sourcePath | |
| TikaReader | DOC, DOCX, PDF | source | Uses Apache Tika |
| TokenizedReader | One line per token | source | |
| TreeTaggerReader | tree-tagger | sourcePath | Also creates words, POS-tags and lemmas |
| WebOfKnowledgeReader | Web of Knowledge | source | |
| XMLReader | XML, HTML | sourcePath | Requires an XSLT stylesheet |
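
As a rough illustration of how the source parameters listed above are used, a reader is declared in a plan as an element whose class attribute names the module class and whose child elements set the parameters. The sketch below uses BioNLPSTReader with its three listed parameters. The element name read and the directory paths are placeholders chosen for illustration, and the reader may accept further parameters not shown on this page.

```xml
<!-- Sketch only: "read" is an arbitrary module name and all paths are hypothetical. -->
<read class="BioNLPSTReader">
  <!-- Parameter names come from the readers table; presumably these are the
       directories holding the .txt, .a1 and .a2 files of the challenge data. -->
  <textDir>corpus/train</textDir>
  <a1Dir>corpus/train</a1Dir>
  <a2Dir>corpus/train</a2Dir>
</read>
```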

Stylesheets for XMLReader

The AlvisNLP distribution contains pre-defined stylesheets.

| Stylesheet location | Schema |
| --- | --- |
| res://XMLReader/endnote2alvisnlp.xslt | EndNote |
| res://XMLReader/gene-train2alvisnlp.xslt | gene-train |
| res://XMLReader/html2alvisnlp.xslt | HTML |
| res://XMLReader/pmc2alvisnlp.xslt | PubMed Central OA |
| res://XMLReader/prodINRA2alvisnlp.xslt | ProdINRA |
| res://XMLReader/pubmed2alvisnlp.xslt | PubMed |
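
A minimal sketch of an XMLReader module that uses one of these stylesheets might look like the following. The sourcePath parameter comes from the readers table. The parameter name xslTransform, the module name read and the input path are assumptions that should be checked against the XMLReader reference page.

```xml
<!-- Sketch only: "xslTransform" is an assumed parameter name, not confirmed on this page. -->
<read class="XMLReader">
  <!-- Hypothetical location of a PubMed XML export. -->
  <sourcePath>corpus/pubmed-export.xml</sourcePath>
  <!-- One of the pre-defined stylesheets listed above. -->
  <xslTransform>res://XMLReader/pubmed2alvisnlp.xslt</xslTransform>
</read>
```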

Multi-purpose reader

The AlvisNLP distribution ships with a plan that can read documents in various formats:

<read href="res://reader.plan">
  <select>...</select>
  <source>...</source>
</read>

The source parameter is the location of the document(s); its type and conversion are the same as for SourceStream.

The select parameter is the format of the documents. It may take one of the following values:

| select | Format | Equivalent module class |
| --- | --- | --- |
| lll | LLL challenge | LLLReader |
| pubtator | PubTator | PubTatorReader |
| text | Text | TextFileReader |
| pdf | PDF | TikaReader |
| doc | DOC, DOCX | TikaReader |
| tree-tagger | tree-tagger | TreeTaggerReader |
| wok | Web of Knowledge | WebOfKnowledgeReader |
| endnote | EndNote | XMLReader |
| html | HTML | XMLReader |
| prod-inra | ProdINRA | XMLReader |
| pubmed | PubMed | XMLReader |
| pmc | PubMed Central OA | XMLReader |
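
For instance, reading a PubMed export with the multi-purpose reader could be written as below. The select value is one of those listed above; the file path is a hypothetical placeholder.

```xml
<read href="res://reader.plan">
  <!-- Format of the input, taken from the list of select values. -->
  <select>pubmed</select>
  <!-- Hypothetical document location. -->
  <source>corpus/pubmed-export.xml</source>
</read>
```

According to the list of select values, this should behave like XMLReader applied to PubMed input.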

Export

Export modules translate the contents of the data structure and write them to a file or a set of files.

| Module class | Output parameter | Format | Comments |
| --- | --- | --- | --- |
| AggregateValues | outFile | Tab-separated text | |
| AlvisAEWriter | outDir | AlvisAE JSON | Uses the json-simple library |
| AlvisIRIndexer | indexDir | AlvisIR index | Uses the Lucene and alvisir-core libraries |
| CompareElements | outFile | Text | |
| CompareFeatures | outFile | Text | |
| JsonExport | outDir, fileName | JSON | Uses the json-simple library |
| LayerComparator | outFile | Text | |
| PubAnnotationExport | outFile | PubAnnotation JSON | Uses the json-simple library |
| QuickHTML | outDir | HTML | |
| RDFExport | outDir, fileName | RDF | Uses the Jena library |
| RelpWriter | outFile | Relp | |
| TabularExport | outDir, fileName | Tab-separated text | |
| WhatsWrongExport | outFile | WhatsWrongWithMyNLP | |
| XMLWriter | outDir, fileName | XML | Requires an XSLT stylesheet |

Projectors

Projector modules match entries from a lexicon on the section contents or annotations of a layer. Each projector class accepts a different format for the lexicon.

| Module class | Lexicon parameter | Lexicon format | Comments |
| --- | --- | --- | --- |
| ElementProjector | entries | AlvisNLP data structure | |
| OBOProjector | oboFiles | OBO | Uses the OBO library |
| RDFProjector | source | RDF (OWL, SKOS) | Uses the Jena library |
| TabularProjector | dictFile | Tab-separated text | |
| TomapProjector | yateaFile, tomapClassifier | YaTeA and ToMap | |
| TreeTaggerTermsProjector | termsFile | TreeTagger | |
| TyDIExportProjector | lemmaFile, synonymsFile, quasiSynonymsFile, acronymsFile, mergeFile, typographicVariationsFile | TyDI | |
| XLSProjector | xlsFile | Excel XLS or XLSX | Uses the POI library |
| YateaTermsProjector | yateaFile | YaTeA | |

Pattern-matching

Pattern-matching modules match user-defined patterns on section contents or annotations from a layer.

| Module class | Pattern type | Output |
| --- | --- | --- |
| Action | Expressions | Action expressions |
| CartesianProductTuples | Expressions | Tuples |
| PatternMatcher | Hearst-like patterns | Annotations and tuples |
| RegExp | Regular expressions | Annotations |
| MultiRegExp | Regular expressions | Annotations |

Mappers

Mapper modules associate elements with data from a dictionary file. Each mapper class accepts a different format for the dictionary.

| Module class | Dictionary parameter | Format |
| --- | --- | --- |
| ElementMapper | entries | AlvisNLP data structure |
| FileMapper | mappingFile | Tab-separated text |
| OBOMapper | oboFiles | OBO |

Machine-learning

This section presents the module classes that can be used to train classifiers and to classify elements.

| Training class | Prediction class | Target | Algorithm | Comments |
| --- | --- | --- | --- | --- |
| TEESTrain | TEESClassify | Tuples | SVM | |
| TomapTrain | TomapProjector | Annotations | ToMap | |
| WapitiTrain | WapitiLabel | Annotations | CRF | |
| WekaTrain | WekaPredict | Any | Various | Uses the Weka library |
| NA | REBERTPredict | Tuples | BERT | Uses re-bert |
| FasttextClassifierTrain | FasttextClassifierLabel | Any | Word vectors | Uses Fasttext |
| OpenNLPDocumentCategorizerTrain | OpenNLPDocumentCategorizer | Any | ME | Uses the OpenNLP library |
| ContesTrain | ContesPredict | Annotations | Word Embedding, LR | Uses CONTES |

Additionally, the module class WekaSelectAttributes uses the Weka library for attribute selection.

ContesTrain and ContesPredict require word embeddings that can be generated with Word2Vec.

NER

Named entity recognition modules.

| Module class | NE types |
| --- | --- |
| Chemspot | Chemical |
| GeniaTagger | Gene, Protein |
| Species | Taxon |
| StanfordNER | Person, Location, Organization |
| Stanza | Person, Location, Organization, Number, Currency |
| StanfordCoreNLP | 24 types |

Segmentation

| Module class | Segments |
| --- | --- |
| SeSMig | Sentences |
| WoSMig | Words |
| Stanza | Tokens, sentences |
| StanfordCoreNLP | Tokens, sentences |

Word and sentence splitting

The AlvisNLP distribution ships with a ready-made complete word and sentence splitter plan that can be imported like this:

<seg href="res://segmentation.plan"/>

This plan combines several modules that correctly handle Latin abbreviations, line-break hyphens, numbers, and dates. To force entities to be treated as single tokens, annotate them in the layer named rigid-entities; the plan assumes that is where they are.
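
To show where the import fits, here is a minimal sketch of a complete plan that reads text files and then segments them. The alvisnlp-plan root element, its id, the module name read and the corpus path are assumptions for illustration; the TextFileReader class, its sourcePath parameter and the segmentation import are the ones given on this page.

```xml
<!-- Sketch only: root element, plan id and paths are assumed, not taken from this page. -->
<alvisnlp-plan id="segmentation-example">
  <!-- Read plain-text documents from a hypothetical location. -->
  <read class="TextFileReader">
    <sourcePath>corpus/</sourcePath>
  </read>

  <!-- A module creating annotations in the rigid-entities layer
       would go here, before segmentation. -->

  <!-- Import the ready-made word and sentence splitter. -->
  <seg href="res://segmentation.plan"/>
</alvisnlp-plan>
```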

Linguistic processing

| Module class | Function |
| --- | --- |
| LinguaLID | Language identification |
| Ab3P | Abbreviation recognition |
| CCGParser | Dependency parsing |
| CCGPosTagger | POS-tagging |
| EnjuParser | Dependency parsing |
| StanfordParser | Dependency parsing |
| GeniaTagger | POS-tagging, lemmatization |
| PorterStemmer | Stemming |
| Stanza | Tokenization, POS-tagging, lemmatization, dependency parsing |
| StanfordCoreNLP | Tokenization, POS-tagging, lemmatization, dependency parsing |
| TreeTagger | POS-tagging, lemmatization |
| YateaExtractor | Term extraction |

Miscellaneous

| Module class | Function |
| --- | --- |
| Assert | Check assertions on selected elements |
| ClearLayers | Empty layers of all annotations |
| HttpServer | Halt processing and allow browsing the data structure |
| InsertContents | Clone sections and insert contents |
| KeywordsSelector | Select keywords using the specified metric |
| MergeLayers | Copy annotations from several layers to one target layer |
| MergeSections | Merge several sections of each document into a single one |
| NGrams | Create n-grams of annotations |
| PythonScript | Run a Python script |
| RemoveContents | Clone sections and crop contents |
| RemoveEquivalent | Deduplicate elements using custom equality |
| RemoveOverlaps | Remove overlapping annotations in a layer |
| SetFeature | Set a feature on selected elements |
| Shell | Enter interactive mode |
| SplitOverlaps | Split overlapping annotations |
| SplitSections | Split sections |