Module reference
List of all modules
Alphabetical list of all available modules
Readers
Reader modules read files or streams as documents and sections.
Module class | Formats | Source parameters | Comments |
---|---|---|---|
AlvisAEReader | AlvisAE Database | url, schema, username, password, campaignId … | Also creates annotations and tuples |
BioNLPSTReader | BioNLP-ST challenge | textDir, a1Dir, a2Dir | Also creates annotations and tuples |
GeniaJSONReader | GENIA JSON | source | Also creates annotations |
I2B2Reader | I2B2 challenge | textDir, conceptsDir, assertionsDir, relationsDir | Also creates annotations |
LLLReader | LLL challenge | source | Also creates annotations |
OBOReader | OBO | oboFiles | Each term as a document, name and synonyms as sections |
PESVReader | PESV export | docStream, entitiesStream | Also creates annotations |
PubAnnotationReader | PubAnnotation JSON format | source | Also creates annotations and tuples |
PubTatorReader | PubTator | sourcePath | Also creates annotations |
SQLImport | Relational database (RDBMS) | url, schema, username, password, query | |
TabularReader | Tab-separated text | source | |
TagTogReader | TagTog anndoc | zipFile | |
TextFileReader | Text | sourcePath | |
TikaReader | DOC, DOCX, PDF | source | Uses Apache Tika |
TokenizedReader | One line per token | source | |
TreeTaggerReader | tree-tagger | sourcePath | Also creates words, POS-tags and lemmas |
WebOfKnowledgeReader | Web of Knowledge | source | |
XMLReader | XML, HTML | sourcePath | Requires an XSLT stylesheet |
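As a minimal sketch, a reader is declared in a plan as an element whose child elements set the source parameters listed in the table above. The example assumes the usual AlvisNLP plan syntax (an alvisnlp-plan root element and a class attribute naming the module class), which this page does not show; the plan id, the element name read, and the path are hypothetical.

```xml
<alvisnlp-plan id="read-example">
  <!-- TextFileReader takes its input location from sourcePath (see the Readers table) -->
  <!-- "corpus/" is a placeholder path, not a location shipped with AlvisNLP -->
  <read class="TextFileReader">
    <sourcePath>corpus/</sourcePath>
  </read>
</alvisnlp-plan>
```

Other readers follow the same pattern with their own source parameters; check each module's reference page for any additional mandatory parameters.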
Stylesheets for XMLReader
The AlvisNLP distribution contains pre-defined stylesheets.
Stylesheet location | Schema |
---|---|
res://XMLReader/endnote2alvisnlp.xslt | EndNote |
res://XMLReader/gene-train2alvisnlp.xslt | gene-train |
res://XMLReader/html2alvisnlp.xslt | HTML |
res://XMLReader/pmc2alvisnlp.xslt | PubMed Central OA |
res://XMLReader/prodINRA2alvisnlp.xslt | ProdINRA |
res://XMLReader/pubmed2alvisnlp.xslt | PubMed |
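The following sketch shows how one of these stylesheets might be passed to XMLReader. The sourcePath parameter and the stylesheet location come from the tables on this page; the parameter name xslTransform, the class attribute syntax, and the input path are assumptions that should be checked against the XMLReader reference page.

```xml
<!-- Fragment to place inside a plan -->
<read class="XMLReader">
  <!-- placeholder input file -->
  <sourcePath>corpus/pubmed-export.xml</sourcePath>
  <!-- assumed parameter name for the XSLT stylesheet -->
  <xslTransform>res://XMLReader/pubmed2alvisnlp.xslt</xslTransform>
</read>
```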
Multi-purpose reader
The AlvisNLP distribution ships with a plan that can read documents in various formats:
<read href="res://reader.plan">
<select>...</select>
<source>...</source>
</read>
The source parameter is the location of the document(s); its type and conversion rules are the same as for SourceStream.
The select parameter is the format of the documents.
It may take one of the following values:
select | Format | Equivalent Module class |
---|---|---|
lll | LLL challenge | LLLReader |
pubtator | PubTator | PubTatorReader |
text | Text | TextFileReader |
TikaReader | ||
doc | DOC, DOCX | TikaReader |
tree-tagger | tree-tagger | TreeTaggerReader |
wok | Web of Knowledge | WebOfKnowledgeReader |
endnote | EndNote | XMLReader |
html | HTML | XMLReader |
prod-inra | ProdINRA | XMLReader |
pubmed | PubMed | XMLReader |
pmc | PubMed Central OA | XMLReader |
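For illustration, here is the multi-purpose reader filled in with one of the select values from the table. The file path is a placeholder, not a resource shipped with the distribution.

```xml
<!-- Reads PubMed XML; per the table, this is equivalent to using XMLReader -->
<read href="res://reader.plan">
  <select>pubmed</select>
  <source>corpus/pubmed-export.xml</source>
</read>
```

Switching to another format should only require changing the select value, for instance wok for a Web of Knowledge export.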
Export
Export modules translate the contents of the data structure and write them to a file or a set of files.
Module class | Output parameter | Format | Comments |
---|---|---|---|
AggregateValues | outFile | Tab-separated text | |
AlvisAEWriter | outDir | AlvisAE JSON | Uses the json-simple library |
AlvisIRIndexer | indexDir | AlvisIR index | Uses the Lucene and alvisir-core libraries |
CompareElements | outFile | Text | |
CompareFeatures | outFile | Text | |
JsonExport | outDir, fileName | JSON | Uses the json-simple library |
LayerComparator | outFile | Text | |
PubAnnotationExport | outFile | PubAnnotation JSON | Uses the json-simple library |
QuickHTML | outDir | HTML | |
RDFExport | outDir, fileName | RDF | Uses the Jena library |
RelpWriter | outFile | Relp | |
TabularExport | outDir, fileName | Tab-separated text | |
WhatsWrongExport | outFile | WhatsWrongWithMyNLP | |
XMLWriter | outDir, fileName | XML | Requires an XSLT stylesheet |
Projectors
Projector modules match entries from a lexicon on the section contents or annotations of a layer. Each projector class accepts a different format for the lexicon.
Module class | Lexicon parameter | Lexicon format | Comments |
---|---|---|---|
ElementProjector | entries | AlvisNLP data structure | |
OBOProjector | oboFiles | OBO | Uses the OBO library |
RDFProjector | source | RDF (OWL, SKOS) | Uses the Jena library |
TabularProjector | dictFile | Tab-separated text | |
TomapProjector | yateaFile, tomapClassifier | YaTeA and ToMap | |
TreeTaggerTermsProjector | termsFile | TreeTagger | |
TyDIExportProjector | lemmaFile, synonymsFile, quasiSynonymsFile, acronymsFile, mergeFile, typographicVariationsFile | TyDI | |
XLSProjector | xlsFile | Excel XLS or XLSX | Uses the POI library |
YateaTermsProjector | yateaFile | YaTeA | |
Pattern-matching
Pattern-matching modules match user-defined patterns on section contents or on annotations from a layer.
Module class | Pattern type | Output |
---|---|---|
Action | Expressions | Action expressions |
CartesianProductTuples | Expressions | Tuples |
PatternMatcher | Hearst-like patterns | Annotations and tuples |
RegExp | Regular expressions | Annotations |
MultiRegExp | Regular expressions | Annotations |
Mappers
Mapper modules associate elements with data from a dictionary file. Each mapper class accepts a different format for the dictionary.
Module class | Dictionary parameter | Format |
---|---|---|
ElementMapper | entries | AlvisNLP data structure |
FileMapper | mappingFile | Tab-separated text |
OBOMapper | oboFiles | OBO |
Machine-learning
This section presents the module classes that can be used to train and classify elements.
Training class | Prediction class | Target | Algorithm | Comments |
---|---|---|---|---|
TEESTrain | TEESClassify | Tuples | SVM | |
TomapTrain | TomapProjector | Annotations | ToMap | |
WapitiTrain | WapitiLabel | Annotations | CRF | |
WekaTrain | WekaPredict | Any | Various | Uses the Weka library |
NA | REBERTPredict | Tuples | BERT | Uses re-bert |
FasttextClassifierTrain | FasttextClassifierLabel | Any | Word vectors | Uses Fasttext |
OpenNLPDocumentCategorizerTrain | OpenNLPDocumentCategorizer | Any | ME | Uses the OpenNLP library |
ContesTrain | ContesPredict | Annotations | Word Embedding, LR | Uses CONTES |
Additionally, the module class WekaSelectAttributes uses the Weka library for attribute selection.
ContesTrain and ContesPredict require word embeddings that can be generated with Word2Vec.
NER
Named entity recognition modules.
Module class | NE types |
---|---|
Chemspot | Chemical |
GeniaTagger | Gene, Protein |
Species | Taxon |
StanfordNER | Person, Location, Organization |
Stanza | Person, Location, Organization, Number, Currency |
StanfordCoreNLP | 24 types |
Segmentation
Module class | Segments |
---|---|
SeSMig | Sentences |
WoSMig | Words |
Stanza | Tokens, sentences |
StanfordCoreNLP | Tokens, sentences |
Word and sentence splitting
The AlvisNLP distribution ships with a ready-made complete word and sentence splitter plan that can be imported like this:
<seg href="res://segmentation.plan"/>
This plan combines several modules that correctly handle Latin abbreviations, line-break hyphens, numbers, and dates.
To force entities to be kept as single tokens, place them as annotations in the layer named rigid-entities, which is where this plan expects them.
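A minimal sketch of a complete plan that reads text files and then applies the segmentation plan. The import line and the sourcePath parameter come from this page; the alvisnlp-plan root element and class attribute follow the usual plan syntax, which is assumed here, and the plan id and path are hypothetical.

```xml
<alvisnlp-plan id="segment-example">
  <!-- Step 1: read plain text documents ("corpus/" is a placeholder path) -->
  <read class="TextFileReader">
    <sourcePath>corpus/</sourcePath>
  </read>

  <!-- Step 2: split the section contents into words and sentences -->
  <seg href="res://segmentation.plan"/>
</alvisnlp-plan>
```

Any module that creates annotations in the rigid-entities layer would need to run between the two steps so that the splitter can take those entities into account.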
Linguistic processing
Module class | Function |
---|---|
LinguaLID | Language identification |
Ab3P | Abbreviation recognition |
CCGParser | Dependency parsing |
CCGPosTagger | POS-tagging |
EnjuParser | Dependency parsing |
StanfordParser | Dependency parsing |
GeniaTagger | POS-tagging, lemmatization |
PorterStemmer | Stemming |
Stanza | Tokenization, POS-tagging, lemmatization, dependency parsing |
StanfordCoreNLP | Tokenization, POS-tagging, lemmatization, dependency parsing |
TreeTagger | POS-tagging, lemmatization |
YateaExtractor | Term extraction |
Miscellaneous
Module class | Function |
---|---|
Assert | Check assertions on selected elements |
ClearLayers | Empty layers of all annotations |
HttpServer | Halts processing and lets you browse the data structure |
InsertContents | Clone sections and insert contents |
KeywordsSelector | Select keywords using the specified metric |
MergeLayers | Copy annotations from several layers to one target layer |
MergeSections | Merge several sections of each document into a single one |
NGrams | Create n-grams of annotations |
PythonScript | Runs a Python script |
RemoveContents | Clone sections and crop contents |
RemoveEquivalent | Deduplicate elements using custom equality |
RemoveOverlaps | Remove overlapping annotations in a layer |
SetFeature | Set a feature on selected elements |
Shell | Enter interactive mode |
SplitOverlaps | Split overlapping annotations |
SplitSections | Split sections |