AlvisNLP data structure
The AlvisNLP data structure is an object shared by all processing steps. It is passed from one processing step to the next. The data structure contains the document structure and content, as well as the annotations produced by successive steps. Understanding the data structure is crucial to using AlvisNLP, since this object is how the steps communicate with each other.
The following figure presents a UML-like specification of the AlvisNLP data structure.
A Corpus object represents a collection of documents. In an AlvisNLP run, the corpus is a unique object passed from module to module. A Corpus object has features and documents.
A Document object represents a single document. Each document has an identifier which is unique in the corpus. A Document object has features and sections.
A Section object contains a piece of the document’s text contents. Each section has a name, contents, features, layers, and relations.
A Layer object is an annotation container. A Layer object has a name unique in the section.
An Annotation object represents a span of text created by a module. Each annotation is included in at least one layer. An Annotation object has a start and an end, which represent the coordinates of the annotation in the section’s contents, and features.
A Relation object is a tuple container. A Relation object has a name unique in the section and features.
A Tuple object represents a relation between several elements in the data structure. A Tuple object has several arguments; each argument is an element (Relation, but most often Tuple) accessible through a role name. A Tuple object also has features.
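The containment hierarchy described above can be sketched in a few lines of Python. This is only an illustrative model, not the AlvisNLP API: the class and field names mirror the descriptions in this section, but the constructors, the `covered_text` helper, the layer and relation names, and the choice of an exclusive `end` offset are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stand-in for features: key -> all values set for that key.
Features = Dict[str, List[str]]


@dataclass
class Annotation:
    start: int  # offset in the section's contents
    end: int    # assumed exclusive here; the text above does not say
    features: Features = field(default_factory=dict)


@dataclass
class Layer:
    name: str  # unique in the section
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Tuple:
    # role name -> element (any element of the data structure)
    arguments: Dict[str, object] = field(default_factory=dict)
    features: Features = field(default_factory=dict)


@dataclass
class Relation:
    name: str  # unique in the section
    tuples: List[Tuple] = field(default_factory=list)
    features: Features = field(default_factory=dict)


@dataclass
class Section:
    name: str
    contents: str
    layers: Dict[str, Layer] = field(default_factory=dict)
    relations: Dict[str, Relation] = field(default_factory=dict)
    features: Features = field(default_factory=dict)


@dataclass
class Document:
    identifier: str  # unique in the corpus
    sections: List[Section] = field(default_factory=list)
    features: Features = field(default_factory=dict)


@dataclass
class Corpus:
    documents: Dict[str, Document] = field(default_factory=dict)
    features: Features = field(default_factory=dict)


def covered_text(section: Section, annotation: Annotation) -> str:
    """Return the span of the section's contents covered by an annotation."""
    return section.contents[annotation.start:annotation.end]


# Build a tiny corpus: one document, one section, two annotations, one tuple.
section = Section(name="title", contents="AlvisNLP data structure")
first = Annotation(start=0, end=8)
second = Annotation(start=9, end=13)
section.layers["words"] = Layer(name="words", annotations=[first, second])

relation = Relation(name="dependencies")
relation.tuples.append(Tuple(arguments={"head": second, "dependent": first}))
section.relations[relation.name] = relation

document = Document(identifier="doc1", sections=[section])
corpus = Corpus(documents={document.identifier: document})
```

Keying layers and relations by name within a section, and documents by identifier within the corpus, is one simple way to reflect the uniqueness constraints stated above.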
Features are key-value pairs that hold information about an element, such as its type, a tag, or a property. Feature keys are not necessarily unique within an element: the same key can carry several values, but looking up a feature by its key returns the last value.
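The lookup behaviour for repeated keys can be illustrated with a small multi-valued map. This is a sketch under assumptions, not AlvisNLP code: the `FeatureMap` class and its `add`, `get_last`, and `get_all` methods are hypothetical names, and "last value" is taken to mean the most recently added one.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class FeatureMap:
    """Hypothetical store where a key may hold several values."""

    def __init__(self) -> None:
        self._values: Dict[str, List[str]] = defaultdict(list)

    def add(self, key: str, value: str) -> None:
        # Adding a value never overwrites earlier values for the same key.
        self._values[key].append(value)

    def get_last(self, key: str) -> Optional[str]:
        # A plain lookup returns the most recently added value, if any.
        values = self._values.get(key)
        return values[-1] if values else None

    def get_all(self, key: str) -> List[str]:
        # Earlier values stay available, in insertion order.
        return list(self._values.get(key, []))


features = FeatureMap()
features.add("pos", "NN")
features.add("pos", "NNP")

last_pos = features.get_last("pos")
all_pos = features.get_all("pos")
missing = features.get_last("lemma")
```

Here `last_pos` holds the second value added for `pos`, while `all_pos` still contains both values.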