SIA is implemented in Java and uses RabbitMQ [8] as its message bus. In the following, each component of SIA is described in detail.
Front end
The front end encapsulates the annotation processing for the clients and serves as the entry point to the system. Currently, it provides a REST endpoint according to the Becalm-TIPS task specification. Other entry points, such as interactive parsing, can easily be added. Incoming requests are translated into messages and forwarded to an input queue. This way, the overall processing in the front end is very lightweight, and new requests can be handled irrespective of any ongoing annotation processing. Furthermore, the back end does not need to be online at the time of a request, but could instead be started dynamically based on observed load.
To handle multiple concurrent requests with varying deadlines, we make use of the fact that the input queue is a priority queue and prioritize messages with an earlier expiry date. Requests that are already running are not canceled; the priority only serves as a fast path to the front of the queue. The message expiry date, as provided by the calling client, is translated into a message priority. To estimate the urgency of each message, this translation takes into account the messages currently being processed and their deadlines, as well as statistics on past processing times.
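The exact urgency formula is not given here, so the following is only a minimal sketch of how an expiry date could be mapped to a queue priority. The class name `DeadlinePriority`, the priority range 0 to 9, the `horizon` parameter and the linear scaling are all assumptions made for illustration; the estimated processing time stands in for the past processing statistics mentioned above.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper (not part of SIA): maps a request deadline to a queue priority. */
final class DeadlinePriority {
    /** Assumed upper bound of the priority range; higher means more urgent. */
    static final int MAX_PRIORITY = 9;

    /**
     * @param now                 current time
     * @param deadline            expiry date supplied by the calling client
     * @param estimatedProcessing processing time estimated from past requests
     * @param horizon             slack at or above which a request gets the lowest priority
     */
    static int priorityFor(Instant now, Instant deadline,
                           Duration estimatedProcessing, Duration horizon) {
        // Slack: time left once the expected processing time is accounted for.
        long slackMillis = Duration.between(now, deadline).minus(estimatedProcessing).toMillis();
        if (slackMillis <= 0) {
            return MAX_PRIORITY; // no time to spare: most urgent
        }
        long horizonMillis = horizon.toMillis();
        if (slackMillis >= horizonMillis) {
            return 0; // plenty of time: least urgent
        }
        // Linear interpolation between the two extremes.
        double urgency = 1.0 - (double) slackMillis / horizonMillis;
        return (int) Math.round(urgency * MAX_PRIORITY);
    }
}
```

With a horizon of ten minutes and an estimated processing time of ten seconds, a request expiring in ten seconds would receive the highest priority, while one expiring in an hour would receive the lowest.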
The front end also handles validation and authorization, which moves this logic into a central place. Furthermore, the front end provides a monitoring entry point into the system, reporting computation statistics, such as request rates, recent document types as well as back end processing counters, for display in dashboards and for observing the current health of the system.
Back end
The back end is concerned with fetching documents from the supported corpus providers, calling the requested annotators for each resulting text fragment, aggregating the results and feeding them to a result handler. It is modeled as a pipeline of message transformations, each of which reads from a message queue and posts to a new one. The message flow starts by reading new requests from the input queue, which is filled by the front end. The front end does not communicate directly with the back end; instead, the input queue is used as a handover point. Since a single annotation request, in the case of the Becalm-TIPS task specification, may contain multiple document ids, incoming messages are first split into document-level messages. Splitting takes one message as input and generates as many individual messages as there are document ids specified. The raw text for each document is then retrieved by passing the messages through corpus adapters. The outcome is the retrieved text, separated into fields for title, abstract and potentially full text.
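The split step can be sketched as follows. The message types `AnnotationRequest` and `DocumentMessage` and their fields are hypothetical; in particular, carrying the total number of parts in each document-level message is an assumption that makes it easy for a later stage to detect when a request is complete.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the split step; the message types are assumptions. */
final class RequestSplitter {
    /** Request-level message: one request id, many document ids. */
    record AnnotationRequest(String requestId, List<String> documentIds) {}

    /** Document-level message; total is the number of parts the request was split into. */
    record DocumentMessage(String requestId, String documentId, int total) {}

    /** Produces exactly one document-level message per document id. */
    static List<DocumentMessage> split(AnnotationRequest request) {
        int total = request.documentIds().size();
        List<DocumentMessage> parts = new ArrayList<>(total);
        for (String documentId : request.documentIds()) {
            parts.add(new DocumentMessage(request.requestId(), documentId, total));
        }
        return parts;
    }
}
```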
Raw text messages are then delivered to registered annotators using a scatter-gather approach. Each message is duplicated (scattered) to the respective input queue of each qualified annotator. To find the annotator, the annotator type required by a message is translated into a queue name, as each annotator has a dedicated input queue. Upon completion, all resulting annotation messages are combined (gathered) into a single message. This design allows new annotators to be added by registering a new input queue and adding it to the annotation type mapping. This mapping is also exposed as a runtime configuration, which allows annotators to be (de-)activated dynamically.
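A minimal sketch of the annotation type mapping used for scattering is given below. The class `AnnotatorRegistry`, its method names and the queue names used in the usage note are assumptions for illustration; only the idea of translating an annotator type into a dedicated queue name, and of switching annotators on and off at runtime, is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry that maps annotation types to annotator input queues. */
final class AnnotatorRegistry {
    private final Map<String, String> queueByType = new ConcurrentHashMap<>();
    private final Set<String> inactive = ConcurrentHashMap.newKeySet();

    /** Registers a new annotator by adding its dedicated input queue to the mapping. */
    void register(String annotationType, String queueName) {
        queueByType.put(annotationType, queueName);
    }

    /** Activates or deactivates an annotator at runtime without removing its mapping. */
    void setActive(String annotationType, boolean active) {
        if (active) {
            inactive.remove(annotationType);
        } else {
            inactive.add(annotationType);
        }
    }

    /** Scatter step: returns the queues that should each receive a copy of the message. */
    List<String> scatterTargets(Collection<String> requestedTypes) {
        List<String> queues = new ArrayList<>();
        for (String type : requestedTypes) {
            String queue = queueByType.get(type);
            if (queue != null && !inactive.contains(type)) {
                queues.add(queue);
            }
        }
        return queues;
    }
}
```

For example, after registering a hypothetical queue `annotator.seth` for the type `MUTATION`, a message requesting `MUTATION` and an unregistered type would be scattered to `annotator.seth` only.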
The next step in the message flow aggregates all annotation results across all documents that belong to the same request. It is the inverse of the initial split operation and aggregates all messages sharing the same unique request id into a single one. Overlapping annotations (e.g., from different annotator components) are merged without any specific post-processing. This strategy gives end users the highest flexibility, as annotations are not silently modified. Finally, the aggregated message is forwarded to the output queue.
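The aggregation step can be sketched as an in-memory aggregator keyed by request id. The types `Annotation` and `DocumentResult`, and the assumption that each document-level result carries the total number of parts, are hypothetical. Overlapping annotations are deliberately kept as they are, in line with the merge strategy described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the aggregation step; all types are assumptions. */
final class ResultAggregator {
    record Annotation(String documentId, int start, int end, String type, String text) {}

    /** Document-level result; total is the number of parts expected for the request. */
    record DocumentResult(String requestId, String documentId, int total,
                          List<Annotation> annotations) {}

    private final Map<String, List<DocumentResult>> pending = new HashMap<>();

    /**
     * Adds one document-level result. Returns the combined annotations once all
     * parts of the request have arrived, and an empty Optional before that.
     */
    synchronized Optional<List<Annotation>> add(DocumentResult part) {
        List<DocumentResult> parts =
                pending.computeIfAbsent(part.requestId(), id -> new ArrayList<>());
        parts.add(part);
        if (parts.size() < part.total()) {
            return Optional.empty(); // still waiting for other documents
        }
        pending.remove(part.requestId());
        List<Annotation> all = new ArrayList<>();
        for (DocumentResult result : parts) {
            all.addAll(result.annotations()); // overlapping annotations are kept unmodified
        }
        return Optional.of(all);
    }
}
```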
While the processing flow is specified in a sequential manner, this does not entail single-threaded execution. Each individual transformer, such as a corpus adapter or an annotator, works independently and can be scaled out further if it presents a processing bottleneck. Furthermore, multiple requests can be handled in parallel at different stages of the pipeline. Fault tolerance is provided by making the message delivery to each transformer transactional and retrying on failure. Overall, the back end specifies a pipeline with an ordered execution flow and provides two injection points where users can, through configuration, add new functionality in the form of additional corpus adapters or new annotation handlers.
To increase the throughput of the back end, multiple instances of SIA can be started on different machines, where each instance would process requests in a round robin fashion.
Supported annotators
To illustrate the extensibility of our approach, we integrated named entity recognition (NER) components for six different entity types into SIA. Mutation names are extracted using SETH [6]. For micro-RNA mentions, we implemented a set of regular expressions [9], which follow the recommendations for micro-RNA nomenclature [10]. Disease names are recognized using a dictionary lookup [11], generated from UMLS disease terms [12], and by using the DNorm tagger [13]. Chemical name mentions are detected with ChemSpot [14], organisms with Linnaeus [15] and gene mentions with Banner [16].
Listing 3 shows the general interface contract SIA expects from each annotator. Each annotator receives an input text and is simply expected to return a set of found annotations. Thus, integrating any of the aforementioned annotators, as well as new ones, is as simple as implementing this interface and registering a new queue mapping.
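To make the contract concrete, the following sketch shows what such an interface and a very small annotator might look like. It is not a reproduction of Listing 3: the names `Annotator`, `Annotation` and `ToyMirnaAnnotator` are assumptions, and the regular expression is a deliberately simplified toy pattern, not the micro-RNA expressions used in SIA.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical annotation: character offsets (end exclusive), entity type, matched text. */
record Annotation(int start, int end, String type, String text) {}

/** Hypothetical contract: text in, set of found annotations out. */
interface Annotator {
    Set<Annotation> annotate(String text);
}

/** Toy annotator for illustration only; the pattern covers just a few simple name forms. */
final class ToyMirnaAnnotator implements Annotator {
    // Optional three-letter species prefix, then miR or let, a number and an optional letter.
    private static final Pattern MIRNA =
            Pattern.compile("\\b(?:[a-z]{3}-)?(?:miR|let)-\\d+[a-z]?\\b");

    @Override
    public Set<Annotation> annotate(String text) {
        Set<Annotation> found = new LinkedHashSet<>();
        Matcher matcher = MIRNA.matcher(text);
        while (matcher.find()) {
            found.add(new Annotation(matcher.start(), matcher.end(), "MIRNA", matcher.group()));
        }
        return found;
    }
}
```

An annotator of this shape has no knowledge of queues or requests, which is what keeps the integration effort limited to the interface and the queue mapping.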
Annotation handlers can be hosted inside of SIA, within the same process, or externally, in a separate process. External hosting allows annotation tools to be integrated across programming languages, operating systems and servers. This is especially useful since most annotators have conflicting dependencies that are either very hard or impossible to resolve. For example, ChemSpot and DNorm use different versions of the Banner tagger, which makes them candidates for external hosting. Multiple servers can also be used to increase the resources available to SIA, e.g., when hosting all annotators on the same machine would exceed the available memory.
Corpus adapters
SIA contains corpus adapters for PubMed, PMC, and the BeCalm patent and abstract servers, all of which communicate with external network services. These components are represented as transformers, which process document ids and return the retrieved source texts. They are implemented following the interface definition shown in Listing 4. If an adapter supports bulk fetching of multiple documents, we feed a configurable number of ids in one invocation.
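Bulk fetching with a configurable batch size can be sketched as follows. This is not a reproduction of Listing 4: the interface `CorpusAdapter`, the record `SourceText` and the helper `BulkFetch` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical retrieved text, separated into fields; fullText may be null. */
record SourceText(String title, String abstractText, String fullText) {}

/** Hypothetical adapter contract: document ids in, retrieved texts keyed by id out. */
interface CorpusAdapter {
    Map<String, SourceText> fetch(List<String> documentIds);
}

final class BulkFetch {
    /** Cuts the id list into consecutive batches of at most batchSize ids. */
    static List<List<String>> batches(List<String> ids, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            out.add(List.copyOf(ids.subList(i, Math.min(i + batchSize, ids.size()))));
        }
        return out;
    }

    /** Invokes the adapter once per batch and merges the results in request order. */
    static Map<String, SourceText> fetchAll(CorpusAdapter adapter, List<String> ids, int batchSize) {
        Map<String, SourceText> all = new LinkedHashMap<>();
        for (List<String> batch : batches(ids, batchSize)) {
            all.putAll(adapter.fetch(batch));
        }
        return all;
    }
}
```

An adapter without bulk support corresponds to a batch size of 1 in this sketch.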
As retrieving the full text translates into calling a potentially unreliable remote service over the network, retry on failure is used in case of recoverable errors. This choice is supported by the observation that the most common error was a temporarily unavailable service endpoint. To spread retries, we use exponential backoff on consecutive failures: the wait interval grows exponentially and is capped at a maximum (initial wait 1 s, multiplier 2, maximum wait 60 s). If a corpus adapter fails to produce a result after retries are exhausted, we mark that document as unavailable and treat it as one without any text. This strikes a trade-off between never advancing the processing, since a document may be part of a larger set of documents to be annotated, and giving up too early in case of transient errors.
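The retry policy can be sketched as follows, using the stated parameters (initial wait 1 s, multiplier 2, maximum wait 60 s). The class `Backoff`, the use of `UncheckedIOException` to signal a recoverable error, the `maxAttempts` parameter and the injectable `sleeper` are assumptions made for illustration; the number of retries used in SIA is not stated here.

```java
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.Optional;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch of retry with capped exponential backoff. */
final class Backoff {
    static final Duration INITIAL_WAIT = Duration.ofSeconds(1);
    static final long MULTIPLIER = 2;
    static final Duration MAX_WAIT = Duration.ofSeconds(60);

    /** Wait time after the given number of consecutive failures: 1 s, 2 s, 4 s, ... up to 60 s. */
    static Duration delay(int failures) {
        long millis = INITIAL_WAIT.toMillis();
        long cap = MAX_WAIT.toMillis();
        for (int i = 1; i < failures && millis < cap; i++) {
            millis *= MULTIPLIER;
        }
        return Duration.ofMillis(Math.min(millis, cap));
    }

    /**
     * Calls fetch up to maxAttempts times. An UncheckedIOException is assumed to
     * signal a recoverable error; any other exception propagates to the caller.
     * An empty Optional means the document is treated as unavailable.
     */
    static <T> Optional<T> fetchWithRetry(Supplier<T> fetch, int maxAttempts,
                                          Consumer<Duration> sleeper) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Optional.ofNullable(fetch.get());
            } catch (UncheckedIOException e) {
                if (attempt < maxAttempts) {
                    sleeper.accept(delay(attempt)); // wait before the next attempt
                }
            }
        }
        return Optional.empty(); // retries exhausted
    }
}
```

Passing the sleeper as a parameter keeps the sketch testable without real waiting; in production it would wrap a call such as `Thread.sleep`.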
Result handler
The result handler processes the aggregated annotation results from the back end by consuming from a dedicated output queue. We implemented a REST component according to the TIPS task specification, which posts these annotations back to a dedicated endpoint. Additional handlers, such as a statistics gatherer or a result archiver, can easily be added.