4. How Minimal?
• Only requires a lexicon as input
– a text file
• Only two components:
1. process the lexicon (offline)
2. produce the annotations (on‐the‐fly)
• GNU Bash shell script
– Uses the high-performance grep and awk tools
– Portability: any Unix‐like operating system
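The two components above can be sketched in a few lines of Bash. This is a minimal illustration under stated assumptions, not the actual MER scripts: the file names, the toy lexicon entries, and the `annotate` function are hypothetical, and GNU grep is assumed for the combined `-o -b` behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of lexicon matching with grep and awk (not the MER code).

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Component 1 (offline): normalise the lexicon once.
# Lowercase every entry, skip blank lines, and remove duplicates.
printf '%s\n' 'Aspirin' 'nitric oxide' '' 'aspirin' > "$workdir/lexicon.txt"
awk 'NF { print tolower($0) }' "$workdir/lexicon.txt" | sort -u \
  > "$workdir/lexicon.processed.txt"

# Component 2 (on-the-fly): find every lexicon term in a text.
# grep flags: -F fixed strings, -f read patterns from a file,
#             -w whole words only, -o print only the match,
#             -b prefix each match with its byte offset.
# awk turns "offset:term" into "start<TAB>end<TAB>term".
annotate() {
  local text=$1 lexicon=$2
  printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' |
    grep -o -b -w -F -f "$lexicon" |
    awk -F: '{ printf "%d\t%d\t%s\n", $1, $1 + length($2), $2 }'
}

annotations=$(annotate 'Aspirin inhibits nitric oxide release.' \
  "$workdir/lexicon.processed.txt")
printf '%s\n' "$annotations"
```

With GNU grep, this prints `0 7 aspirin` and `17 29 nitric oxide` (tab-separated). Doing the normalisation offline keeps the on-the-fly step down to a single grep pass over the text.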
11. Input: Lexicons
• Cell line and cell type:
– Cellosaurus
• Chemical:
– HMDB, ChEBI and ChEMBL
• Disease:
– Human Disease Ontology
• miRNA:
– miRBase
• Protein:
– Protein Ontology
• Subcellular structure:
– cellular component aspect of Gene Ontology
• Tissue and organ:
– tissue and organ subsets of UBERON
https://github.com/lasigeBioTM/MER/raw/biocreative2017/data/TIPS_MER_lexicons_Jan2017.zip
13. Input: text
• jq
– a command‐line JSON processor
– to parse the requests
• cURL
– to download each document
• Parsers
– PubMed, Patents, PMC
https://github.com/lasigeBioTM/MER/tree/biocreative2017/external_services
• NO CACHE
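A rough sketch of how jq and cURL could work together on one request. The JSON layout, its field names (`documents`, `source`, `id`), and the `example.org` URL are invented for illustration; the real request format and document services are not described here.

```shell
#!/usr/bin/env bash
# Hypothetical request layout: the field names are assumptions, not the real format.
request='{"documents":[{"source":"PubMed","id":"1001"},{"source":"PMC","id":"PMC2002"}]}'

# Parse the request with jq.
# -r emits raw strings; @tsv joins the two fields of each document with a tab.
list_documents() {
  jq -r '.documents[] | [.source, .id] | @tsv' <<< "$1"
}

# Download one document with cURL on every call (no cache between requests).
# The URL is a placeholder, so this function is defined but not called here.
fetch_document() {
  local source=$1 id=$2
  curl --silent --show-error --fail "https://example.org/${source}/${id}"
}

documents=$(list_documents "$request")
printf '%s\n' "$documents"
```

Each output line holds one source and one identifier, which a loop such as `while IFS=$'\t' read -r source id; do fetch_document "$source" "$id"; done` could hand to the matching parser.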
15. Infrastructure
• Three virtual machines (VMs)
– Each had 8 GB of RAM and 4 CPUs @ 1.7 GHz
– CentOS Linux release 7.3.1611 (Core)
• One VM (primary) to process the requests, distribute the jobs, and execute MER.
• The other two VMs (secondary) just execute MER.
• NGINX as HTTP server running CGI scripts
– high performance
• Task Spooler to manage and distribute jobs
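One way to picture the job distribution is a round-robin plan over the three VMs. The sketch below is a dry run that only prints the commands it would queue. The host names, the `run_mer.sh` script, and the use of ssh to reach the secondary VMs are all assumptions; only `tsp` (the Task Spooler client, installed as `ts` on some systems) comes from the setup described above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the queueing commands instead of running them.
# Host names and run_mer.sh are hypothetical placeholders.
hosts=(primary secondary1 secondary2)

plan_jobs() {
  local i=0 doc host
  for doc in "$@"; do
    # Round-robin: document i goes to host i modulo the number of hosts.
    host=${hosts[i % ${#hosts[@]}]}
    if [ "$host" = primary ]; then
      # The primary VM queues the job in its own Task Spooler.
      printf 'tsp ./run_mer.sh %s\n' "$doc"
    else
      # Assumption: secondary VMs are reached over ssh.
      printf 'tsp ssh %s ./run_mer.sh %s\n' "$host" "$doc"
    fi
    i=$((i + 1))
  done
}

plan=$(plan_jobs doc1 doc2 doc3 doc4)
printf '%s\n' "$plan"
```

For four documents the plan assigns doc1 and doc4 to the primary VM and one document to each secondary VM.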