MER: a Minimal Named-Entity Recognition Tagger and Annotation Server


Francisco Couto

Presented at the BioCreative V.5 Workshop, April 26-27, 2017


MER: a Minimal Named‐Entity Recognition Tagger and Annotation Server
Francisco M. Couto, Luis F. Campos, and Andre Lamurias
LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
BioCreative V.5 Workshop, April 26‐27, 2017

Why Minimal?
• TIPS (Technical interoperability and performance of annotation servers)
– it’s cool, we have to participate somehow 
• But we have limited computational resources
• Idea: Go Minimal
– Minimize the number of tools and steps to
perform Named‐Entity Recognition (NER)

What is Minimal?
• Flexibility
– Simple input
• Autonomy
– minimal set of components and software
dependencies
• Efficiency
– Low execution time

How Minimal?
• Only requires a lexicon as input
– a text file
• Only two components:
1. process the lexicon (offline)
2. produce the annotations (on‐the‐fly)
• GNU Bash shell script
– Uses the high‐performance grep and awk tools
– Portability: any Unix‐like operating system

Input
• lexicon text file
α‐maltose
nicotinic acid
nicotinic acid D‐ribonucleotide
nicotinic acid‐adenine dinucleotide phosphate

Pre‐Processing
== one‐word (...word1.txt)
α.maltose
== two‐word (...word2.txt)
nicotinic acid
== more‐words (...words.txt)
nicotinic acid d.ribonucleotide
nicotinic acid.adenine dinucleotide phosphate
== first‐two‐words (...words2.txt)
nicotinic acid
nicotinic acid.adenine
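
The split shown above can be sketched in a few lines of shell. This is a minimal illustration, not the authors' script: the function name `preprocess_lexicon`, the `_word1.txt`-style file names and the choice to map only ASCII hyphens to `.` are assumptions, and `a-maltose` is an ASCII stand-in for α‐maltose.

```shell
#!/usr/bin/env bash
# Sketch of the offline pre-processing step (hypothetical names throughout).
export LC_ALL=C   # byte-wise sort order, so the output is reproducible

preprocess_lexicon() {
  local lexicon=$1 prefix=$2
  local norm
  norm=$(mktemp)
  # Normalise: lowercase, then turn hyphens into '.', which a regular
  # expression reads as "any single character".
  tr '[:upper:]' '[:lower:]' < "$lexicon" | sed 's/-/./g' | sort -u > "$norm"
  # Split the terms by their number of space-separated words.
  awk 'NF == 1' "$norm" > "${prefix}_word1.txt"
  awk 'NF == 2' "$norm" > "${prefix}_word2.txt"
  awk 'NF >= 3' "$norm" > "${prefix}_words.txt"
  # First two words of every longer term.
  awk 'NF >= 3 { print $1, $2 }' "$norm" | sort -u > "${prefix}_words2.txt"
  rm -f "$norm"
}

# Demo on the four example terms.
workdir=$(mktemp -d)
printf '%s\n' 'a-maltose' 'nicotinic acid' \
  'nicotinic acid D-ribonucleotide' \
  'nicotinic acid-adenine dinucleotide phosphate' > "$workdir/lexicon.txt"
preprocess_lexicon "$workdir/lexicon.txt" "$workdir/lexicon"
for f in "$workdir"/lexicon_*.txt; do
  echo "== $(basename "$f")"
  cat "$f"
done
```

Keeping the first two words of the longer terms in their own file presumably lets the tagger discard most multi-word candidates cheaply before checking full terms, though the slide does not say so explicitly.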

Recognition
• Common Solution
– Apply grep directly to the input text
– execution time is proportional to the size of the
lexicon
• Inverted Solution
– input text as patterns matched against the lexicon
– more than 100 times faster (measured with the TIPS chemical lexicon)
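
A minimal sketch of the inverted idea, assuming pre-processed one-word and two-word lexicon files: the tokens of the input text become the grep patterns, and the lexicon file is what gets searched. The function `match_inverted`, the file names and the simple tokenisation are hypothetical, and `a-maltose` is an ASCII stand-in for α‐maltose; the authors' script may differ.

```shell
#!/usr/bin/env bash
# Sketch of the "inverted" lookup (hypothetical names, simplified tokeniser).
export LC_ALL=C

# Print the one-word and two-word lexicon entries that occur in the text.
match_inverted() {
  local text=$1 word1_file=$2 word2_file=$3
  local tokens pairs
  tokens=$(mktemp); pairs=$(mktemp)
  # One lowercase token per line; strip one trailing punctuation mark,
  # then map hyphens to '.' exactly as in the pre-processed lexicon.
  printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' \
    | sed -e 's/[.,;:]$//' -e 's/-/./g' > "$tokens"
  # Adjacent token pairs: "t1 t2", "t2 t3", ...
  awk 'NR > 1 { print prev, $0 } { prev = $0 }' "$tokens" > "$pairs"
  # The text supplies the patterns (-f); the lexicon is the searched file.
  # -F: fixed strings, -x: the whole lexicon line must match.
  grep -F -x -f "$tokens" "$word1_file"
  grep -F -x -f "$pairs" "$word2_file"
  rm -f "$tokens" "$pairs"
}

# Demo with a tiny pre-processed lexicon.
demo_dir=$(mktemp -d)
printf '%s\n' 'a.maltose' > "$demo_dir/word1.txt"
printf '%s\n' 'nicotinic acid' > "$demo_dir/word2.txt"
match_inverted 'a-maltose and nicotinic acid D-ribonucleotide was found, but not nicotinic acid' \
  "$demo_dir/word1.txt" "$demo_dir/word2.txt"
```

The number of patterns now depends on the length of the text rather than on the size of the lexicon, which is the property the slide credits for the speed-up.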

Output
./get_entities.sh 'α‐maltose and nicotinic acid D‐ribonucleotide was found, but not nicotinic acid' lexicon
0       9       α‐maltose
14      28      nicotinic acid
65      79      nicotinic acid
14      45      nicotinic acid D‐ribonucleotide
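
The three output columns read as start offset, end offset and matched text, with 0-based character positions and an exclusive end. The following sketch shows one way such offsets can be computed with awk; `find_offsets` is a hypothetical helper, not part of MER, and `a-maltose` is a 9-character ASCII stand-in for α‐maltose so that byte and character positions coincide.

```shell
#!/usr/bin/env bash
# Sketch: locate every occurrence of a term and report 0-based offsets.
export LC_ALL=C

# Print "start<TAB>end<TAB>matched text" for each case-insensitive match.
find_offsets() {
  local text=$1 term=$2
  awk -v text="$text" -v term="$term" 'BEGIN {
    lower = tolower(text); needle = tolower(term)
    len = length(needle); pos = 1
    while (len > 0 && (i = index(substr(lower, pos), needle)) > 0) {
      start = pos + i - 2          # awk strings are 1-based, offsets 0-based
      printf "%d\t%d\t%s\n", start, start + len, substr(text, start + 1, len)
      pos = pos + i                # resume one character after this match start
    }
  }'
}

text='a-maltose and nicotinic acid D-ribonucleotide was found, but not nicotinic acid'
for term in 'a-maltose' 'nicotinic acid' 'nicotinic acid D-ribonucleotide'; do
  find_offsets "$text" "$term"
done
```

With the stand-in term the loop prints the same offsets as the example output: 0 and 9, 14 and 28, 65 and 79, then 14 and 45.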

Input: Lexicons
• Cell line and cell type
– Cellosaurus
• Chemical
– HMDB, ChEBI and ChEMBL
• Disease
– Human Disease Ontology
• miRNA
– miRBase
• Protein
– Protein Ontology
• Subcellular structure
– cellular component aspect of Gene Ontology
• Tissue and organ
– tissue and organ subsets of UBERON
https://github.com/lasigeBioTM/MER/raw/biocreative2017/data/TIPS_MER_lexicons_Jan2017.zip

Lexicon Size
• more than 1M terms composed of more than
2M words and more than 25M characters

Input: text
• jq
– a command‐line JSON processor
– to parse the requests
• cURL
– to download each document
• Parsers
– PubMed, Patents, PMC
https://github.com/lasigeBioTM/MER/tree/biocreative2017/external_services
• No cache: documents are downloaded again for every request

Output
• Added some more columns to MER output
– BeCalm TSV format
• The score
– score = 1 − 1/ln(nc)
– nc = number of characters of the recognized term
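
As a worked example of the score formula, the sketch below evaluates 1 − 1/ln(nc) with awk for a few term lengths. The function name `mer_score`, the four-decimal rounding and the guard for nc ≤ 1 (where ln(nc) is zero) are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# Sketch: confidence score as a function of the term length in characters.
export LC_ALL=C   # keep '.' as the decimal separator

mer_score() {
  awk -v nc="$1" 'BEGIN {
    if (nc <= 1) { print "undefined"; exit }   # ln(1) = 0 would divide by zero
    printf "%.4f\n", 1 - 1 / log(nc)           # awk log() is the natural log
  }'
}

mer_score 9     # length of a 9-character term such as α‐maltose
mer_score 14    # length of "nicotinic acid"
mer_score 31    # length of "nicotinic acid D‐ribonucleotide"
```

The score grows slowly with term length (about 0.54 for 9 characters, 0.62 for 14 and 0.71 for 31), so longer matches are ranked as more reliable.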

Infrastructure
• Three virtual machines (VMs)
– Each had 8 GB of RAM and 4 CPUs @ 1.7 GHz
– CentOS Linux release 7.3.1611 (Core)
• One primary VM processes the requests, distributes the jobs, and executes MER
• The other two (secondary) VMs only execute MER
• NGINX as HTTP server running CGI scripts
– high performance
• Task Spooler to manage and distribute jobs

Results
• April 21, 2017
• less than 3 seconds on average

Conclusions
• MER: a minimal NER tagger
– Flexible: extensible to any lexicon
– Autonomous: only requires a GNU Bash shell
– Efficient: high‐performance capacity of grep
• Annotation Server
– developed in‐house
– minimal software dependencies
– open‐source
• Future: entity linking functionality in MER

Acknowledgments
• Portuguese National Distributed Computing
Infrastructure (http://www.incd.pt)
• Links
– https://github.com/lasigeBioTM/MER
– http://labs.fc.ul.pt/mer/
