This document describes MOLIERE, an automatic biomedical hypothesis generation system. It constructs a large network from over 25 million biomedical documents and uses it to identify implicit connections between medical concepts. Given a query of two concepts, it finds the shortest path between them in the network and analyzes nearby nodes and documents to generate a hypothesis topic model. It has applications in drug repurposing and extending knowledge to new domains. Open research questions include result interpretation, system verification, and incorporating additional data sources.
MOLIERE: Automatic Biomedical Hypothesis Generation System
1. MOLIERE
Automatic Biomedical Hypothesis Generation System
Justin Sybrandt (Clemson University, School of Computing)
Michael Shtutman (University of South Carolina, College of Pharmacy)
Ilya Safro (Clemson University, School of Computing)
2. UNDISCOVERED PUBLIC KNOWLEDGE
• Proposed by Swanson in 1986
• The set of public knowledge is too large
• Contains implicit connections
3. HUMAN LIMITS
• No person can read everything
• 2,000 – 4,000 new papers daily
• People Specialize
• Limits knowledge sharing across disciplines
• Bias and Inconsistent Assumptions
8. RELATED APPROACHES
Methodology and highlights:
ARROWSMITH
• One of the first hypothesis generation systems.
• Found link between Fish Oil and Raynaud’s Disease.
DiseaseConnect
• Finds connections between genes and diseases.
• Displays information in an interactive prompt.
BioLDA
• Constructs high quality topic models aided by domain-specific information.
• Identified link between Venlafaxine and HTR1A.
9. RELATED APPROACHES
Methodology | Limitation | Reference
ARROWSMITH | Limited Document Set | Neil R Smalheiser and Don R Swanson. 1998.
DiseaseConnect | Limited Document Set | Chun-Chi Liu et al. 2014.
BioLDA | Limited Vocabulary | Huijun Wang et al. 2011.
10. NEW APPROACH
• Construct a large network
• Identify meaningful paths
• Extend paths to neighboring nodes
• Mine neighborhoods for important topics
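The first three steps above can be sketched on a toy graph. This is a minimal illustration, not the MOLIERE implementation: the node names, edge weights, and the helper names `shortest_path` and `extend_path` are all hypothetical, and plain Dijkstra stands in for whatever path search the real system uses.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra over a dict-of-dicts weighted graph; returns a node list or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nbr, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    # Walk predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def extend_path(graph, path):
    """Return the path nodes plus all of their direct neighbors."""
    region = set(path)
    for node in path:
        region.update(graph.get(node, {}))
    return region


# Hypothetical toy network mixing concept, phrase, and document nodes.
toy_graph = {
    "concept_a": {"phrase_1": 1.0, "doc_1": 2.0},
    "phrase_1": {"concept_a": 1.0, "phrase_2": 1.0, "doc_1": 0.5},
    "phrase_2": {"phrase_1": 1.0, "concept_b": 1.0, "doc_2": 0.5},
    "concept_b": {"phrase_2": 1.0},
    "doc_1": {"concept_a": 2.0, "phrase_1": 0.5},
    "doc_2": {"phrase_2": 0.5},
}

path = shortest_path(toy_graph, "concept_a", "concept_b")
region = extend_path(toy_graph, path)
```

The documents that fall inside `region` would then be the input to the topic-mining step.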
12. NETWORK CONSTRUCTION
• National Library of Medicine (NLM)
• National Center for Biotechnology Information (NCBI)
• MEDLINE
• 25 Million Documents
• Titles and Abstracts
13. NETWORK CONSTRUCTION
• SPECIALIST NLP Toolset
• Natural Language Toolkit (NLTK)
Raw Data: This is some example text that is hopefully more understandable than the typical medical abstract.
Cleaned Text: some example text that hopefully more understandable than typical medical abstract
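The raw-to-cleaned example on this slide can be reproduced with a small sketch. It is only an approximation of the cleaning step: the stopword set below is a hypothetical one chosen to match the slide's example, and the function name `clean_text` is an assumption. The actual pipeline relies on the SPECIALIST tools and NLTK, which are not called here.

```python
import re

# Hypothetical stopword set, chosen only so the slide's example is reproduced.
STOPWORDS = {"this", "is", "the"}


def clean_text(raw):
    """Lowercase, drop punctuation, and remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


raw = ("This is some example text that is hopefully more understandable "
       "than the typical medical abstract.")
cleaned = clean_text(raw)
```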
14. NETWORK CONSTRUCTION
• Topical Pattern Mining
• Groups together common phrases
• Creates 2,3,…,n-grams
Cleaned Text: some example text that hopefully more understandable than typical medical abstract
Phrases: example text hopefully more understandable typical medical abstract
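As a rough illustration of grouping common phrases into n-grams, the sketch below counts every 2- to n-gram across a set of cleaned documents and keeps those seen at least a minimum number of times. This frequency filter is a simplified stand-in, not the topical pattern mining the slide refers to; the function name, the thresholds, and the underscore joining are assumptions.

```python
from collections import Counter


def frequent_ngrams(docs, max_n=3, min_count=2):
    """Count all 2..max_n-grams and keep those seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Join the tokens of each surviving n-gram into one phrase token.
    return {"_".join(gram): c for gram, c in counts.items() if c >= min_count}


# Hypothetical two-document corpus.
docs = ["typical medical abstract", "short medical abstract"]
phrases = frequent_ngrams(docs)
```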
15. NETWORK CONSTRUCTION
• FastText: Projects phrases into real-valued vector space
• Long word embedding composed from subwords
Phrases: example text hopefully more understandable typical medical abstract
[Figure: phrases projected into a point cloud]
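The subword idea can be sketched as follows: a word is wrapped in boundary markers, split into character n-grams, and its vector is built from the vectors of those n-grams. This is a toy illustration only. The per-n-gram vectors here are pseudo-random placeholders derived from a hash, not trained embeddings, and `char_ngrams`, `gram_vector`, and `embed` are hypothetical names rather than the FastText API.

```python
import hashlib
import random


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def gram_vector(gram, dim):
    """Deterministic placeholder vector for one n-gram (stands in for a trained one)."""
    seed = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed(word, dim=8):
    """Average the placeholder vectors of all subword n-grams of the word."""
    grams = char_ngrams(word)
    total = [0.0] * dim
    if not grams:
        return total
    for gram in grams:
        for k, value in enumerate(gram_vector(gram, dim)):
            total[k] += value
    return [t / len(grams) for t in total]


vector = embed("abstract")
```

Because the vector is assembled from subwords, a long or unseen phrase token still receives a vector as long as it shares character n-grams with known text.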
30. VENLAFAXINE RESULTS
• Hypothesis represented as a topic model
• Shown: Venlafaxine vs. HTR1A
TOPIC 0: antidepressant_drugs, milnacipran, org, selected, ht
TOPIC 1: increase, reduced, treatment, dorsal_raphe_nucleus, effect
TOPIC 2: rats, sert, ht_receptor, escitalopram, potency
[Figure: diagram linking “Sertraline”, antidepressant, serotonin, and HTR1A, with edge labels "produces" and "controlled by"]
31. DRUG REPURPOSING EXAMPLE
• Drugs can be modified to treat new diseases
• Decreases drug development time and costs
[Figure: diagram with nodes HIV Drug, DDX3, and Cancer, and a "?" marking an unknown link]
32. DRUG REPURPOSING EXPERIMENT
• Ran nearly 10,000 queries involving DDX3:
[Figure: DDX3 surrounded by query categories: Signal Transduction, Transcription, Adhesion, Cancer, Development, Translation, RNA]
36. OPEN RESEARCH QUESTIONS
• Result Interpretation
• System Verification
• Automatic Network Tuning
• Streaming Network Reconstruction
• Inclusion of Additional Data Sources
37. THANK YOU
J. Sybrandt, M. Shtutman, I. Safro, “MOLIERE: Automatic Biomedical Hypothesis Generation System”
Code and Data: https://people.cs.clemson.edu/~isafro/software.html
Email: JSYBRAN@CLEMSON.EDU