Keynote Integrative Bioinformatics 2018
https://docs.google.com/document/d/1E7D4_CS0vlldEcEuknXjEnSBZSZCJvbI5w1FdFh-gG4/edit
Can we improve research productivity through providing answers stemming from knowledge graphs? In this presentation, I discuss different ways of building and combining knowledge graphs.
From Text to Data to the World: The Future of Knowledge Graphs
1. From Text to Data to the World: The Future of Knowledge Graphs
Paul Groth | pgroth.com | @pgroth
Thanks to: Matthew Clark, Frederik van den Broek, Anton Yuryev, Maria Shkrob, Sherri Matis-Mitchell,
Timothy Hoctor, Brad Allen, Corey Harper, Ron Daniel, Helena Deus, Olaf Lodbrok
2. June 15, 2018
• Research productivity
• Moving to answers – knowledge graphs
• Building knowledge graphs – from text
• Building knowledge graphs – from data
• Combining knowledge graphs
3. Bloom, N., Jones, C. I., Van Reenen, J., & Webb, M. (2017). Are ideas getting harder to find? (No. w23782). National Bureau of Economic Research.
Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf
9. IN PRACTICE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017).
Searching Data: A Review of Observational Data Retrieval Practices.
arXiv preprint arXiv:1707.06937.
Some observations from @gregory_km survey & interviews:
• The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971
10. ELSEVIER’S BUSINESS: PROVIDING ANSWERS FOR
RESEARCHERS, DOCTORS AND NURSES
My work is moving towards a new field; what should I know?
• Journal articles, reference works, profiles of researchers, funders &
institutions
• Recommendations of people to connect with, reading lists, topic pages
How should I treat my patient given her condition & history?
• Journal articles, reference works, medical guidelines, electronic health
records
• Treatment plan with alternatives personalized for the patient
How can I master the subject matter of the course I am taking?
• Course syllabus, reference works, course objectives, student history
• Quiz plan based on the student’s history and course objectives
11. THE ROLE OF METADATA IN THE SECOND MACHINE AGE – DC-2016 / KØBENHAVN / 13 OCTOBER
ANSWERS ARE ABOUT THINGS, NOT JUST WORKS
Why shouldn’t a search on an author return
information about the author, including the author’s
works? Where was the author born, when did she live,
what is she known for? … All of this is possible, but
only if we can make some fundamental changes in our
approach to bibliographic description. ... The challenge
for us lies in transforming what we can of our data into
interrelated “things” without overindulging that
metaphor.
Coyle, K. (2016). FRBR, before and after: a look at our bibliographical
models. Chicago: ALA Editions.
12. KNOWLEDGE GRAPHS DEFINED
• Knowledge graphs are “graph structured knowledge bases (KBs) which store factual information in form of relationships between entities”
• (Nickel, M., Murphy, K., Tresp, V. and Gabrilovich, E. (2015). A review of relational machine learning for knowledge graphs. arXiv:1503.00759v3)
• Knowledge graphs are metadata evolved beyond the focus on the work, linking people, concepts,
things and events
• Knowledge Graphs are focused on things to provide answers
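As a rough illustration of "relationships between entities", a knowledge graph can be held as a set of (subject, predicate, object) triples and queried by pattern. This is only a sketch: the identifiers and predicate names below are invented placeholders, not taken from any real dataset or product.

```python
# Each fact is a (subject, predicate, object) triple; all names are hypothetical.
TRIPLES = {
    ("author:1", "type", "Person"),
    ("author:1", "born_in", "place:1"),
    ("author:1", "wrote", "work:1"),
    ("author:1", "wrote", "work:2"),
    ("work:1", "about", "concept:1"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def describe(triples, entity):
    """Group everything known about one entity by predicate."""
    facts = {}
    for _, predicate, obj in match(triples, subject=entity):
        facts.setdefault(predicate, []).append(obj)
    return facts
```

Calling `describe(TRIPLES, "author:1")` returns the author's type, birthplace and works together, which is the "thing, not just work" view of a search result.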
16. ONTOLOGY MAINTENANCE
• Total concepts = 540,632
• 100+ person years of clinical expert knowledge
17. One Weird Trick from Natural Language Processing (NLP)
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems look for very specific things, like drug-drug interactions
• This gives the best accuracy for that one kind of data, but misses all the other concepts and relations in the text
• For a broad knowledge base, use Open Information Extraction, which relies only on some knowledge of grammar
• The weird trick for open information extraction … a simple algorithm, known as ReVerb*:
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
Example sentence: In addition, brain scans were performed to exclude other causes of dementia.
* Fader et al. Identifying Relations for Open Information Extraction
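The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the ReVerb implementation: the part-of-speech tags are assigned by hand (one letter per token) instead of coming from a tagger, the relation-phrase pattern is a simplified stand-in for ReVerb's syntactic constraint, and `min_pairs` is a hypothetical parameter name.

```python
import re
from collections import defaultdict

# Hand-tagged example sentence (assumed tags): V verb, P preposition/particle,
# N noun, A adjective, D determiner, X anything else.
SENTENCE = [
    ("In", "P"), ("addition", "N"), (",", "X"), ("brain", "N"), ("scans", "N"),
    ("were", "V"), ("performed", "V"), ("to", "P"), ("exclude", "V"),
    ("other", "A"), ("causes", "N"), ("of", "P"), ("dementia", "N"), (".", "X"),
]

# Step 1 (simplified): verbs, optionally continued by "preposition + verbs",
# optionally ending in a preposition.
RELATION = re.compile(r"V+(?:PV+)*P?")
# Step 2: a noun phrase directly before / after the relation phrase.
NP_BEFORE = re.compile(r"D?A*N+$")
NP_AFTER = re.compile(r"D?A*N+")


def extract(tagged):
    """Return (argument1, relation phrase, argument2) triples for one sentence."""
    words = [word for word, _ in tagged]
    tags = "".join(tag for _, tag in tagged)
    triples = []
    for rel in RELATION.finditer(tags):
        left = NP_BEFORE.search(tags[:rel.start()])
        right = NP_AFTER.match(tags, rel.end())
        if left and right:
            triples.append((
                " ".join(words[left.start():left.end()]),
                " ".join(words[rel.start():rel.end()]),
                " ".join(words[right.start():right.end()]),
            ))
    return triples


def lexical_filter(triples, min_pairs=2):
    """Step 3: keep relation phrases seen with at least min_pairs distinct argument pairs."""
    pairs = defaultdict(set)
    for arg1, relation, arg2 in triples:
        pairs[relation].add((arg1, arg2))
    return [t for t in triples if len(pairs[t[1]]) >= min_pairs]
```

On the example sentence, `extract(SENTENCE)` yields one triple linking "brain scans" and "other causes" through "were performed to exclude". Step 3 only becomes meaningful over a large corpus, where a relation phrase seen with a single argument pair is likely to be too specific to be useful.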
20. CHALLENGES
Paul Groth, Michael Lauruhn, Antony Scerri, Ron Daniel: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574 (http://arxiv.org/abs/1802.05574) – To appear in COLING 2018
698 unique relations
400 sentences
22. Medical Graph – Statistical correlations at scale
Diagram: ICD-10 diagnosis nodes linked by has_successor edges (example odds ratio: 1.12)
• I65 Occlusion and stenosis of precerebral arteries
• G40 Epilepsy
• I61 Intracerebral hemorrhage
• C71 Malignant neoplasm of brain
has_successor criteria¹:
• Correlation selected by the predictive modelling algorithm
• Number of relations is higher than in the mirrored relation
• p-value < 0.05
• Odds ratios balanced over all covariates
Primary care, secondary care, drug prescriptions, other covariates
5m patients, each with 6 years of longitudinal data
¹ Criteria based on: Jensen et al.: Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 2014 Jun 24; 5:4022. doi: 10.1038/ncomms5022.
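To make the has_successor criteria more concrete, the sketch below computes an unadjusted odds ratio with a two-sided Wald p-value from a 2x2 table, then applies the direction and significance checks. This is a deliberate simplification: the odds ratios on the slide come from a predictive model balanced over all covariates, which a 2x2 table does not do. The function names and all counts are invented for illustration.

```python
import math


def odds_ratio_wald(a, b, c, d):
    """Unadjusted odds ratio and two-sided Wald p-value for a 2x2 table.

    a: first diagnosis present, later diagnosis present
    b: first diagnosis present, later diagnosis absent
    c: first diagnosis absent,  later diagnosis present
    d: first diagnosis absent,  later diagnosis absent
    """
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = math.log(odds_ratio) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return odds_ratio, p_value


def is_successor(n_forward, n_mirrored, p_value, alpha=0.05):
    """Direction check: A -> B must occur more often than B -> A and be significant."""
    return n_forward > n_mirrored and p_value < alpha
```

For example, `odds_ratio_wald(40, 60, 20, 80)` gives an odds ratio of about 2.67 with a p-value well below 0.05, so the edge would pass `is_successor` whenever the forward order is also the more frequent one.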
23. Medical Graph in practice, patient 35: risk of depression
• 49 year old man
• Dx: overweight, diabetes, hypertension, anxiety disorder
• Has an absolute risk of 36% of developing depression within the next 4 years
25. Predict 4 year long-term effects, balanced for all co-variables
• Targets for prediction: ICD-coded diagnoses
• Only incident patients per diagnosis are considered, i.e. diagnosis-free 2009 – 2010
• If these patients remain diagnosis-free 2011 – 2014 (observation period), then 0, else 1
• Covariates: all ICD/ATC codes, age and sex measured in 2010
Example: Model to predict “I50 – Heart Failure”
Timeline: I50-free patients, 2009 – 2010 (covariates); 2011 – 2014: remaining I50-free patients are I50− (coded as 0), newly I50-diagnosed patients are I50+ (coded as 1)
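The cohort and label definition for the I50 example can be sketched as follows. The record layout, the function name `build_labels` and the patient data are assumptions made for illustration; only the year ranges and the 0/1 coding come from the slide.

```python
def build_labels(records, target="I50", baseline=(2009, 2010), followup=(2011, 2014)):
    """Map each incident-eligible patient to 0 or 1 for one target diagnosis.

    records: {patient_id: [(year, icd_code), ...]} (hypothetical layout)
    Patients who already carry the target code in the baseline years are excluded.
    """
    labels = {}
    for patient_id, events in records.items():
        target_years = [year for year, code in events if code == target]
        if any(baseline[0] <= year <= baseline[1] for year in target_years):
            continue  # not diagnosis-free at baseline, so not an incident patient
        diagnosed = any(followup[0] <= year <= followup[1] for year in target_years)
        labels[patient_id] = int(diagnosed)
    return labels


# Invented example records.
RECORDS = {
    "p1": [(2009, "E11"), (2012, "I50")],  # newly diagnosed in follow-up: 1
    "p2": [(2010, "I10")],                 # stays I50-free: 0
    "p3": [(2009, "I50"), (2013, "I50")],  # already diagnosed at baseline: excluded
}
```

With these records, `build_labels(RECORDS)` keeps p1 and p2 and drops p3, mirroring the "incident patients only" rule.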
26. Technology stack
Feature extraction
For 3.8m patients:
• age, gender
• all diagnoses: ICD10-coded, 3 digits, i.e. 2054 codes
• all medications: ATC-coded, 5 digits, i.e. 906 codes
• death, hospitalization
Results in: 6277 features
• 1623 targets, 2011-2014
• 2320 covariates, 2010
• 2334 filter-columns, 2009-2010
Data mining
Calculate prevalence, incidence, mean age for all covariates (i.e. diseases and medications)
Machine learning
Predictive modelling for ~1600 targets
• Linear classification model, resulting in odds ratios
• Calculation of p-values
Calculate statistics & build prediction models for ~1600 targets
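One way a linear classification model yields odds ratios is logistic regression, where exponentiating a coefficient gives the odds ratio for that covariate adjusted for the others. The slide does not name the model, so logistic regression is an assumption here, as are the plain gradient-descent fit, the function names and the invented toy cohort.

```python
import math


def fit_logistic(rows, labels, lr=1.0, steps=2000):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


def odds_ratios(weights):
    """exp(coefficient) is the adjusted odds ratio of each covariate."""
    return [math.exp(w) for w in weights]


def toy_cohort():
    """Invented counts: (covariate present, outcome, number of patients)."""
    rows, labels = [], []
    for x1, y, count in [(1, 1, 30), (1, 0, 20), (0, 1, 10), (0, 0, 40)]:
        for i in range(count):
            rows.append([x1, i % 2])  # second covariate is unrelated to the outcome
            labels.append(y)
    return rows, labels
```

In this toy cohort the first covariate raises the odds of the outcome and the second is noise, so the fitted odds ratios come out well above 1 and close to 1 respectively. A real pipeline over ~1600 targets would use a tested library and a proper p-value calculation.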
28. Example: Treatments for Congenital Hyperinsulinism
• A rare genetic disease
• Permanently excessive level of insulin in the blood
• Develops within the first few days of life
• Symptoms include floppiness, shakiness, poor feeding, seizures, fits and convulsions.
• If not caught quickly, it can lead to brain injury or even death.
• In the most severe cases the only viable treatment is the removal of the pancreas, consigning the patient to a lifetime of diabetes.
[Findacure] is a UK charity that is building the rare disease community to raise awareness, drive research and develop treatments.
[Elsevier] is partnering with Findacure scientists to help identify and evaluate treatments for this devastating disease.
29. Normalizing vocabularies required: proteins, diseases, drugs, chemicals
• Biological pathways extracted via semantic text mining: A upregulates B, B upregulates C, C increases disease (A → B → C → disease)
• Bioactivities through text analysis: IC50 6.3nM, kinase binding assay 10mM concentration
• Chemical structures and properties
• Vocabularies and identifiers: InChI, Name, NCBI, UniProt, EMTREE, ReaxysTree, Structures
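Once extracted relations are normalized to shared identifiers, a chain such as A upregulates B, B upregulates C, C increases disease can be walked backwards from the disease to collect everything upstream of it. The sketch below does this with a breadth-first search. The relation list and the function name are hypothetical, and relation types and signs are ignored for brevity.

```python
from collections import defaultdict, deque

# Hypothetical relations in the shape text mining might yield: (source, relation, target).
RELATIONS = [
    ("A", "upregulates", "B"),
    ("B", "upregulates", "C"),
    ("C", "increases", "Disease"),
    ("D", "upregulates", "E"),  # not connected to the disease
]


def upstream_entities(relations, disease):
    """Collect every entity with a directed path to the disease node."""
    incoming = defaultdict(set)
    for source, _, target in relations:
        incoming[target].add(source)
    found = set()
    queue = deque([disease])
    while queue:
        node = queue.popleft()
        for source in incoming[node]:
            if source not in found:
                found.add(source)
                queue.append(source)
    return found
```

Here `upstream_entities(RELATIONS, "Disease")` returns A, B and C but not D or E, which is why normalized vocabularies matter: the walk only works if "B" is the same identifier in every extracted relation.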
30. From pathways to treatments:
Biovia PipelinePilot implementation combines data sources
Automated analysis combines bioassay data with pathway data
Step 1: Find all targets that could be used to affect the disease state
Step 2: Query for each target to find the activities for each compound that are >6 log units
Step 3: Collate data by compound to summarize the targets/activities related to disease that the compound hits
• Compute geometric mean of activities for ranking
• Rank by number of targets and geometric mean of activities against targets
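The three steps can be approximated in plain Python as below. This is a sketch under stated assumptions, not the PipelinePilot protocol: the compounds, targets and activity values are invented, activities are taken to be already in log units (for example pIC50), and the geometric mean is computed directly on those values because the slide does not say which scale is used.

```python
from collections import defaultdict
from statistics import geometric_mean

# Step 1 output (hypothetical): targets that could affect the disease state.
DISEASE_TARGETS = {"T1", "T2", "T3"}

# Hypothetical bioassay rows: (compound, target, activity in log units).
ACTIVITIES = [
    ("cmpd_A", "T1", 7.2), ("cmpd_A", "T2", 6.5), ("cmpd_A", "T3", 5.1),
    ("cmpd_B", "T1", 8.0),
    ("cmpd_C", "T2", 6.1), ("cmpd_C", "T4", 9.0),
]


def rank_compounds(activities, disease_targets, threshold=6.0):
    """Return (compound, number of targets hit, geometric mean activity), best first."""
    hits = defaultdict(dict)
    # Step 2: keep activities above the threshold on disease-related targets.
    for compound, target, value in activities:
        if target in disease_targets and value > threshold:
            hits[compound][target] = max(value, hits[compound].get(target, value))
    # Step 3: collate per compound and summarize.
    summary = [
        (compound, len(per_target), geometric_mean(list(per_target.values())))
        for compound, per_target in hits.items()
    ]
    # Rank by number of targets first, then by geometric mean activity.
    return sorted(summary, key=lambda row: (-row[1], -row[2]))
```

With the invented data, cmpd_A ranks first because it hits two disease targets above the threshold, followed by cmpd_B and cmpd_C, which each hit one.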
31. From pathways to treatments:
Automated analysis combines bioassay data with pathway data
Step 1: Find all targets that could be used to affect the disease state
• 88 targets related to hyperinsulinism with ≥3 literature references
• Full PathwayStudio relationship information
• PathwayStudio also has all compounds suggested as treatments
32. Who is collaborating?
The collaboration analysis shows clinical centers specializing in CHI
• Filtered for institutions with > 4 publications and who collaborated with another institution.
• Size of circle proportional to total number of publications
• Line width proportional to the number of co-authored publications
• Lines labeled with DOIs
33. Who are the researchers in congenital hyperinsulinism?
• Filtered for authors with > 3 publications and who collaborated with another person.
• Size of circle proportional to total number of publications
• Line width proportional to the number of co-authored publications
• Lines labeled with DOIs
• Numbers for authors are Scopus IDs
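The data behind such a collaboration diagram can be derived from a list of publications as sketched below. The input layout, the function name, the DOIs and the author identifiers are all invented. The sketch also assumes that a collaboration only counts when both authors pass the publication filter, which the slide does not specify.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coauthor_graph(publications, min_pubs=3):
    """Build node sizes and labeled edges for a co-authorship diagram.

    publications: {doi: [author_id, ...]} (hypothetical layout)
    Returns (nodes, edges): nodes maps author -> publication count (circle size),
    edges maps (author, author) -> list of shared DOIs (line width = list length).
    """
    pub_counts = Counter(a for authors in publications.values() for a in set(authors))
    edges = defaultdict(list)
    for doi, authors in publications.items():
        kept = sorted(a for a in set(authors) if pub_counts[a] > min_pubs)
        for pair in combinations(kept, 2):
            edges[pair].append(doi)
    # Only authors with at least one remaining collaboration become nodes.
    nodes = {author: pub_counts[author] for pair in edges for author in pair}
    return nodes, dict(edges)


# Invented example: placeholder DOIs and author identifiers.
PUBLICATIONS = {
    "doi:example-1": ["a1", "a2", "a3"],
    "doi:example-2": ["a1", "a2"],
    "doi:example-3": ["a1", "a2"],
    "doi:example-4": ["a1"],
    "doi:example-5": ["a2"],
    "doi:example-6": ["a4"],
    "doi:example-7": ["a4"],
    "doi:example-8": ["a4"],
    "doi:example-9": ["a4"],
}
```

In this example only a1 and a2 survive the filter: a3 has too few publications and a4 never collaborates. The same function works for institutions by passing institution identifiers and `min_pubs=4`.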
37. Burger and Beans – weakly supervised/joint embeddings
Diagram: hypersphere of joint embeddings, showing an image vector, the correct text vector and an incorrect text vector
Engilberge, Martin, Louis Chevallier, Patrick Pérez and Matthieu Cord. “Finding beans in burgers: Deep semantic-visual embedding with localization.” CoRR abs/1804.01720 (2018)
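The picture of an image vector pulled towards its correct text vector and pushed away from an incorrect one corresponds to a ranking objective. One common form is a triplet loss on cosine similarity, sketched below. The margin value and function names are assumptions, and the slide does not state the exact loss used in the cited paper.

```python
import math


def normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine similarity, i.e. the dot product of the normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def triplet_loss(image, correct_text, incorrect_text, margin=0.2):
    """Zero once the correct text beats the incorrect one by at least the margin."""
    return max(0.0, margin - cosine(image, correct_text) + cosine(image, incorrect_text))
```

For instance, `triplet_loss([1, 0], [1, 0], [0, 1])` is zero because the correct caption already sits on top of the image vector, while swapping the two captions produces a positive loss that training would push down. The supervision is weak in the sense that only matching pairs are given; the negatives are simply other captions.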
39. Learning Knowledge Graph relations from images
Ruobing Xie, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2017. Image-embodied knowledge representation learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17), Carles Sierra (Ed.). AAAI Press, 3140-3146.
40. Combining Knowledge
Both, Fabian, Steffen Thoma, and Achim Rettinger. "Cross-modal Knowledge Transfer:
Improving the Word Embedding of Apple by Looking at Oranges." Proceedings of the
Knowledge Capture Conference. ACM, 2017.
41. Conclusion
• We should help researchers do more
• A move towards answers
• Answers come from many sources (text, data, images…)
• Embeddings as mechanism for integration
• Knowledge graphs help integration
42. Thank you
Paul Groth | @pgroth | p.groth@elsevier.com
43. Combining Knowledge Graphs with Embeddings
Gupta, N., Singh, S., & Roth, D. (2017). Entity linking via joint encoding of types,
descriptions, and context. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing (pp. 2681-2690).
Editor's Notes
Work with DANS
Reviewed 400 papers, with a deep dive into 114
We need to rely more on unsupervised than on supervised techniques. Burger and Beans is a weakly supervised approach, which lets us infer negatives by knowing what the positives are
Through word embeddings, one can also learn synonyms and the like
Concept similarity
Concatenation (conc), SVD and PCA are combinations