From Text to Data to the World: The Future of Knowledge Graphs

Paul Groth | pgroth.com | @pgroth
Thanks to: Matthew Clark, Frederik van den Broek, Anton Yuryev, Maria Shkrob, Sherri Matis-Mitchell,
Timothy Hoctor, Brad Allen, Corey Harper, Ron Daniel, Helena Deus, Olaf Lodbrok

June 15, 2018
• Research productivity
• Moving to answers – knowledge graphs
• Building knowledge graphs – from text
• Building knowledge graphs – from data
• Combining knowledge graphs

Bloom, N., Jones, C. I., Van Reenen, J., & Webb, M. (2017). Are ideas getting harder to find? (No. w23782). National Bureau of Economic Research.
Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf

IN PRACTICE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017).
Searching Data: A Review of Observational Data Retrieval Practices.
arXiv preprint arXiv:1707.06937.
Some observations from @gregory_km's survey & interviews:
• The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971

ELSEVIER’S BUSINESS: PROVIDING ANSWERS FOR
RESEARCHERS, DOCTORS AND NURSES
My work is moving towards a new field; what should I know?
• Journal articles, reference works, profiles of researchers, funders &
institutions
• Recommendations of people to connect with, reading lists, topic pages
How should I treat my patient given her condition & history?
• Journal articles, reference works, medical guidelines, electronic health
records
• Treatment plan with alternatives personalized for the patient
How can I master the subject matter of the course I am taking?
• Course syllabus, reference works, course objectives, student history
• Quiz plan based on the student’s history and course objectives

THE ROLE OF METADATA IN THE SECOND MACHINE AGE – DC-2016 / KØBENHAVN / 13 OCTOBER
ANSWERS ARE ABOUT THINGS, NOT JUST WORKS
Why shouldn’t a search on an author return
information about the author, including the author’s
works? Where was the author born, when did she live,
what is she known for? … All of this is possible, but
only if we can make some fundamental changes in our
approach to bibliographic description. ... The challenge
for us lies in transforming what we can of our data into
interrelated “things” without overindulging that
metaphor.
Coyle, K. (2016). FRBR, before and after: a look at our bibliographical
models. Chicago: ALA Editions.

KNOWLEDGE GRAPHS DEFINED
• Knowledge graphs are “graph structured knowledge bases (KBs) which store factual information in form of relationships between entities” (Nickel, M., Murphy, K., Tresp, V. and Gabrilovich, E. (2015). A review of relational machine learning for knowledge graphs. arXiv:1503.00759v3)
• Knowledge graphs are metadata evolved beyond the focus on the work, linking people, concepts, things and events
• Knowledge graphs are focused on things to provide answers
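To make the definition concrete, here is a minimal sketch of a knowledge graph stored as (subject, predicate, object) relationships. The class name, method names and predicate labels are hypothetical choices for illustration; the three facts are taken from the Coyle citation on the previous slide.

```python
from collections import defaultdict


class TinyKnowledgeGraph:
    """Stores facts as (subject, predicate, object) relationships between entities."""

    def __init__(self):
        self._outgoing = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._outgoing[subject].append((predicate, obj))

    def about(self, entity):
        """Return every (predicate, object) fact recorded for an entity."""
        return list(self._outgoing.get(entity, []))


kg = TinyKnowledgeGraph()
# Predicate names below are illustrative, not a standard vocabulary.
kg.add("Coyle, K.", "author_of", "FRBR, before and after")
kg.add("FRBR, before and after", "published_by", "ALA Editions")
kg.add("FRBR, before and after", "publication_year", 2016)

print(kg.about("Coyle, K."))
print(kg.about("FRBR, before and after"))
```

Asking about the author returns the author's works, and asking about the work returns its publisher and year, which is the "things, not just works" behavior the previous slide argues for.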

The Success of Knowledge Graphs

Knowledge Graphs at Elsevier

BUILDING KNOWLEDGE GRAPHS FROM TEXT

ONTOLOGY MAINTENANCE
• Total concepts = 540,632
• 100+ person years of clinical expert knowledge

One Weird Trick from Natural Language Processing (NLP)
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems look for very specific things, like drug-drug interactions
• That gives the best accuracy for that one kind of data, but misses all the other concepts and relations in the text
• For a broad knowledge base, use Open Information Extraction, which uses only some knowledge of grammar
• The weird trick for open information extraction … a simple algorithm, known as ReVerb*:
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
Example sentence: In addition, brain scans were performed to exclude other causes of dementia.
* Fader et al. Identifying Relations for Open Information Extraction
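The three steps above can be sketched roughly as follows. This is a simplified illustration, not the ReVerb implementation: it assumes the sentence is already tagged with Penn Treebank part-of-speech tags (hand-tagged here), it allows only verbs, adverbs, particles and prepositions inside a relation phrase, and the threshold `min_pairs` and all function names are hypothetical.

```python
from collections import defaultdict

VERBS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
PREPS = {"IN", "TO"}
LINKS = VERBS | PREPS | {"RB", "RP"}   # tags allowed inside a relation phrase
NOUNS = {"NN", "NNS", "NNP", "NNPS"}
NP_TAGS = NOUNS | {"DT", "JJ"}         # tags allowed inside a noun phrase


def relation_spans(tagged):
    """Step 1: spans that start with a verb and end with a verb or preposition."""
    spans, i = [], 0
    while i < len(tagged):
        if tagged[i][1] not in VERBS:
            i += 1
            continue
        j = i
        while j < len(tagged) and tagged[j][1] in LINKS:
            j += 1
        while tagged[j - 1][1] not in VERBS | PREPS:  # trim trailing adverbs/particles
            j -= 1
        spans.append((i, j))
        i = j
    return spans


def noun_phrase(tagged, start, step):
    """Step 2: collect the contiguous noun phrase beginning at `start`."""
    words, k = [], start
    while 0 <= k < len(tagged) and tagged[k][1] in NP_TAGS:
        words.append(tagged[k])
        k += step
    if step < 0:
        words.reverse()
    if not words or words[-1][1] not in NOUNS:
        return None
    return " ".join(word for word, _ in words)


def extract(tagged):
    """Return (argument 1, relation phrase, argument 2) triples for one sentence."""
    triples = []
    for i, j in relation_spans(tagged):
        left, right = noun_phrase(tagged, i - 1, -1), noun_phrase(tagged, j, +1)
        if left and right:
            triples.append((left, " ".join(w for w, _ in tagged[i:j]), right))
    return triples


def filter_relations(triples, min_pairs=2):
    """Step 3: keep relation phrases seen with at least `min_pairs` argument pairs."""
    pairs = defaultdict(set)
    for arg1, relation, arg2 in triples:
        pairs[relation].add((arg1, arg2))
    return [t for t in triples if len(pairs[t[1]]) >= min_pairs]


sentence = [("In", "IN"), ("addition", "NN"), (",", ","), ("brain", "NN"),
            ("scans", "NNS"), ("were", "VBD"), ("performed", "VBN"), ("to", "TO"),
            ("exclude", "VB"), ("other", "JJ"), ("causes", "NNS"), ("of", "IN"),
            ("dementia", "NN"), (".", ".")]
print(extract(sentence))
```

On the example sentence this sketch yields the relation phrase "were performed to exclude" between "brain scans" and "other causes". Step 3 only becomes meaningful over a large corpus, where it removes relation phrases that are too specific to recur.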

ReVerb output
• SD documents scanned: 14,000,000
• Extracted ReVerb triples: 473,350,566

ONTOLOGY MAINTENANCE
[Figure: pipeline from Content to a curated Knowledge graph. Stages shown: Content, Open Information Extraction (triple extraction), Entity Resolution (concept resolution), Matrix Construction (universal schema combining surface form relations with structured relations from the Taxonomy), Matrix Factorization (factorization model), Matrix Completion (predicted relations), Curation. Counts shown along the pipeline: 14M SD articles, 475 M triples, 3.3 million relations, 49 M relations, ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
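The matrix construction, factorization and completion stages named on this slide can be illustrated with a toy universal-schema matrix: rows are entity pairs, columns are relations (surface forms from text next to structured relations from a taxonomy), and a low-rank logistic factorization scores cells that were never observed. This is a minimal sketch under stated assumptions, not the pipeline from the cited paper: the data, dimensions, learning rate and function names are all made up, and every unobserved training cell is simply treated as a negative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_universal_schema(n_pairs, n_relations, positives, held_out=(),
                           dim=3, epochs=300, lr=0.1, seed=0):
    """Learn one vector per entity pair (row) and per relation (column)."""
    rng = random.Random(seed)
    pair_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_pairs)]
    rel_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_relations)]
    # Held-out cells are hidden from training so that they can be predicted later.
    cells = [(p, r) for p in range(n_pairs) for r in range(n_relations)
             if (p, r) not in held_out]
    for _ in range(epochs):
        for p, r in cells:
            label = 1.0 if (p, r) in positives else 0.0
            error = sigmoid(dot(pair_vecs[p], rel_vecs[r])) - label
            for k in range(dim):
                pk, rk = pair_vecs[p][k], rel_vecs[r][k]
                pair_vecs[p][k] -= lr * error * rk
                rel_vecs[r][k] -= lr * error * pk
    return pair_vecs, rel_vecs


def predict(pair_vecs, rel_vecs, pair, relation):
    """Score in [0, 1] that the relation holds for the entity pair."""
    return sigmoid(dot(pair_vecs[pair], rel_vecs[relation]))


# Hypothetical columns: 0 and 1 are surface forms, 2 is a structured relation.
relations = ["is used to treat", "is effective against", "has_treatment"]
positives = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1)}
pair_vecs, rel_vecs = train_universal_schema(4, 3, positives, held_out={(2, 2)})
print(predict(pair_vecs, rel_vecs, 2, 2))  # score for the hidden structured cell
print(predict(pair_vecs, rel_vecs, 3, 2))  # score for a pair with no evidence
```

High-scoring hidden cells play the role of the predicted relations that would then go to curation.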

CHALLENGES
Paul Groth, Michael Lauruhn, Antony Scerri, Ron Daniel: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574 (http://arxiv.org/abs/1802.05574). To appear in COLING 2018
698 unique relations
400 sentences

BUILDING KNOWLEDGE GRAPHS FROM DATA

Medical Graph – Statistical correlations at scale
[Figure: graph of ICD-coded diagnoses linked by has_successor edges. Nodes shown: I65 (Occlusion and stenosis of precerebral arteries), G40 (Epilepsy), I61 (intracerebral hemorrhage), C71 (Malignant neoplasm of brain). One edge is annotated "odds ratio: 1.12".]
has_successor criteria¹:
• Correlation selected by predictive modeling algorithm
• No. of relations is higher than in mirrored relation
• p-value < 0.05
• Odds ratios balanced over all covariates
Data sources shown: primary care, secondary care, drug prescriptions, other covariates. 5m patients, each with 6 years of longitudinal data.
¹ Criteria based on: Jensen et al.: Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 2014 Jun 24; 5:4022. doi: 10.1038/ncomms5022.
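The has_successor criteria above can be sketched as a check on a 2x2 contingency table plus the direction counts. This is a simplification for illustration: it uses an unadjusted odds ratio and a chi-square test, whereas the slide describes odds ratios from a predictive model balanced over all covariates. The counts, the extra requirement that the odds ratio exceed 1, and the function names are assumptions.

```python
import math


def odds_ratio(a, b, c, d):
    """a: first diagnosis and later second; b: first only; c: second only; d: neither."""
    return (a * d) / (b * c)


def chi_square_p_value(a, b, c, d):
    """p-value of the 2x2 chi-square test (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_successor(table, n_forward, n_mirrored, alpha=0.05):
    """Keep the edge A -> B only if it is significant and more common than B -> A."""
    return (odds_ratio(*table) > 1.0              # assumed: only positive associations
            and chi_square_p_value(*table) < alpha
            and n_forward > n_mirrored)


# Hypothetical counts for one pair of diagnoses.
table = (150, 850, 1000, 9000)
print(round(odds_ratio(*table), 2))
print(has_successor(table, n_forward=150, n_mirrored=90))
```

The direction test is what turns a symmetric correlation into a directed edge: the pair is kept only in the order that occurs more often in the patient histories.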

Medical Graph in practice, patient 35: risk of depression
• 49 year old man
• Dx: overweight, diabetes, hypertension, anxiety disorder
→ has an absolute risk of 36% of developing depression within the next 4 years

… and the rationale for why the model thinks this

Predict 4-year long-term effects, balanced for all covariates
• Targets for prediction: ICD-coded diagnoses
• Only incident patients per diagnosis are considered, i.e. diagnosis-free in 2009–2010
• If these patients remain diagnosis-free in 2011–2014 (observation period), the target is coded 0, else 1
• Covariates: all ICD/ATC codes, age and sex, measured in 2010
Example: Model to predict "I50 – Heart Failure"
[Timeline: I50-free patients in 2009–2010, with covariates measured in 2010. Over 2011–2014, patients remaining I50-free are coded as 0 (I50 −) and newly I50-diagnosed patients are coded as 1 (I50 +).]
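The target definition above can be sketched as a small labeling function. The data layout (a dictionary of years to code sets per patient), the patient IDs and the function name are assumptions for illustration; the year windows come from the slide.

```python
def build_labels(histories, code, baseline=(2009, 2010), followup=(2011, 2014)):
    """histories: {patient_id: {year: set of ICD codes recorded that year}}."""

    def coded_in(history, first, last):
        return any(code in history.get(year, set()) for year in range(first, last + 1))

    labels = {}
    for patient_id, history in histories.items():
        if coded_in(history, *baseline):
            continue  # not incident: already diagnosed during the baseline window
        labels[patient_id] = 1 if coded_in(history, *followup) else 0
    return labels


# Hypothetical patient histories.
histories = {
    "p1": {2009: {"I50"}, 2012: {"I50"}},  # excluded: I50 already present in 2009
    "p2": {2010: {"E11"}, 2013: {"I50"}},  # newly diagnosed in follow-up -> 1
    "p3": {2010: {"I10"}, 2012: {"I10"}},  # stays I50-free -> 0
}
print(build_labels(histories, "I50"))
```

Excluding patients who already carry the diagnosis in the baseline window is what makes the model predict new onset rather than existing disease.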

Calculate statistics & build prediction models for ~1600 targets
Technology stack
Feature extraction: for 3.8m patients:
• age, gender
• all diagnoses: ICD10-coded, 3 digits, i.e. 2054 codes
• all medications: ATC-coded, 5 digits, i.e. 906 codes
• death, hospitalization
Results in 6277 features:
• 1623 targets, 2011-2014
• 2320 covariates, 2010
• 2334 filter-columns, 2009-2010
Data mining: calculate prevalence, incidence, mean age for all covariates (i.e. diseases and medications)
Machine learning: predictive modelling for ~1600 targets
• Linear classification model, resulting in odds ratios
• Calculation of p-values
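One common way a linear classification model yields odds ratios is logistic regression, where the exponential of each coefficient is the odds ratio for that covariate adjusted for the others. The slide does not name the exact model, so the sketch below is an assumption: plain gradient descent on two hypothetical binary covariates, with made-up patient counts.

```python
import math


def fit_logistic(cells, epochs=1500, lr=0.5):
    """cells: list of (covariates, outcome, patient_count). Returns [intercept, w1, ...]."""
    n_features = len(cells[0][0])
    total = sum(count for _, _, count in cells)
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y, count in cells:
            z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            grads[0] += count * error
            for k, xi in enumerate(x, start=1):
                grads[k] += count * error * xi
        weights = [w - lr * g / total for w, g in zip(weights, grads)]
    return weights


def odds_ratios(weights):
    """Adjusted odds ratio per covariate: exp of its coefficient (intercept skipped)."""
    return [math.exp(w) for w in weights[1:]]


# Hypothetical cohort: (has_diabetes, has_hypertension), outcome, number of patients.
cells = [
    ((1, 0), 1, 30), ((1, 0), 0, 70),
    ((0, 0), 1, 10), ((0, 0), 0, 90),
    ((1, 1), 1, 40), ((1, 1), 0, 60),
    ((0, 1), 1, 15), ((0, 1), 0, 85),
]
print([round(ratio, 2) for ratio in odds_ratios(fit_logistic(cells))])
```

Because both covariates enter the same model, each odds ratio reflects the effect of one covariate with the other held fixed, which is one reading of "balanced for all covariates" on the earlier slide.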

Example: Treatments for Congenital Hyperinsulinism
• A rare genetic disease
• Permanently excessive level of insulin in the blood
• Develops within the first few days of life
• Symptoms include floppiness, shakiness, poor feeding, seizures, fits and convulsions.
• If not caught quickly, it can lead to brain injury or even death.
• In the most severe cases the only viable treatment is the removal of the pancreas, consigning the patient to a lifetime of diabetes.
Findacure is a UK charity that is building the rare disease community to raise awareness, drive research and develop treatments.
[Partner logo] is partnering with Findacure scientists to help identify and evaluate treatments for this devastating disease.

Normalizing vocabularies required: proteins, diseases, drugs, chemicals
• Biological pathways extracted via semantic text mining: A upregulates B, B upregulates C, C increases Disease (A → B → C → disease)
• Bioactivities through text analysis: IC50 6.3nM, kinase binding assay 10mM concentration
• Chemical structures and properties
• Vocabularies and identifiers shown: InChI, Name, NCBI, UniProt, EMTREE, ReaxysTree, Structures
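The chain A → B → C → disease suggests a simple graph traversal: starting at the disease, walk the extracted relations backwards to collect everything upstream. The sketch below assumes the extracted facts are available as (subject, relation, object) triples; the relation names follow the slide, while the function name and the extra "binds" triple are hypothetical.

```python
from collections import defaultdict, deque


def upstream_entities(triples, disease, relations=("upregulates", "increases")):
    """Return every entity that can reach `disease` through the given relations."""
    incoming = defaultdict(set)
    for subject, relation, obj in triples:
        if relation in relations:
            incoming[obj].add(subject)

    found, queue = set(), deque([disease])
    while queue:
        node = queue.popleft()
        for source in incoming[node]:
            if source != disease and source not in found:
                found.add(source)
                queue.append(source)
    return found


triples = [
    ("A", "upregulates", "B"),
    ("B", "upregulates", "C"),
    ("C", "increases", "Disease"),
    ("D", "binds", "E"),  # unrelated fact, ignored by the traversal
]
print(sorted(upstream_entities(triples, "Disease")))
```

This only follows activating relations. A fuller treatment would track the sign of each edge so that inhibitors and activators along a path are combined correctly.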

From pathways to treatments:
Biovia PipelinePilot implementation combines data sources
Automated analysis combines bioassay data with pathway data
Step 1: Find all targets that could be used to affect the disease state
Step 2: Query for each target to find the activities for each compound that are >6 log units
Step 3: Collate data by compound to summarize the targets/activities related to disease that the compound hits
• Compute geometric mean of activities for ranking
• Rank by number of targets and geometric mean of activities against targets
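Steps 2 and 3 can be sketched in a few lines. This is not the Pipeline Pilot protocol itself: it assumes activities are already expressed in log units (for example pIC50) as (compound, target, value) records, and the compound names, target names and function name are hypothetical.

```python
from collections import defaultdict
from statistics import geometric_mean


def rank_compounds(activities, disease_targets, threshold=6.0):
    """Return (compound, number of targets hit, geometric mean activity), best first."""
    hits = defaultdict(list)
    for compound, target, value in activities:
        # Step 2: keep activities above the threshold against disease-related targets.
        if target in disease_targets and value > threshold:
            hits[compound].append((target, value))

    # Step 3: collate by compound, then rank by target count and mean activity.
    summary = []
    for compound, rows in hits.items():
        n_targets = len({target for target, _ in rows})
        mean_activity = geometric_mean([value for _, value in rows])
        summary.append((compound, n_targets, mean_activity))
    return sorted(summary, key=lambda row: (-row[1], -row[2]))


disease_targets = {"T1", "T2", "T3"}
activities = [
    ("cmpdA", "T1", 7.2), ("cmpdA", "T2", 6.5),
    ("cmpdB", "T1", 8.1),
    ("cmpdB", "T9", 9.0),  # potent, but T9 is not a disease target
    ("cmpdC", "T3", 5.4),  # below the 6 log unit threshold
]
for row in rank_compounds(activities, disease_targets):
    print(row)
```

Ranking first by the number of targets favors compounds that hit several points in the disease pathway over compounds that are very potent against a single target.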

From pathways to treatments:
Automated analysis combines bioassay data with pathway data
Step 1: Find all targets that could be used to affect the disease state
• 88 targets related to hyperinsulinism with ≥3 literature references
• Full PathwayStudio relationship information
• PathwayStudio also has all compounds suggested as treatments

Who is collaborating?
The collaboration analysis shows clinical centers specializing in CHI
• Filtered for institutions with > 4 publications that collaborated with another institution
• Size of circle proportional to total number of publications
• Line width proportional to the number of co-authored publications
• Lines labeled with DOIs

Who are the researchers in congenital hyperinsulinism?
• Filtered for authors with > 3 publications who collaborated with another person
• Size of circle proportional to total number of publications
• Line width proportional to the number of co-authored publications
• Lines labeled with DOIs
• Numbers for authors are Scopus IDs
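The filtering and sizing rules on this slide map onto a small co-authorship network builder. The sketch assumes publications are available as (DOI, author IDs) records; the DOIs, author IDs and function name below are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coauthor_network(papers, min_pubs=3):
    """Return (nodes, edges): publication counts per author and shared DOIs per pair."""
    pub_count = Counter(author for _, authors in papers for author in set(authors))
    prolific = {author for author, n in pub_count.items() if n > min_pubs}

    edges = defaultdict(list)
    for doi, authors in papers:
        for a, b in combinations(sorted(set(authors) & prolific), 2):
            edges[(a, b)].append(doi)  # edge label: the DOIs of co-authored papers

    # Keep only authors who collaborated with at least one other retained author.
    nodes = {author: pub_count[author] for pair in edges for author in pair}
    return nodes, dict(edges)


# Hypothetical publications: (DOI, list of author IDs).
papers = [
    ("10.1000/a", ["1001", "1002"]),
    ("10.1000/b", ["1001", "1002", "1003"]),
    ("10.1000/c", ["1001"]),
    ("10.1000/d", ["1001"]),
    ("10.1000/e", ["1002"]),
    ("10.1000/f", ["1002"]),
]
nodes, edges = coauthor_network(papers)
print(nodes)  # circle size: total publications per author
print(edges)  # line width: number of DOIs on each edge
```

The same function covers the institution view on the previous slide if institution identifiers are passed in place of author IDs and `min_pubs` is set to 4.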

Embeddings & Link Prediction
Pierre-Yves Vandenbussche (@pyvandenbussche)
Translating Embeddings (TransE)
http://pyvandenbussche.info/2017/translating-embeddings-transe/
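As a brief illustration of the idea behind TransE: each entity and relation gets a vector, a true triple (head, relation, tail) should satisfy head + relation ≈ tail, and link prediction ranks candidate tails by how small the distance ||head + relation − tail|| is. The sketch below shows only the scoring and ranking step. The two-dimensional embeddings are set by hand rather than learned, and the entity and relation names are placeholders.

```python
import math


def transe_distance(head, relation, tail):
    """L2 distance between (head + relation) and tail; smaller means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head_name, relation_name, entities, relations):
    """Rank every other entity as a candidate tail, most plausible first."""
    head, relation = entities[head_name], relations[relation_name]
    scored = [(transe_distance(head, relation, vector), name)
              for name, vector in entities.items() if name != head_name]
    return [name for _, name in sorted(scored)]


# Hand-set toy embeddings (in practice these are learned from the graph).
entities = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (2.0, 0.0)}
relations = {"upregulates": (1.0, 0.0)}

print(rank_tails("A", "upregulates", entities, relations))
print(rank_tails("B", "upregulates", entities, relations))
```

With these vectors, B is the closest tail for (A, upregulates, ?) and C is the closest for (B, upregulates, ?), mirroring the A → B → C chain used earlier in the deck.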

Burger and Beans – weakly supervised/joint embeddings
[Figure: hypersphere of joint embeddings, showing an image vector, the correct text vector and an incorrect text vector]
Engilberge, Martin, Louis Chevallier, Patrick Pérez and Matthieu Cord. “Finding beans in burgers:
Deep semantic-visual embedding with localization.” CoRR abs/1804.01720 (2018)

Burger and Beans Architecture

Learning Knowledge Graph relations from images
Ruobing Xie, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2017. Image-embodied knowledge representation learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17), Carles Sierra (Ed.). AAAI Press, 3140-3146.

Combining Knowledge
Both, Fabian, Steffen Thoma, and Achim Rettinger. "Cross-modal Knowledge Transfer:
Improving the Word Embedding of Apple by Looking at Oranges." Proceedings of the
Knowledge Capture Conference. ACM, 2017.

Conclusion
• We should help researchers do more
• A move towards answers
• Answers come from many sources (text, data, images…)
• Embeddings as a mechanism for integration
• Knowledge graphs help integration

Thank you
Paul Groth | @pgroth | p.groth@elsevier.com

Combining Knowledge Graphs with Embeddings
Gupta, N., Singh, S., & Roth, D. (2017). Entity linking via joint encoding of types,
descriptions, and context. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing (pp. 2681-2690).
