Over the past five years, we have seen multiple successes in developing knowledge graphs to support science in domains ranging from drug discovery to social science. However, to really improve scientific productivity, we need to expand and deepen our knowledge graphs. To do so, I believe we need to address two critical challenges: 1) dealing with low-resource domains; and 2) improving quality. In this talk, I describe these challenges in detail and discuss some efforts to overcome them through techniques such as unsupervised learning, the use of non-experts in expert domains, and the integration of action-oriented knowledge (i.e., experiments) into knowledge graphs.
The Challenge of Deeper Knowledge Graphs for Science
1. THE CHALLENGE OF DEEPER KNOWLEDGE GRAPHS FOR SCIENCE
PAUL GROTH | @PGROTH | PGROTH.COM
CONTRIBUTIONS: RON DANIEL, MICHAEL LAURUHN & @ELSEVIERLABS TEAM
3. Bloom, N., Jones, C. I., Van Reenen, J., & Webb, M. (2017). Are ideas getting harder to find? (No. w23782). National Bureau of Economic Research.
Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf
9. WHY?
IN PRACTICE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017).
Searching Data: A Review of Observational Data Retrieval Practices.
arXiv preprint arXiv:1707.06937.
Some observations from @gregory_km survey & interviews:
• The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971
10. THE ROLE OF METADATA IN THE SECOND MACHINE AGE – DC-2016 / KØBENHAVN / 13 OCTOBER
ANSWERS ARE ABOUT THINGS, NOT JUST WORKS
Why shouldn’t a search on an author return information
about the author, including the author’s works? Where
was the author born, when did she live, what is she
known for? … All of this is possible, but only if we can
make some fundamental changes in our approach to
bibliographic description. ... The challenge for us lies in
transforming what we can of our data into
interrelated “things” without overindulging that
metaphor.
Coyle, K. (2016). FRBR, before and after: a look at our bibliographical
models. Chicago: ALA Editions.
11. ENTER KNOWLEDGE GRAPHS
Ernst, Patrick, et al. "DeepLife: An Entity-aware Search, Analytics and Exploration Platform for Health and Life Sciences." Proceedings of ACL-2016 System Demonstrations (2016): 19-24.
15. SCIENTIFIC TEXT IS CHALLENGING
Augenstein, Isabelle, et al. "SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications." Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017.
16. UNSUPERVISED & DISTANT SUPERVISION
EXAMPLE: UNIVERSAL SCHEMAS AND REVERB
Groth et al., Applying Universal Schemas for Domain Specific Ontology Expansion http://www.akbc.ws/2016/papers/3_Paper.pdf
• Successful in predicting new triples (F1 ≈ 0.7)
• ReVerb’s relations very interesting, but recall very low
• Was not domain independent
• Matched arguments against a medical ontology to improve precision
• Predicted relations were restricted to relation types from the same ontology
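To make the F1 figure on this slide concrete, here is a minimal Python sketch of how precision, recall, and F1 can be computed for a set of predicted triples against a gold set. It is not the evaluation code from the cited paper; the function name `precision_recall_f1` and the toy triples are hypothetical and only illustrate the standard definitions.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def precision_recall_f1(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Score predicted triples against gold triples using exact matching."""
    true_positives = len(predicted & gold)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, not taken from the talk or the paper.
gold_triples = {
    ("drug_a", "treats", "disease_x"),
    ("drug_b", "treats", "disease_y"),
}
predicted_triples = {
    ("drug_a", "treats", "disease_x"),
    ("drug_b", "causes", "disease_y"),
}
scores = precision_recall_f1(predicted_triples, gold_triples)
```

Exact matching is the simplest choice; an evaluation that matches arguments against an ontology, as the slide describes, would normalise the subject and object before comparing.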
17. OPEN INFORMATION EXTRACTION IN SCIENCE IS HARD
Open Information Extraction on Scientific Text: An Evaluation.
Paul Groth, Mike Lauruhn, Antony Scerri and Ron Daniel, Jr. COLING 2018
Example:
“The patient was treated with Emtricitabine, Etravirine, and Darunavir”
‣ (The patient :: was treated with :: Emtricitabine, Etravirine, and Darunavir)
Another possible extraction is:
‣ (The patient :: was treated with :: Emtricitabine)
‣ (The patient :: was treated with :: Etravirine)
‣ (The patient :: was treated with :: Darunavir)
698 unique relation types – 400 relation types
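The two alternative extractions in the example above differ only in whether the coordinated object is kept whole or split into one triple per item. A minimal Python sketch of that splitting step follows. It assumes a simple comma/"and" pattern and is not how any particular Open IE system works; the function name `split_coordinated_object` is hypothetical.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Separators: a comma (optionally followed by "and"), or a standalone "and".
# This is a naive heuristic; real coordination handling needs a parser.
_SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def split_coordinated_object(triple: Triple) -> List[Triple]:
    """Turn one triple with a list-like object into one triple per list item."""
    subject, relation, obj = triple
    items = [part.strip() for part in _SEPARATOR.split(obj) if part.strip()]
    return [(subject, relation, item) for item in items]


combined = (
    "The patient",
    "was treated with",
    "Emtricitabine, Etravirine, and Darunavir",
)
separate = split_coordinated_object(combined)
```

Applied to the combined triple from the slide, this yields the three single-drug triples. Whether a system emits the combined or the split form changes how its output is counted in an evaluation, which is part of what makes comparing extractions hard.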
18. CROWDS ARE NOT EXPERTS
Use of Internal Testing Data to Help Determine Compensation for
Crowdsourcing Tasks
Michael Lauruhn, Paul Groth, Corey Harper, Helena Deus. HUML 2018
23. SOURCES AREN’T JUST DATA
Lauruhn, Michael, and Paul Groth. "Sources of
Change for Modern Knowledge Organization
Systems." Knowledge Organization 43, no. 8
(2016).
24. A MORE TRANSPARENT SUPPLY CHAIN
Groth, Paul, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013. doi: 10.1109/MIC.2013.41
33. RESEARCH QUESTIONS
1. Does basic lab-based biomedical research reuse and assemble existing methods, or is it primarily focused on the development of new techniques?
2. What existing methods are covered by robotic labs?
36. CONCLUSIONS
▸ Knowledge graphs are crucial for overcoming information overload in research
▸ Research has less redundancy than other domains
▸ fewer resources and high diversity
▸ challenge: effectively use general knowledge in these domains
▸ Quality is central
▸ turn towards processes and reproducibility as foundations
Editor's Notes
Work with DANS
Reviewed 400 papers; deep dive into 114
Cloud based labs provide remote access to frequently used experimental equipment
Able to support increasingly complex protocols
(e.g. transcriptic.com, Emerald Cloud Lab)