End-to-End Learning for Answering Structured Queries Directly over Text
1. Faculty of Science
DL4KG – ESWC 2019
June 2, 2019
Paul Groth (@pgroth), Antony Scerri, Ron Daniel, Jr., Bradley P. Allen
@INDE_LAB_AMS @ElsevierLabs
2. Faculty of Science
Information Needs
“An information need is the topic about which the user desires to know more” – Manning
3. Faculty of Science
Data as an information need
Researchers across communities need a diversity of
observational data, requiring data of different types, from
different sources and disciplines, and often collected at
different scales.
Integrating diverse data is a challenge.
Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (2019). Searching data: A review
of observational data retrieval practices in selected disciplines. Journal of the Association for
Information Science and Technology. https://doi.org/10.1002/asi.24165
4. Faculty of Science
Data search – is it just a regular search engine?
Survey of Research Challenges:
Adriane Chapman, Elena Simperl, Laura Koesten,
George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez,
Emilia Kacprzak, Paul Groth (Jan 2019) "Dataset
search: a survey" https://arxiv.org/abs/1901.00735
5. Faculty of Science
Constructive Data Search
SmartTable: A Spreadsheet Program with Intelligent Assistance, S. Zhang,
V. A. Zada, and K. Balog. In: 41st International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR ’18), July 2018.
6. Faculty of Science
Integration of Data Into Workflows
Chichester, Christine, Daniela Digles, Ronald Siebes, Antonis Loizou, Paul Groth, and
Lee Harland. "Drug discovery FAQs: workflows for answering multidomain drug
discovery questions." Drug discovery today 20, no. 4 (2015): 399-405.
9. Faculty of Science
FIRST: BUILD A KNOWLEDGE GRAPH
[Figure: knowledge graph construction pipeline. Stage labels: Content, Open Information Extraction, Triple Extraction, Entity Resolution, Concept Resolution, Matrix Construction, Matrix Factorization, Matrix Completion, Curation, Knowledge graph. Universal schema labels: Surface form relations, Structured relations, Factorization model, Predicted relations, Taxonomy. Counts shown: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
10. Faculty of Science
Text Databases
Schneider, Rudolf, et al. "Interactive Relation Extraction in Main Memory Database
Systems." Proceedings of COLING 2016, the 26th International Conference on
Computational Linguistics: System Demonstrations. 2016.
15. Faculty of Science
Now we only need to answer slot filling queries
WikiReading: A Novel Large-scale Language Understanding Task over
Wikipedia, Hewlett et al., ACL 2016
Constructing Datasets for Multi-hop Reading Comprehension Across
Documents, Johannes Welbl, Pontus Stenetorp, Sebastian Riedel,
Transactions of the Association for Computational Linguistics, 2018
16. Faculty of Science
Off the shelf QA architectures
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural qa as simple as possible but
not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning
(CoNLL 2017), pages 271–280, 2017.
Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktäschel, Matko
Bošnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, Sebastian Riedel,
Dirk Weissenborn, Pasquale Minervini. Jack the Reader – A Machine
Reading Framework. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (ACL) System Demonstrations,
July 2018. URL https://arxiv.org/abs/1806.08727
Jack the Reader – framework for machine reading
https://github.com/uclmr/jack
FastQA – state of the art baseline neural architecture
JackQA – architecture from framework
17. Faculty of Science
Training data
Question:
lexicalize(?city wdt:P131 wd:Q55) => located in the administrative territorial entity of Netherlands
Input text:
“Amsterdam is the capital city and most populous municipality of the Netherlands. ….”
Answer span:
Amsterdam [0,9]
1150 predicates in Wikidata that link entities
Filter:
- Subject must have a Wikipedia page
- > 30 examples
- Answer must be in the text
Result: 572 predicates, ~300 examples per predicate
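The construction of one training example can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the lookup tables, the function names (`lexicalize`, `find_answer_span`, `build_example`, `filter_predicates`) and the exact lexicalization template are assumptions made for the example. Only the Amsterdam example, the "answer must be in the text" filter and the "> 30 examples" filter come from the slide.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical label tables; real labels would be looked up in Wikidata.
PREDICATE_LABELS = {"P131": "located in the administrative territorial entity"}
ENTITY_LABELS = {"Q55": "Netherlands"}


def lexicalize(predicate_id: str, object_id: str) -> str:
    """Turn a triple pattern (?x predicate object) into a textual question.

    Assumed template: "<predicate label> of <object label>".
    """
    return f"{PREDICATE_LABELS[predicate_id]} of {ENTITY_LABELS[object_id]}"


def find_answer_span(text: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first match, end exclusive."""
    start = text.find(answer)
    if start == -1:
        return None
    return (start, start + len(answer))


def build_example(predicate_id: str, object_id: str,
                  text: str, answer: str) -> Optional[dict]:
    """Build one QA example, or None if the answer is not in the text."""
    span = find_answer_span(text, answer)
    if span is None:
        return None  # filter: answer must be in the text
    return {
        "question": lexicalize(predicate_id, object_id),
        "text": text,
        "answer": answer,
        "span": span,
    }


def filter_predicates(examples_by_predicate: Dict[str, List[dict]],
                      min_examples: int = 30) -> Dict[str, List[dict]]:
    """Keep only predicates with more than min_examples examples."""
    return {p: ex for p, ex in examples_by_predicate.items()
            if len(ex) > min_examples}


text = ("Amsterdam is the capital city and most populous "
        "municipality of the Netherlands.")
example = build_example("P131", "Q55", text, "Amsterdam")
print(example["question"], example["span"])
```

Here the span is reported as character offsets with an exclusive end, which reproduces the [0,9] shown for "Amsterdam"; whether the original dataset uses character or token offsets is not stated on the slide.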
18. Faculty of Science
Training
- Train a model per predicate
- 2/3 training, 1/3 test
- Windowing scheme over the text of articles
- EC2 p2.xlarge
- 1 virtual GPU - NVIDIA K80, 4 virtual CPUs, 61 GiB RAM
- FastQA – 23 hours training time
- JackQA – 81 hours training time
- Restarts with decreased batch sizes if model training failed
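The 2/3 to 1/3 split and a windowing scheme over article text might look roughly like the sketch below. The slide does not describe the windowing scheme, so the fixed-size overlapping token windows, the `size` and `stride` parameters, and the function names are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_examples(examples: Sequence, train_fraction: float = 2 / 3,
                   seed: int = 0) -> Tuple[list, list]:
    """Shuffle the examples of one predicate and split into train and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def windows(tokens: Sequence[str], size: int,
            stride: int) -> List[Tuple[int, list]]:
    """Slide a fixed-size window over tokens (assumed scheme).

    Returns (start_offset, window_tokens) pairs; the last window is the
    first one that reaches the end of the token sequence.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    out = []
    start = 0
    while True:
        out.append((start, list(tokens[start:start + size])))
        if start + size >= len(tokens):
            break
        start += stride
    return out


# Roughly 300 examples per predicate, as on the training data slide.
train, test = split_examples(range(300))
print(len(train), len(test))

tokens = "Amsterdam is the capital city and most populous municipality".split()
for start, window in windows(tokens, size=4, stride=2):
    print(start, window)
```

Keeping the start offset with each window lets an answer span found inside a window be mapped back to its position in the full article.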
24. Faculty of Science
Where to go
- Joint model
- Model architecture tuned to the task
- Performance on complex queries
- Accuracy
- Speed
- Other datasets
- When to use what approach
- …
25. Faculty of Science
Conclusion
• Structured queries are important!
• Can we do it on text? Looks like it … kind of
• Text as the KB – McCallum
• Interested in this kind of stuff?
• We’re hiring!
Questions?
Paul Groth | @pgroth | pgroth.com
indelab.org