End-to-End Learning for Answering Structured Queries Directly over Text

Faculty of Science
DL4KG – ESWC 2019
June 2, 2019
Paul Groth (@pgroth), Antony Scerri, Ron Daniel, Jr., Bradley P. Allen
@INDE_LAB_AMS @ElsevierLabs

Information Needs
“An information need is the topic about which the user desires to know more” – Manning

Data as an information need
• Researchers across communities need a diversity of observational data, requiring data of different types, from different sources and disciplines, and often collected at different scales.
• Integrating diverse data is a challenge.
Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (2019). Searching data: A review
of observational data retrieval practices in selected disciplines. Journal of the Association for
Information Science and Technology. https://doi.org/10.1002/asi.24165

Data search – is it just a regular search engine?
Survey of Research Challenges:
Adriane Chapman, Elena Simperl, Laura Koesten,
George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez,
Emilia Kacprzak, Paul Groth (Jan 2019) "Dataset
search: a survey" https://arxiv.org/abs/1901.00735

Constructive Data Search
SmartTable: A Spreadsheet Program with Intelligent Assistance, S. Zhang,
V. A. Zada, and K. Balog. In: 41st International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR ’18), July 2018.

Integration of Data Into Workflows
Chichester, Christine, Daniela Digles, Ronald Siebes, Antonis Loizou, Paul Groth, and
Lee Harland. "Drug discovery FAQs: workflows for answering multidomain drug
discovery questions." Drug discovery today 20, no. 4 (2015): 399-405.

Run structured queries

https://kgtutorial.github.io
FIRST: BUILD A KNOWLEDGE GRAPH

FIRST: BUILD A KNOWLEDGE GRAPH
[Figure: knowledge graph construction pipeline. Labels in the diagram: Content; Open Information Extraction; Triple Extraction; Entity Resolution; Concept Resolution; Matrix Construction; Matrix Factorization; Matrix Completion; Curation; Knowledge graph; Universal schema; Surface form relations; Structured relations; Factorization model; Predicted relations; Taxonomy. Counts shown: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).

Text Databases
Schneider, Rudolf, et al. "Interactive Relation Extraction in Main Memory Database
Systems." Proceedings of COLING 2016, the 26th International Conference on
Computational Linguistics: System Demonstrations. 2016.

Can you skip all that?

Machine Comprehension + Question Answering Tasks
https://nlp.stanford.edu/software/sempre/wikitable/

What if we have a parallel corpus?

Triple Pattern Fragments
http://linkeddatafragments.org/concept/
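
A rough sketch of the triple pattern idea, not the Linked Data Fragments server or client API: a triple pattern fixes some of subject, predicate and object and leaves the rest as variables, and a larger query is answered by evaluating one pattern at a time and joining the bindings. The toy triples, the plain-string identifiers and the function names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]  # terms starting with "?" are variables

# Hypothetical toy data; a real fragment would be served from a dataset.
TRIPLES: List[Triple] = [
    ("Amsterdam", "located_in", "Netherlands"),
    ("Utrecht", "located_in", "Netherlands"),
    ("Amsterdam", "instance_of", "city"),
    ("Netherlands", "instance_of", "country"),
]


def is_var(term: str) -> bool:
    return term.startswith("?")


def match(pattern: Pattern, triple: Triple,
          bindings: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend bindings so that pattern matches triple, or return None."""
    out = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A variable must keep the value it was already bound to.
            if out.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return out


def evaluate(patterns: List[Pattern],
             triples: List[Triple] = TRIPLES) -> List[Dict[str, str]]:
    """Answer a conjunction of triple patterns with a nested-loop join."""
    solutions: List[Dict[str, str]] = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


cities_in_nl = evaluate([
    ("?x", "located_in", "Netherlands"),
    ("?x", "instance_of", "city"),
])
```

Each single pattern with one free position is the kind of lookup that the following slides treat as a slot filling query.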

Now we only need to answer slot filling queries
WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia, Hewlett et al., ACL 2016
Constructing Datasets for Multi-hop Reading Comprehension Across Documents, Johannes Welbl, Pontus Stenetorp, Sebastian Riedel, Transactions of the Association for Computational Linguistics 2018

Off the shelf QA architectures
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making Neural QA as Simple as Possible but not Simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, 2017.
Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktaschel, Matko Bosnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, Sebastian Riedel, Dirk Weissenborn, Pasquale Minervini. Jack the Reader – A Machine Reading Framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) System Demonstrations, July 2018. URL https://arxiv.org/abs/1806.08727
Jack the Reader – framework for machine reading
https://github.com/uclmr/jack
FastQA – state of the art baseline neural architecture
JackQA – architecture from framework

Training data
Question:
lexicalize(?city wdt:P131 wd:Q55) =>
located in the administrative territorial entity of
Netherlands
Input Text
“Amsterdam is the capital city and most populous
municipality of the Netherlands. ….”
Answer span
Amsterdam [0,9]
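
A minimal sketch of how a training example like the one above could be assembled, assuming a hypothetical label lookup (a real pipeline would read labels from Wikidata) and assuming the span is given as character offsets with an exclusive end. `lexicalize` and `find_span` are illustrative names, not functions from the original system.

```python
from typing import Optional, Tuple

# Hypothetical label lookups for the identifiers used on the slide.
PREDICATE_LABELS = {"wdt:P131": "located in the administrative territorial entity"}
ENTITY_LABELS = {"wd:Q55": "Netherlands"}


def lexicalize(pattern: Tuple[str, str, str]) -> str:
    """Turn a (?var, predicate, entity) triple pattern into a question string."""
    subject, predicate, obj = pattern
    if not subject.startswith("?"):
        raise ValueError("this sketch only handles a variable in subject position")
    return f"{PREDICATE_LABELS[predicate]} of {ENTITY_LABELS[obj]}"


def find_span(text: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence, end exclusive."""
    start = text.find(answer)
    if start < 0:
        return None
    return start, start + len(answer)


question = lexicalize(("?city", "wdt:P131", "wd:Q55"))
text = "Amsterdam is the capital city and most populous municipality of the Netherlands."
span = find_span(text, "Amsterdam")
```

With these assumptions the question and span reproduce the slide's example: the question string above and the offsets 0 and 9.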
1150 predicates in Wikidata that link entities
Filter
• Subject must have a Wikipedia page
• > 30 examples
• Answer must be in the text
572 predicates
~300 examples per predicate
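
The filter above could be sketched as follows. This is an illustration under assumptions: the `Example` record is hypothetical, the "> 30 examples" threshold is read as applying per predicate, and the count is taken after the other two checks (the slide does not state the order).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Example:
    predicate: str
    text: Optional[str]  # Wikipedia text of the subject, None if it has no page
    answer: str


def filter_examples(examples: Iterable[Example],
                    min_examples: int = 30) -> Dict[str, List[Example]]:
    """Group usable examples by predicate and drop predicates with too few."""
    by_predicate: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        if ex.text is None:            # subject must have a Wikipedia page
            continue
        if ex.answer not in ex.text:   # answer must be in the text
            continue
        by_predicate[ex.predicate].append(ex)
    # Keep only predicates with more than min_examples usable examples.
    return {p: exs for p, exs in by_predicate.items() if len(exs) > min_examples}


sample = [
    Example("P131", "Amsterdam is in the Netherlands.", "Amsterdam"),
    Example("P131", "Utrecht is a city in the Netherlands.", "Utrecht"),
    Example("P17", "Paris is the capital of France.", "Paris"),
    Example("P17", None, "Berlin"),
]
kept = filter_examples(sample, min_examples=1)
```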

Training
- Train a model per predicate
- 2/3 training, 1/3 test
- Windowing scheme over the text of articles
- EC2 p2.xlarge
- 1 virtual GPU - NVIDIA K80, 4 virtual CPUs, 61 GiB RAM
- FastQA – 23 hours training time
- JackQA – 81 hours
- restarts to decrease batch sizes if model training failed
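
The slide does not describe the windowing scheme or the split procedure in detail, so the following is only a sketch under assumptions: fixed-size overlapping token windows, and a seeded random 2/3 to 1/3 split. The `size`, `stride` and `seed` parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def windows(tokens: Sequence[str], size: int, stride: int) -> List[List[str]]:
    """Cut a token sequence into overlapping windows that cover all of it."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    out: List[List[str]] = []
    start = 0
    while True:
        out.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window reached the end of the text
        start += stride
    return out


def split_train_test(items: Sequence, train_fraction: float = 2 / 3,
                     seed: int = 0) -> Tuple[list, list]:
    """Shuffle with a fixed seed, then split into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


tokens = "Amsterdam is the capital city and most populous municipality".split()
chunks = windows(tokens, size=4, stride=3)
train, test = split_train_test(range(9))
```

A stride smaller than the window size makes neighbouring windows overlap, so an answer span that straddles one window boundary can still fall entirely inside the next window, provided it is no longer than the overlap.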

Training data size as a factor?

A Prototype

Where to go
- Joint model
- Model architecture tuned to the task
- Performance on complex queries
- Accuracy
- Speed
- Other datasets
- When to use what approach
- …

Conclusion
• Structured queries are important!
• Can we do it on text? Looks like it … kind of
• Text as the KB – McCallum
• Interested in this kind of stuff?
• We’re hiring!
Questions?
Paul Groth | @pgroth | pgroth.com
indelab.org
