Rothamsted Seminar Series by Keywan Hassani-Pak, 1 April 2019
Researchers at Rothamsted and around the world are working to push the boundaries of human knowledge. One would think they have access to the best available tools to help them in their quest for knowledge. In reality the opposite is often true: the research tools at our disposal are substandard, so searching for and discovering new biological clues still requires a lot of hard work. We have developed an intelligent data model, known as the KnetMiner Knowledge Graph, that helps researchers to discover new information quickly and easily. Knowledge graphs are commonly used to represent biological entities and their relationships to one another, i.e. things, not strings. Our wheat Knowledge Graph, for example, currently contains more than 1.5 million objects and 6 million facts about, and relations between, these objects. KnetMiner (www.knetminer.org) enables you to search the Knowledge Graph for genes, phenotypes, diseases, stresses, molecules and more, and instantly tells you the stories of complex traits.
4. Genes are rarely single actors
To explain the metaphor
• Unravelling the roles of genes in complex trait genomics is often similar
to unmasking the heroes and villains in a plant biology who-dunnit
• Multiple genes are generally involved and play different roles. How they
interact reveals the plotline(s) in the trait story
• We build software and data resources to help automate the process of
explaining the plotlines and unravelling the story behind complex trait
biology
• We are creating computational tools that bring different types of
evidence together and then weigh up the competing story lines to
present the user with the most compelling ones.
5. Assembling the evidence – data is just the beginning!
Genetic diversity
Sequencing data
Phenotype data
Literature
Pathways
Ontologies
Better yields
NUE
Disease Resistance
Reference genomes
Gene expression
QTL
Data → Information → Knowledge → Understanding
We need all of these components to come together to understand our traits
6. KnetMiner – Accelerating biological discovery
http://knetminer.rothamsted.ac.uk/
@KnetMiner
• Gene Networks
• Bio Databases
• Data Integration
• Text Mining
• Visualisation
• AI & Graphs
• RDF & Neo4j
• Java & JavaScript
8. The Rise of Graph Analytics
From Leonhard Euler to Google’s Knowledge Graphs
9. Seven Bridges of Königsberg - a historically notable problem in mathematics
Its resolution by Leonhard Euler in 1736 laid the foundations of graph theory
Leonhard Euler - 1736
11. Graph Algorithms
• Pathfinding: finds the shortest path or evaluates route availability and quality
• Centrality: determines the importance of distinct nodes in the network
• Community Detection: evaluates how a group is clustered or partitioned
22. Towards FAIR KnetMiner Knowledge Graphs
• Green: Ondex plug-ins
• rdf2neo is a generic RDF-to-Neo4j conversion tool that is not specific to Ondex
• Brandizi et al., IB-2018
(https://dx.doi.org/10.1515%2Fjib-2018-0023)
• Brandizi et al., SWAT4LS-2018
(https://doi.org/10.6084/m9.figshare.7314323.v1)
23. Programmatic Access via Graph Query Languages
MATCH
  // branching via '|'
  (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
  // variable-length chains
  - [:part_of*1..3] -> (pway:Path)
RETURN
  prot.name, pway LIMIT 1000
// Very compact forms available:
MATCH (prot:Protein) -- (pway:Path) RETURN pway
• RDF + OWL used as a standardised modelling/representation language (see BioKNO
ontology: github.com/Rothamsted/bioknet-onto)
• SPARQL available too, both having pros/cons (see our benchmark:
github.com/Rothamsted/graphdb-benchmarks)
• Cypher being used for KnetMiner queries (work in progress)
24. Open Source Code
client web-server
knet-builder
Deployment Model
Data Integration
Workflows
Database Service Model
Private data
Public data
Databases and graph
queries required by
KnetMiner
For a species or
domain of interest
Graph Queries
Using Knet-Builder tools
25. How to search and interpret so much information?
• Methods needed to evaluate millions of
relationships in the knowledge graph, prioritise
genes and extract relevant subgraphs
• Interactive and exploratory tools needed to
enable knowledge discovery
• Interpretation should be the task of domain
experts, i.e. biologists!
30. What KnetMiner knows about TT2 (R Myb)
TT2 (R Myb) on chromosome 3D in wheat is predicted (p-value=0.01) to regulate the
transcriptional activation of MFT according to data from the analysis of 850 RNA-seq
samples in wheat (Ramírez-González et al. 2018) using GENIE3 (Huynh-Thu et al. 2010).
The TT2 3B homoeolog is not predicted to regulate MFT, and the TT2 3A homoeolog is not
annotated in the latest version of the wheat genome.
MFT has been recently linked to grain germination [(Zong Y ; Li Q ): “Recent studies in
both Arabidopsis and wheat have uncovered a new role of MOTHER OF FT AND TFL1
(MFT) in seed germination”] and seed dormancy [(Nakamura S ): “Mapping analysis
showed that MFT on chromosome 3A (MFT-3A) colocalized with the seed dormancy
quantitative trait locus (QTL) QPhs.ocs-3A.”].
The MFT ortholog in Arabidopsis has a 3’ UTR variant that has been associated with
(p-value = 5.5 × 10^-5) increased germination rate after 56 days of dry storage (Atwell et al.
2010).
31. DFW Nov 2018
Visualising connections can lead to new lines of inquiry
• Can grain colour and PHS be linked because R Myb targets the grain germination gene MFT?
• Do white grain varieties (R Myb mutants) have increased root hair density?
• Is there a link between root hair density and PHS?
32. What does KnetMiner know
about your trait?
Forward genetics applications: phenotype to gene
35. IRRI Germplasm Acquisition of early seedling
traits and image processing
GWAS
Gene discovery
related to seed
vigour
Guillaume Menard
Smita Kurup
Peter Eastmond
Kirsty Hassall
David Hughes
Colin Li
Direct Seeded Rice
Improve seedling
establishment,
emergence and
vigour
http://knetminer.rothamsted.ac.uk/Oryza_sativa/
40. Future development
• Personalised search experience
• Save and share your networks
• Like and Dislike buttons on relations
• Personalised networks based on usage data
• Better knowledge visualisation
• Reduce information overload using network clustering
• Annotate nodes with quantitative user data
• Predictive network analysis
• Find research trends using publication and graph data
• Automatic story and hypotheses generation
41. Future vision – Automatic story generation
TT2 (R Myb) on chromosome 3D in wheat is predicted (p-value=0.01) to regulate the transcriptional activation of MFT
according to data from the analysis of 850 RNA-seq samples in wheat (Ramírez-González et al. 2018) using GENIE3
(Huynh-Thu et al. 2010). The TT2 3B homeologue is not predicted to regulate MFT, and the TT2 3A homeologue is not
annotated in the latest version of the wheat genome. MFT has been recently linked to grain germination [(Zong Y ; Li Q ):
“Recent studies in both Arabidopsis and wheat have uncovered a new role of MOTHER OF FT AND TFL1 (MFT) in seed
germination”] and seed dormancy [(Nakamura S ): “Mapping analysis showed that MFT on chromosome 3A (MFT-3A)
colocalized with the seed dormancy quantitative trait locus (QTL) QPhs.ocs-3A.”]. The MFT ortholog in Arabidopsis has a
3’ UTR variant that has been associated with (p-value = 5.5 × 10^-5) increased germination rate after 56 days of dry
storage (Atwell et al. 2010).
• Well structured semantics of entities and
relationships in the network
• Maintaining provenance of derived
relations
• Confidence values on relations
• Supporting literature
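The requirements above (structured semantics, provenance, confidence values) can be sketched as a template that turns one relation into one sentence. This is only an illustrative sketch, not KnetMiner's implementation: the `Relation` class, its field names and `to_sentence` are hypothetical, and the example values are taken from the TT2/MFT story on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One derived relation with its confidence and provenance (hypothetical schema)."""
    source: str
    predicate: str
    target: str
    p_value: float   # confidence value on the relation
    method: str      # how the relation was derived
    citation: str    # supporting literature for the method


def to_sentence(rel: Relation) -> str:
    """Render a relation as a sentence that keeps confidence and provenance visible."""
    return (
        f"{rel.source} is predicted (p-value={rel.p_value}) to {rel.predicate} "
        f"{rel.target} using {rel.method} ({rel.citation})."
    )


# Values copied from the story text above.
tt2_mft = Relation(
    source="TT2 (R Myb)",
    predicate="regulate",
    target="MFT",
    p_value=0.01,
    method="GENIE3",
    citation="Huynh-Thu et al. 2010",
)
story_sentence = to_sentence(tt2_mft)
```

A relation that lacks a confidence value or a citation cannot be rendered by this template, which is the practical reason the graph has to carry them.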
42. KnetMiner Impact
• Over 1800 unique users last year
• 68% of users from non-UK countries
• KnetMiner part of 3 grants recently submitted to BBSRC (BBR, sLoLa)
• KnetMiner code is open-source; v3.0 released in Feb 2019
• Developers are starting to contribute to our open-source tools
• Resources are starting to link into KnetMiner (e.g. WheatIS, T3, Ensembl)
• Requests to build KnetMiner for new species (e.g. sugarcane, honey bee)
• Invited to run training courses and workshops (e.g. EBI, PAG, IB2018)
• KnetMiner used in two agrifood companies; several other companies have
expressed interest
43. The right data in the right
format and in the right hands at
the right time, saves lives.
#us2ts2019 @hdeus
44. Acknowledgements (strings - for humans only)
Bioinformatics Lab
Ajit Singh
Marco Brandizi
Sandeep Amberkar
Emma Bailey
Dan Smith
David Hughes
Rob King
Colin Li
Chris Rawlings
William Brown (ITS)
Follow us on Twitter: @KnetMiner
Collaborators & Contributors
Richard Holland (NFVL)
Misha Kapushesky (Genestack)
Kevin Dialdestoro (Genestack)
Vasiliki Koutra (KCL)
Martin Castellote (INTA)
Philipp Bayer (UWA)
Jean-Luc Jannink (Cornell)
Clay Birkett (Cornell)
Cyril Pommier (INRA)
Ramil Mauleon (IRRI)
Jan Taubert (KWS)
Uwe Scholz (IPK)
Matthias Lange (IPK)
Kumar Saurabh Singh (Exeter)
Monika Mistry
DFW WP4 members
Data Contributors
Clay Birkett (Cornell)
Cristobal Uauy (JIC)
Philippa Borrill (JIC)
Andrea Bräutigam (Uni Bielefeld)
Philipp Bayer (UWA)
Ramil Mauleon (IRRI)
Funding
Designing Future Wheat (BBSRC)
Pest Genomics Initiative
Users
Andy Phillips (RRes)
Rowan Mitchell (RRes)
Steve Hanley (RRes)
Richard Barker (NASA)
Good afternoon. Thank you for the kind introduction. Coincidentally, I have been with Rothamsted for 11 years today.
I joined the BAB department in 2008. Now, as Head of Bioinformatics, I and my team have worked and collaborated with many of you.
Researchers at Rothamsted and around the world are working to push the boundaries of human knowledge and tackle key challenges like sustainable agriculture, climate change and food security.
You would think they have access to the best available tools to help them in their quest for knowledge. In reality the opposite is often true: the research tools at our disposal are substandard, and finding and accessing relevant information can be extremely tedious.
In this seminar, I am going to present how we developed an intelligent data model, known as the KnetMiner Knowledge Graph, that helps you as a researcher to discover new information quickly and easily.
My talk is structured in 5 sections
What do we understand by biological knowledge discovery?
What are graphs and how are they useful?
How do we build the KnetMiner knowledge graph?
Finally I will show some use cases and discuss future work.
There are many types of biological knowledge discovery, e.g. drug discovery, trait discovery or gene discovery.
Today I will mostly focus on gene discovery, so what is gene discovery?
Let me use the metaphor “Genes are rarely single actors” to explain what we mean by gene discovery.
Assembling the evidence is a tedious job for humans. Machines can help but face challenges. Let’s review the types of evidence needed to go from data to understanding.
Clifford Stoll, an American astronomer, said: “Data is not information, [...]”
Data comes in many formats
Excel spreadsheets, FASTA, GFF, Tabular, XML and you name it. FAIR.
Information is in many locations and interconnected
Over 1600 biological databases (NAR Database Issue 2019)
Knowledge can be buried in free text or in structured form (like ontologies)
Over 29 million articles (PubMed) - Over 750 biomedical ontologies (BioPortal)
Ontologies are structured keyword lists that life scientists have created to define and connect concepts in their subject areas.
Understanding needs intelligence
Human vs artificial
If all these layers are structured such that machines can understand what we mean (FAIR), then machines can play a crucial role in helping us to understand traits.
KnetMiner, with a silent "K" and standing for Knowledge Network Miner, is a web-app that makes biological knowledge discovery faster and fun.
It is used by researchers at Rothamsted and around the world; 68% of users are from outside the UK.
Ajit Singh works on the front-end and usability of KnetMiner.
Marco Brandizi works on the back-end technologies and interoperability of KnetMiner.
KnetMiner is NOT a black box.
It is powered by a large integrated knowledge graph and graph analytics algorithms to search and visualise knowledge.
Graph analytics has a history dating back to 1736, when Leonhard Euler solved the “Seven Bridges of Königsberg” problem.
The problem asked whether it was possible to visit all four areas of a city, connected by seven bridges, while only crossing each bridge once.
With the insight that only the connections themselves were relevant, Euler set the groundwork for graph theory and its mathematics.
Euler modelled the problem as a graph of 4 nodes and 7 edges.
Euler proved that in any connected graph such a walk is only possible if exactly zero or two nodes have an odd number of edges.
The four areas (nodes) of Königsberg, however, all had an odd number of bridges (edges). So it wasn’t possible!
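As a rough illustration of Euler's argument (not part of the original slides), the degree check can be written in a few lines of Python. The land-mass labels are made up for this example, and the function assumes the graph is connected.

```python
from collections import Counter

# The seven bridges as edges between the four land masses.
# Labels are illustrative; parallel bridges appear as repeated edges.
BRIDGES = [
    ("island", "north"), ("island", "north"),
    ("island", "south"), ("island", "south"),
    ("island", "east"), ("north", "east"), ("south", "east"),
]


def odd_degree_nodes(edges):
    """Return the nodes that touch an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


def has_euler_walk(edges):
    """Euler's condition: zero or two odd-degree nodes (graph assumed connected)."""
    return len(odd_degree_nodes(edges)) in (0, 2)


konigsberg_possible = has_euler_walk(BRIDGES)
```

Removing any single bridge leaves exactly two land masses with an odd number of bridges, so the walk would then become possible.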
But graph analytics did not catch on immediately. Two hundred years would pass before the first graph textbook was published in 1936.
In the late 1960s and 1970s, network science and applied graph analytics really began to emerge.
In the last few years, there’s been an explosion of interest in graph technologies. In 2017, a survey (Forrester data) indicated that “69% of enterprises have or plan to implement graph databases within the next 12 months.” Demand is accelerating based on a need to better understand real-world networks and forecast their behaviours, which is resulting in many new graph-based solutions.
Networks are a representation, a tool to understand complex systems and the complex connections inherent in today’s data.
For example, you can represent how protein or social systems work by thinking about interactions between pairs of proteins or people.
By analyzing the structure of this representation, we answer questions and make predictions about how the system works or how individuals behave within it.
In this sense, network science is a set of technical tools applicable to nearly any domain, and graphs are the mathematical models used to perform analysis.
Most graph algorithms can be classified into three major groups: pathfinding, centrality and community detection.
KnetMiner uses algorithms from all three categories to…
Find the shortest path or evaluate route availability and quality (Pathfinding)
Determine the importance of distinct nodes in the network (Centrality)
Visualise graphs based on how a group is clustered or partitioned (Community Detection)
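To make the three categories concrete, here is a minimal Python sketch on a toy graph. The node names are hypothetical, and the algorithms are deliberately simple textbook stand-ins (breadth-first search, degree centrality, and connected components as the crudest form of community detection). These notes do not say which specific algorithms KnetMiner uses.

```python
from collections import deque

# Toy undirected graph as an adjacency map (all names are hypothetical).
GRAPH = {
    "geneA": {"proteinA"},
    "proteinA": {"geneA", "pathway1", "proteinB"},
    "proteinB": {"proteinA", "pathway1"},
    "pathway1": {"proteinA", "proteinB"},
    "geneX": {"traitY"},
    "traitY": {"geneX"},
}


def shortest_path(graph, start, goal):
    """Pathfinding: breadth-first search, returns None if no route exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def degree_centrality(graph):
    """Centrality: fraction of the other nodes each node is directly linked to."""
    others = len(graph) - 1
    return {node: len(nbrs) / others for node, nbrs in graph.items()}


def connected_components(graph):
    """Community detection (simplest case): groups with no links between them."""
    seen, groups = set(), []
    for node in sorted(graph):
        if node in seen:
            continue
        group, stack = set(), [node]
        while stack:
            cur = stack.pop()
            if cur not in group:
                group.add(cur)
                stack.extend(graph[cur] - group)
        seen |= group
        groups.append(group)
    return groups
```

On this toy graph, `shortest_path(GRAPH, "geneA", "pathway1")` routes through `proteinA`, which is also the node with the highest degree centrality, and `connected_components(GRAPH)` separates the `geneX`/`traitY` pair from the rest.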
Knowledge graphs are a special type of graph. They aim to structure what is known about the world.
Nodes and relations have a type (Institute, Site, based in, distance to)
Relations are based on known facts (nano-publications)
Properties (miles, year)
There are often different ways to model the real world (the year is one example). Always ask your knowledge engineer!
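A minimal sketch of such a typed graph in Python may help. The classes, relation names and all values are hypothetical placeholders. The two ways of modelling the year (as a property or as its own node) are one possible reading of the modelling choice mentioned above, not something the slides spell out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A typed thing in the graph, e.g. an Institute or a Site."""
    node_type: str
    name: str


@dataclass
class Relation:
    """A typed link between two nodes, optionally carrying properties."""
    rel_type: str
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)


# Placeholder entities; names and numbers are invented for illustration.
institute = Node("Institute", "Institute A")
site = Node("Site", "Site B")

# Option 1: the year is a property on the relation.
based_in = Relation("based_in", institute, site, {"year": 2000})
distance = Relation("distance_to", institute, site, {"miles": 10.0})

# Option 2: the year is a node of its own, so other things can link to it.
year = Node("Year", "2000")
since = Relation("since", institute, year)
```

Option 1 keeps the graph small. Option 2 makes the year searchable and linkable like any other entity.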
The knowledge graph encodes relationships between entities explicitly in a database, and not just in a sentence in a document, essentially making them machine readable.
KGs make information much easier to discover and enable world-aware inferences.
Google is building its own KG; let’s take a look at how it is used.
https://paul4innovating.com/2018/05/15/the-arrival-and-potential-of-knowledge-graphs-into-our-world/simple-enough-for-knowledge-graphs/
The Knowledge Graph can help you make some unexpected discoveries.
You might learn a new fact or new connection that prompts a whole new line of inquiry.
Do you know where Matt Groening, the creator of the Simpsons, got the idea for Homer, Marge and Lisa’s names? It’s a bit of a surprise.
All of these are linked in the graph. It’s not just a catalog of objects; it also models all these inter-relationships.
It’s the intelligence between these different entities that’s the key.
I am now going to introduce the KnetMiner KG – explain what information it captures and how we build it.
What types of entities do we need for evidence-based gene discovery?
Gene, RNA, Protein, Metabolite, SNP, Phenotype, Publication: These are some examples of things we capture in our KG.
A KG can always be extended with other types of entities.
The intelligence between these different entities is the key.
Where do KnetMiner relations come from?
This is taken from the wheat KnetMiner release notes. Relations come from:
Public databases
Pre-publication collaborative datasets
Text mining of the literature (PubMed abstracts)
Generated by KnetMiner itself
So how do we convert datasets into a graph?
Explain columns
The relationship between the columns is hard to understand for machines.
But when we transform the table into a graph of things and their explicit relationships, the data becomes more meaningful to machines.
And connections are easier to “see” for humans.
If we take a second dataset, like PPI from BioGRID…
We can overlay it on the graph and build a more comprehensive biological picture.
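The table-to-graph step and the overlay can be sketched as follows. This is a simplified illustration, not the Ondex or Knet-Builder workflow: the column layout, the relation names (`encodes`, `annotated_with`, `interacts_with`) and all identifiers are hypothetical.

```python
# Hypothetical rows of an annotation table: (gene, protein, ontology term).
ANNOTATION_ROWS = [
    ("geneA", "proteinA", "termA"),
    ("geneB", "proteinB", "termB"),
]

# Hypothetical rows of a protein-protein interaction table.
PPI_ROWS = [("proteinA", "proteinB")]


def annotation_to_edges(rows):
    """Turn each table row into explicit, typed relations between typed nodes."""
    edges = set()
    for gene, protein, term in rows:
        edges.add((("Gene", gene), "encodes", ("Protein", protein)))
        edges.add((("Protein", protein), "annotated_with", ("Term", term)))
    return edges


def ppi_to_edges(rows):
    """Turn each interaction row into a protein-protein relation."""
    return {(("Protein", a), "interacts_with", ("Protein", b)) for a, b in rows}


# Overlaying is a set union: nodes with the same type and id merge automatically.
graph = annotation_to_edges(ANNOTATION_ROWS) | ppi_to_edges(PPI_ROWS)
nodes = {n for src, _, dst in graph for n in (src, dst)}
```

Because both datasets identify `proteinA` and `proteinB` the same way, the interaction edge connects the two previously separate gene neighbourhoods. In real data, agreeing on shared identifiers is the hard part.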
KnetMiner KGs can have millions of nodes and edges. How do we store them and make them accessible?
We have made important progress towards making the KG FAIR using new technologies to replace the existing Ondex backend
Neo4j graph database with Cypher graph query language
Semantic Web (RDF) knowledgebase with SPARQL endpoint
Graph Query Languages allow us to query the knowledge programmatically and build intelligent search tools that accelerate discovery.
If you would like to know more about this work, please read our recent publications.
In simple terms, once we have the KG, methods are needed to evaluate…
Hypotheses validation and generation
Mining in public databases
No statistically significant hits (genes) in AraGWAS
But a biologically significant path exists in the knowledge graph
Analyse your private GWAS/QTL data
By Helena Deus
Acknowledgements: Various collaborators who have worked with us on KnetMiner, including contributing partners from science, academia and industry.
Note: Keywan (the “K”)