Presented at D4 2020
The presentation focuses on the application of a knowledge graph approach to profiling drug targets. The speaker, Nolan Nichols, highlights the potential of genetic modifiers in developing transformative therapies for patients suffering from diseases such as spinal muscular atrophy (SMA). The company, Maze Therapeutics, uses advanced data science and a collaboration with AWS Healthcare and Life Sciences to access and analyze meaningful human genetics data and to build a cloud-based data architecture. The presentation also covers the use of semantic technologies in drug discovery, target discovery, and target validation, and the integration of proprietary and shared data through the knowledge graph. It concludes with a summary of the company's launch in 2019 with a $190 million investment and its focus on translating genetic modifier insights into new therapeutics.
Focus on the Evidence: a knowledge graph approach to profiling drug targets
1. Nolan Nichols, PhD
Maze Therapeutics
October 15, 2020
Focus on the evidence: a
knowledge graph approach to
profiling drug targets
D4 GLOBAL
2. why do some people get sick and
others don’t, even when they have
the same disease-causing gene?
2
3. 3
genetic modifiers are naturally occurring and can be identified
in 2016, the Resilience Project published that they had identified individuals who should have had serious childhood diseases, but didn't, describing potential genetic modifiers
Chen et al., Nature Biotechnology, 2016
4. 4
Dr. Jonathan Weissman and
team observed that some
gene-gene interactions have a
‘buffering’ or protective effect
on disease-causing mutations
Horlbeck et al. Cell 2018
CRISPRi technology developed by the Weissman lab at UCSF enabled mapping of
genetic interactions at scale
5. based on genetic insights, genetic modifier targets can be developed into
transformative therapies for patients
5
protective modifiers can…
• be identified from human genetic data that naturally protects some people from disease
• be discovered from, or validated by, functional genomics data
• be targeted to develop new therapeutics
6. an example of a known genetic modifier inspiring a novel treatment for spinal muscular atrophy (SMA)
disease-causing gene → genetic modifier → therapy
• SMN1 mutations lead to SMA
• SMN2 overproduction can compensate for SMN1 in SMA patients
• treat by increasing SMN2 copy number to mimic the genetic modifier
maze has identified many diseases for which its platform can transform genetic modifier insights into novel therapies
6
7. our purpose-built approach: maze is translating genetic modifying insights into transformative therapies for patients
Our current research areas:
• Mendelian diseases
• Genetic modifiers
Potential future research areas:
• Polygenic diseases
• Haploinsufficiency
access and analyze meaningful human genetics data: proprietary cohort data for maze; pay for access; access public data
elucidate target biology leveraging functional genomics: genome-wide CRISPR screens, single-cell biology, cellular disease modeling, interactomics, mutational scanning, future innovation
efficiently prosecute drug discovery with multiple modalities
advanced data science for analysis of large, integrated data
maze is generating proprietary data on genetic modifiers discovered from integrated human genetic and functional genomic data
7
8. integrated human genetic and functional genomic data lowers barriers to analysis and answering questions
access and analyze meaningful human genetics data: proprietary cohort data for maze; pay for access; access public data
elucidate target biology leveraging functional genomics: genome-wide CRISPR screens, single-cell biology, cellular disease modeling, interactomics, mutational scanning, future innovation
advanced data science for analysis of large, integrated data
a 2020 survey of 2,360 data professionals from 100 countries indicates that "For most respondents, data management tasks still consume a disproportionate amount of work time." (n=1,099)
https://www.anaconda.com/state-of-data-science-2020
8
9. 9
collaboration with AWS healthcare and life sciences supports a cloud-based data architecture
data access layer: bioinformatician, biologist, chemist (visualization, computation)
data management layer: (meta)data services, governance
data persistence layer: object store, relational database, graph database, publication, open data
cloud compute layer (aws biotech blueprint)
https://aws.amazon.com/quickstart/biotech-blueprint/ FAIR Principles: https://doi.org/10.1038/sdata.2016.18
10. 10
knowledge graph technologies support use cases for standardized datasets that are designed to be connected
there are many technologies that can be used to construct a knowledge graph; the Resource Description Framework (RDF) matches the FAIR principles' focus on identifiers and controlled terms
kg:SMN1 ro:causes_condition kg:SMA .
kg:SMA rdf:type efo:Disease .
kg:SMN1 rdf:type so:Genotype .
Prefixes
rdf: RDF specification
ro: Relations Ontology
so: Sequence Ontology
efo: Experimental Factor Ontology
kg: example “knowledge graph” namespace
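The three statements above can be sketched in plain Python without any triple-store library, keeping the prefixed names as simple strings. This is only a minimal illustration of the triple pattern, not the stack described in the talk.

```python
# Minimal sketch of the slide's RDF example: each statement is a
# (subject, predicate, object) tuple, with prefixes kept as plain strings.
triples = {
    ("kg:SMN1", "ro:causes_condition", "kg:SMA"),
    ("kg:SMA", "rdf:type", "efo:Disease"),
    ("kg:SMN1", "rdf:type", "so:Genotype"),
}

def objects(subject, predicate):
    """Return all objects asserted for a subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# the type system in action: look up what kind of thing SMN1 is,
# and which condition it is asserted to cause
smn1_types = objects("kg:SMN1", "rdf:type")            # {"so:Genotype"}
smn1_causes = objects("kg:SMN1", "ro:causes_condition")  # {"kg:SMA"}
```

A real deployment would use an RDF library and a SPARQL endpoint, but the data model is exactly this: a set of subject–predicate–object statements that can be queried by pattern.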
11. 11
applications of semantic technologies: a bioinformatician, biologist, and chemist walk
into a bar
role user story
bioinformatician “I completed an analysis that includes a report with
my interpretations and tables of statistical model
output, and I want to publish these artifacts to our
data portal where my collaborators can examine
with self-service analytical tools.”
biologist “I am evaluating targets that were identified in a
bioinformatics analysis by reviewing different
sources of evidence, and I need to track the
information I am gathering and present a report to
my team.”
chemist “I received a prioritized list of potential targets for a
given disease from the target discovery team, and I
want to gather information about all compounds
that are known interactors with these targets.”
drug discovery
target discovery
target validation
12. 12
bioinformatics results are used to drive decision making and are managed as key
corporate assets
which genes are differentially expressed in this experiment?
collaborators: email → data portal
• bioinformatics reports and datasets are treated as
peer-reviewed publications in a centralized data portal
• metadata about results are formal dataset descriptions
with a semantic model and controlled terminology
• analytics applications use microservices to drive data
visualizations and navigate connected datasets
challenge: many “artisanal” analyses are lost
in email, file servers, or messaging services
13. 13
semantic technology components supporting publication of bioinformatics results
• ontology terms define result
types and relationships
• provide canonical labels and
definitions
• designed using the protégé
editor and versioned in git
• analysts initialize a
templated project directory
and environment
• a dataset description is
generated using ontology-
driven tooling
• a validated dataset
description is published to
a central data portal
• metadata is added to a
search index
• tabular files accessed via a
data service api
dataset description
• dataset descriptions are modeled as a data graph
• the shapes constraint language (SHACL) is used to validate the graph and report target constraint violations
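The shape-validation idea above can be sketched without a real SHACL engine: check a dataset description against required fields and a controlled vocabulary, and report violations. The field names and ontology terms here are illustrative assumptions, not Maze's actual schema.

```python
# Hand-rolled sketch of shape-style validation (a real system would use
# SHACL, e.g. via pySHACL); required fields and allowed terms are made up.
ALLOWED_RESULT_TYPES = {"obo:differential_expression", "obo:gene_ranking"}
REQUIRED_FIELDS = {"title", "creator", "result_type"}

def validate(description: dict) -> list:
    """Return a list of constraint violations (empty list means valid)."""
    violations = [f"missing required field: {f}"
                  for f in sorted(REQUIRED_FIELDS - description.keys())]
    result_type = description.get("result_type")
    if result_type is not None and result_type not in ALLOWED_RESULT_TYPES:
        violations.append(f"result_type not in ontology: {result_type}")
    return violations

desc = {"title": "CRISPR screen DE analysis", "creator": "analyst-1",
        "result_type": "obo:differential_expression"}
assert validate(desc) == []  # valid descriptions pass with no violations
```

In the pipeline the talk describes, a failing report like this would block publication to the data portal until the description is fixed.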
14. 14
applications of semantic technologies: a bioinformatician, biologist, and chemist walk
into a bar
role/user story table repeated from slide 11
drug discovery · target discovery · target validation
15. 15
expert target evaluations are captured as data using structured evidence annotations
challenge: knowledge gained from literature and database reviews is hidden in slide decks
does the evidence support a therapeutic hypothesis for my gene?
collaborators: slide deck → web app
• electronic data capture app used to guide users
through a target evaluation protocol
• figures and visualizations embedded in a web app with
provenance information and evidence ontology codes
• structured annotations used to generate slide decks
and connect to related data using gene identifiers
16. 16
semantic technology components support structured annotation of target evaluations
• analytics app enables ranking genes
and drill down via detailed views
• organized to guide target evaluation
process w/access to evidence
sources
• free-text review, image,
rankings, and source url
for provenance
• semantic evidence codes are used to annotate each review item
• structured target profiles enable
multiple representations
• target profile slide decks are auto-
populated with evidence reviews
• evaluating knowledge graph
models using nanopublications
and biolink
• data portal services
provide access to
results in apps
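The structured annotation described on this slide can be sketched as a small record type: free text, a semantic evidence code, and a provenance URL travel together. The field names and example values are assumptions for illustration, not the actual data model.

```python
from dataclasses import dataclass

# Illustrative sketch of one structured evidence annotation captured by
# the target review app; schema and values are assumed, not Maze's own.
@dataclass
class ReviewItem:
    gene_id: str        # e.g. an Ensembl gene identifier
    review_text: str    # free-text interpretation by the biologist
    evidence_code: str  # semantic evidence code, e.g. from the ECO ontology
    source_url: str     # provenance for the claim
    rank: int = 0       # optional ranking assigned during evaluation

item = ReviewItem(
    gene_id="ENSG00000172062",  # SMN1 (illustrative)
    review_text="protein-level expression observed in motor neurons",
    evidence_code="ECO:0000269",  # experimental evidence, manual assertion
    source_url="https://example.org/source-paper",
)
```

Because each item carries a gene identifier and a controlled evidence code, reviews can later be joined to other datasets and regenerated as slide decks or graph content.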
17. 17
applications of semantic technologies: a bioinformatician, biologist, and chemist walk
into a bar
role/user story table repeated from slide 11
drug discovery · target discovery · target validation
18. 18
proprietary and shared data are integrated by incrementally expanding the knowledge graph's scope
challenge: heterogeneously organized datasets are prohibitively time-consuming to integrate
What compounds interact
with this target and what
are their properties?
relational
database
graph
database
• significant results from internal analysis and target
reviews include cross references to external datasets
• publicly available gene models and chemical
compounds staged on maze data infrastructure
• solution enables integrated queries over proprietary
and shared data for quickly answering questions
collaborators
19. semantic technology components support integrated queries over proprietary and
shared data graphs
ensembl rdf
• ensembl rdf represents genomic features, genomic locations, and cross-references, including to chembl
chembl rdf
• chembl rdf explicitly links chemical, bioactivity, and genomic data with cross-references to other databases
differential expression rdf
• differential expression results are transformed to rdf using r2rml and linked using gene identifiers
target review rdf
• target reviews are linked via gene identifiers to enable integrated queries with chembl and ensembl
19
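The pattern on this slide, mapping internal relational rows to triples (what R2RML automates) and joining them with public graph data via shared gene identifiers, can be sketched in a few lines. All identifiers and values below are made up for illustration; a real system would use R2RML mappings and SPARQL over the Ensembl/ChEMBL RDF distributions.

```python
# internal differential-expression results, as relational-style rows
de_rows = [
    {"gene_id": "ensembl:ENSG00000172062", "log2fc": 1.8, "padj": 0.001},
]

def rows_to_triples(rows):
    """Map each row to triples, analogous to an R2RML TriplesMap."""
    for row in rows:
        subject = f"kg:de_result/{row['gene_id']}"
        yield (subject, "kg:about_gene", row["gene_id"])
        yield (subject, "kg:log2fc", row["log2fc"])

# public chembl-style triples linking the same gene to a compound
# (the compound identifier here is an assumed placeholder)
public = {("ensembl:ENSG00000172062", "kg:has_known_compound", "chembl:CHEMBL123")}

graph = set(rows_to_triples(de_rows)) | public  # incremental graph expansion

# integrated query: compounds known for genes with an internal DE result
genes = {o for s, p, o in graph if p == "kg:about_gene"}
compounds = {o for s, p, o in graph if p == "kg:has_known_compound" and s in genes}
```

The point is that once both sides use the same gene identifiers, the "join" is just a graph pattern, with no schema alignment step per question.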
21. launched in 2019 with
$190m+ investment
based in south san francisco
with ~80 employees
founded on concept of
genetic modifiers
investors
21
translating genetic
modifying insights into
new therapeutics
Editor's Notes
Today I’ll be talking about how we are using knowledge graph technologies to profile drug targets in our discovery platform.
But since many of you are hearing about Maze for the first time, I’ll start with a brief overview of our company before diving into the technical part of the talk.
Maze was founded to answer this fundamental question… Why do some people get sick and others don’t, even when they have the same disease-causing gene?
Back in 2016 when our founders were first developing the concept of maze, a paper was published by the Resilience Project in which they identified individuals who should have a serious childhood disease, but did not.
Our founders asked the question… why?
So, it’s broadly known that there are genes that can protect people from certain diseases.
But they were curious if this type of insight into so called genetic modifiers could be used as a general platform for identifying therapeutic targets.
Around this same time, one of the maze founders, Jonathan Weissman, and his lab were developing novel applications of CRISPR to do genome-wide gene–gene interaction studies
another maze founder, Steve Elledge, who won the US equivalent of the Nobel Prize for his work on DNA damage repair, was also interested in applying advanced functional genomics tools to study these gene–gene interactions
Both Jonathan and Steve were looking at how these tools could be used to kill cancer cells
HOWEVER, the interesting thing is that while Jonathan was looking for synthetic lethal combinations, he also found protective combinations that his lab described as having “buffering effects”.
Further building on the idea in the Resilience project that one could identify protective modifiers
The maze team wanted to see if they could use Jonathan's and Steve's concepts to build a drug company
So as the early maze team started to build the company, the thought was that we could identify naturally protective variants from human genetic data
then we could generate proprietary functional genomics data to validate these protective modifiers and develop new therapies for severe genetically defined diseases
But what gave us the confidence to believe that this was a viable drugging strategy?
Well… It turns out that there was a drug approved in 2016 based on this exact idea for Spinal Muscular Atrophy, which is a horrible neuromuscular disease
The treatment was designed to increase SMN2 copy number, which was found to help patients with the SMN1 mutations that cause this disease
So with this example in mind, our goal was to build a platform that could systematically identify and drug genetic modifiers for severe genetic diseases
Our approach was to develop a purpose-built platform that integrates high-value human genetic and functional genomic data from public, commercial, and proprietary sources
Then conduct genome-wide crispr screens that can be used to understand the biology related to genetic modifiers
Once we’ve amassed a critical body of evidence, we can use what we’ve learned to focus our drug discovery efforts in a data-driven way
Now that you have some general context about maze, I’d like to switch gears and start unpacking the importance of having integrated human genetic and functional genomic data
One of our goals in the data science group is to provide a 360° view of any evidence that can be used to associate a disease with potential targets
Over the past few years it’s been widely cited that a disproportionate amount of a data scientist’s time goes to data preparation rather than the analysis itself
And the survey from the Anaconda 2020 State of Data Science report still indicates that data professionals spend around 45% of their time on data management tasks
These data management tasks are essential and provide a foundation for the overall data science lifecycle, from preparation to analysis, visualization, and reporting.
With a foundation of integrated data, teams can focus on work that more fully leverages their unique skillsets
We've been developing a data platform to provide this foundation of integrated data, which is summarized in this four-layer diagram
The cloud compute layer was designed in collaboration with the AWS Healthcare and Life Sciences group and is based on their Biotech Blueprint best practices architecture
The data persistence layer is the source of truth for archived data and metadata. This includes a suite of backend services as well as virtual sources hosted on AWS that are registered into our data lake.
The data management layer follows a governance model based on the FAIR principles, ensuring that data and metadata are findable, accessible, interoperable, and reusable via web services
All of which are available to users via a data access layer that provides web apps, command line utilities, and programmatic interfaces in R and Python.
Today we'll be looking at how this platform is used to support three groups of users, specifically how it can be used to produce integrated data that follows the FAIR principles by using knowledge graph technologies.
First, there are several technologies that can be used to implement knowledge graphs, but for our purposes we are using the Resource Description Framework and the corresponding semantic technology stack.
Much of this decision comes from the fact that the semantic technology stack is based on mature standards that provide greater vendor neutrality, meaning that the data model, query language, validation, and inference rules can plug and play across different databases
I'm not going to go in depth on the technical details, but to give you an intuitive sense for what I'm talking about, we can look at this simple example where we have a single statement, or triple, that asserts that SMN1 causes the condition SMA, spinal muscular atrophy
RDF then lets you expand on this with a type system where you can state that SMN1 is of type Genotype and SMA is of type Disease.
Under the hood, this is written out in a simple text document with three columns, where these prefixes are used as shorthand for references to specific ontology terms or other identifiers.
There is much more to discuss here, but I want to move on to discussing how this can be applied in a few scenarios.
The three examples are based on real tasks that are completed by bioinformaticians in my data science group, biologists in our functional genomics group, and a computational chemist on our drug discovery team.
I chose a set of examples that walk us through early stage work from target discovery, to validation, and drug discovery, with an eye toward how these data can be linked together
I’ll briefly highlight each of these user stories before looking at each in more depth.
…
In our bioinformatics user story, a typical analyst works primarily with the Bioconductor open source tools in RStudio to identify, for example, differentially expressed genes in a CRISPR screen
As part of this workflow, analysts use a rich data structure called a SummarizedExperiment, where phenotypic data, feature/gene information, count matrices, and experiment metadata are captured in a single data structure
With all content required for analysis in this SummarizedExperiment object, analysts generate a report with their interpretations that is communicated to collaborators.
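The SummarizedExperiment described here is an R/Bioconductor object; purely for intuition, its shape can be sketched as a plain Python structure bundling the same four parts. The toy data (2 genes × 3 samples) is made up.

```python
# Rough Python analogue of a Bioconductor SummarizedExperiment: assays,
# feature annotations, sample annotations, and experiment metadata
# travel together in one object. Values are illustrative only.
experiment = {
    "assays": {"counts": [[10, 0, 5],    # gene x sample count matrix
                          [3, 8, 2]]},
    "rowData": [{"gene_id": "GENE_A"},   # one entry per assay row (gene)
                {"gene_id": "GENE_B"}],
    "colData": [{"sample": "s1", "condition": "control"},  # per sample
                {"sample": "s2", "condition": "treated"},
                {"sample": "s3", "condition": "treated"}],
    "metadata": {"screen": "genome-wide CRISPR (illustrative)"},
}

# the invariant that makes the structure useful: annotations line up
# with the matrix dimensions
assert len(experiment["rowData"]) == len(experiment["assays"]["counts"])
assert len(experiment["colData"]) == len(experiment["assays"]["counts"][0])
```

Keeping everything in one object is what lets downstream tooling generate a complete, self-describing dataset description from the analysis environment.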
Traditionally, one of the core challenges here is that these “artisanal” analyses may not be formally tracked and captured as part of the corporate memory, and are instead lost to suboptimal communication channels
At maze, we’ve taken the approach of treating such analyses as key corporate assets that are described in a standardized way, published to a data portal, and accessible to downstream applications.
To implement this framework, we're developing an ontology using the protege editor that provides the controlled terms for our dataset descriptions
With this, analysts can create a standardized analysis environment and generate a dataset description of their results using ontology-driven tools, producing a document in an RDF format called JSON-LD
These descriptions can then be validated using another part of the semantic technology stack called SHACL, which can generate a report of any violations, for example if results are annotated with terms that are not part of the ontology.
Once a description is validated, it can be published alongside data files to our central data portal, where metadata is added to a search index and a data API provides access to statistical results
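A JSON-LD dataset description of the kind just mentioned can be sketched as an ordinary JSON document whose `@context` maps short terms to vocabulary IRIs. The context entries, identifiers, and field values below are illustrative assumptions, not the actual portal schema.

```python
import json

# Hedged sketch of a JSON-LD dataset description; `dcterms` is the real
# Dublin Core namespace, while the `kg` namespace and values are made up.
description = {
    "@context": {
        "dcterms": "http://purl.org/dc/terms/",
        "kg": "https://example.org/kg/",
    },
    "@id": "kg:dataset/de-analysis-001",
    "@type": "kg:DifferentialExpressionResult",
    "dcterms:title": "CRISPR screen differential expression",
    "dcterms:creator": "analyst-1",
}

# serialized form of the document that would be published to the portal
doc = json.dumps(description, indent=2)
```

Because JSON-LD is also RDF, the same document can be indexed for search as JSON and loaded into the graph database as triples without a separate conversion step.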
Now let’s take a look at the biologist’s user story, where gene lists produced in the previous step are researched in greater detail.
While not in the lab due to COVID, I worked closely with bench scientists to follow up on a list of hits from a CRISPR screen
Our goal was to survey the literature and online databases to gather evidence around a set of genes, trying to understand their potential role in a given disease. The challenge is that much of the time the information gathered is hidden away in a slide deck.
Rather than create a different slide deck for each gene, we developed an app to create a database of everything we were learning by following a target evaluation protocol
This information fed into the design of the experiments that are now being run in the lab and the evidence gathered is now part of our growing knowledge graph
In terms of implementation, the data portal provided access to the downstream analytics app that was used for target reviews
The analytics app provides tools for ranking genes based on different attributes and then examining detailed views that guide users through a target evaluation protocol
The target review app provides form-based data entry that captures images, free text, and annotation with evidence codes from an ontology, for example, tagging a review as protein-level expression.
The structured reviews collected can then be used to generate different views of the data. Initially these were templated PowerPoint slides, but we are evaluating other open models, such as nanopublications and Biolink, for organizing content like these reviews into a knowledge graph.
These in turn can be added to the data portal and fed back into analytics apps, creating a virtuous feedback loop.
Finally, we’ll look at the user story of a chemist who is interested in cross-referencing internal results with public compound databases.
One of the most challenging parts of these efforts is the amount of time it takes to align the schemas between datasets that were designed with a specific application in mind.
To lower the barriers to reuse and ad-hoc integration requests, we include cross-references to external datasets in our analysis results and target reviews
We've also brought publicly available gene models and chemical compounds into the maze data infrastructure and worked toward a solution that enables integrated queries for quickly answering questions that include our internal data
First, taking differential expression as an example, we use the relational-to-RDF mapping language (R2RML) and related technologies to transform internal data into RDF
Similarly, we are leveraging a graph-based representation of the aforementioned target reviews that includes the same gene identifiers
As a proof of concept, we use the EBI distributions of both Ensembl and ChEMBL in RDF, which already provide mappings from genes, to transcripts, to proteins, to ChEMBL targets.
By leveraging the built in properties of RDF, we were able to incrementally expand our knowledge graph with new facts and sources to cross reference our internal data with public sources
We’ve discussed three examples of where semantic technologies can be used to both capture and use FAIR data, but if we step back we can see a bigger picture that can emerge when following this general pattern to data management.
While structuring information this way enables a traditional pipeline for a single drug campaign, it can also enable synergies across different programs in a way that isn't usually available.
Many of our experimental insights are about learning generalizable techniques, reagent properties, dosing conditions, etc, that can be valuable to other programs.
When you network information across experiments and systems like this, you are also implicitly gathering practical insights that you can immediately use in other contexts.
By identifying use cases and incrementally building the maze knowledge graph over time, my hope is that this network of data spanning from target discovery, validation, and drug discovery will help us identify the right patients for the therapies we are developing.
This is maze. We launched the company in 2019 and are pursuing a novel approach to drug development. We raised $191M from a group of experienced investors, and have a strong team of around 75 employees.