Neo4j for Discovering Drugs and Biomarkers
1. — CONFIDENTIAL—
MICROBIOME TO MEDICINE™
Helios2 (Neo4j) for Discovering Drugs and Biomarkers
Satish Viswanatham, Head of Data Engineering
Brendan, Cesar, Divya, Jin and Richard
2. CONFIDENTIAL
Outline of the talk
• Technical Terms will be explained briefly as they are encountered
• Links provided
• Why Microbiome?
• Challenges in Microbiome data
• High Level Architecture
• Implementation Highlights
• Future Work
• Lastly, more examples from the industry.
3. CONFIDENTIAL
The microbiome is a rich source of biomarkers and potent bacterial peptides
• Untapped library of novel drugs
• Rich data source of host:microbial interactions
• New “organ” to re(de)fine patients and medical practice
100 TRILLION BACTERIA! >25,000,000 genes*
BIOLOGICAL FUNCTIONS INFLUENCED: Glucose/lipids, GI health, Immune function, Metabolism, Pathogens, Cancer
*PMID:31415755. Compare to 25,000 human genes
4. CONFIDENTIAL
The sg-4sight Platform Summary
We built sg-4sight to
• Collect clinical microbiome data
• Conduct multi-technology (16S, MTT/MTG) meta-analysis (differential abundance)
• Find bacterial biomarkers (Gene, Strain, Peptide, ...)
• Select bacterial polypeptide therapeutic candidates in a data-driven manner
• Efficiently prepare and screen them through in vitro and in vivo models of disease
• Lastly, find the human targets through which they produce the therapeutic effect.
5. — CONFIDENTIAL—
sg-4sight platform
Federated Data Engine - SGKnowledgeBase (Helios/Neo4j, Buho/Athena, …)
7. — CONFIDENTIAL—
Our sg-4sight tech-powered drug discovery engine is built to disrupt drug discovery
*Data Engine was continually evolving as new technology was added, so each program over time was analyzed according to the current status of our Data Engine.
SG KnowledgeBase™ is a proprietary database that organizes -omics data and clinical metadata for systematic mining. AWS: Amazon Web Services. sg-4sight is the proposed platform name, along with multiple variations submitted for trademark approval. MS: Mass Spectrometry.
8. CONFIDENTIAL
Why Neo4j
• Flexible Schema - NoSQL
• Graph Queries
• Easy-to-learn Cypher Query Language: gentle learning curve
• Query performance > SQL
• 1000 times faster
• Community Edition
• Neo4j was already used for another experimental project
• Great Community!
9. CONFIDENTIAL
Data from multiple clinical sources are compiled in the SGKnowledgeBase for powerful cross-cohort discovery
[Diagram]
• Sources: Second Genome Proprietary Datasets; Public datasets
• Processing: Metadata Standardization & Data Quality Control/Sanity Checks & Custom Data Loaders
• Odessa (Django): Constraint/Sanity checks, Vocab/Onto, Data Loading
• Destination: Second Genome KnowledgeBase & Helios2
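The standardization and sanity-check stage named in the diagram could look roughly like the sketch below. It is not Second Genome's loader: the field names, the disease vocabulary, and the function names are all hypothetical. Only the technology codes (16S, MTT, MTG) come from the slides.

```python
# Hypothetical controlled vocabulary: free-text labels -> one canonical term.
DISEASE_VOCAB = {
    "uc": "ulcerative colitis",
    "ulcerative colitis": "ulcerative colitis",
    "cd": "Crohn's disease",
    "crohn's disease": "Crohn's disease",
    "healthy": "healthy control",
    "control": "healthy control",
}

# Hypothetical required metadata fields; technology codes are from the slides.
REQUIRED_FIELDS = ("sample_id", "cohort", "disease", "technology")
ALLOWED_TECHNOLOGIES = {"16S", "MTT", "MTG"}


def standardize(record: dict) -> dict:
    """Return a copy with trimmed strings and a canonical disease term."""
    clean = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    disease = clean.get("disease")
    if isinstance(disease, str):
        # Unknown terms pass through unchanged and are caught by sanity_check.
        clean["disease"] = DISEASE_VOCAB.get(disease.lower(), disease)
    return clean


def sanity_check(record: dict) -> list:
    """Return a list of problems; an empty list means the record may be loaded."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    tech = record.get("technology")
    if tech and tech not in ALLOWED_TECHNOLOGIES:
        problems.append(f"unknown technology: {tech}")
    disease = record.get("disease")
    if disease and disease not in set(DISEASE_VOCAB.values()):
        problems.append(f"disease not in vocabulary: {disease}")
    return problems


def prepare_for_loading(records):
    """Split records into loadable rows and rejected rows with their reasons."""
    accepted, rejected = [], []
    for raw in records:
        rec = standardize(raw)
        problems = sanity_check(rec)
        if problems:
            rejected.append((rec, problems))
        else:
            accepted.append(rec)
    return accepted, rejected
```

Rejected rows keep their reasons, so a curator can fix the metadata at the source instead of patching it after loading.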
10. CONFIDENTIAL
Helios Nodes and Relationships
Node Label                 Node Count    Average Number of Relationships
Dataset                    Any           Millions
Phage_display              Any           Millions
Meta_analysis              Any           High Thousands
Meso_scale_discovery       Any           Thousands
...                        Any           Hundreds
Bin                        Thousands     Any
NCBI_assembly_accession    Thousands     Any
Strain                     Thousands     Any
Peptide                    Millions      Any
Our schema is centered around peptides:
● With every experiment we add to the knowledge around that protein.
● Every mtma, lab assay, and phage display adds more information on how the peptide looks in a set of published studies, an immune assay, or a binding assay
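One way to picture the peptide-centered schema is as parameterized Cypher statements that attach each new experiment to an existing Peptide node. This is a sketch, not the Helios2 schema. The node labels are taken from the table on this slide, but the relationship type `OBSERVED_IN`, the property names `sequence` and `name`, and the helper `attach_observation` are hypothetical. The code only builds query strings and does not connect to a database.

```python
# Node labels come from the slide's table; everything else is hypothetical.
EXPERIMENT_LABELS = {"Dataset", "Phage_display", "Meta_analysis", "Meso_scale_discovery"}


def attach_observation(peptide_seq, experiment_label, experiment_name, properties):
    """Build a parameterized Cypher statement linking a peptide to one experiment.

    MERGE creates a node or relationship only if it does not exist yet, so
    repeated experiments keep adding relationships to the same Peptide node.
    """
    if experiment_label not in EXPERIMENT_LABELS:
        raise ValueError(f"unknown experiment label: {experiment_label}")
    # Labels cannot be passed as Cypher parameters, so the label is checked
    # against the whitelist above and then interpolated into the query text.
    query = (
        "MERGE (p:Peptide {sequence: $sequence}) "
        f"MERGE (e:{experiment_label} {{name: $name}}) "
        "MERGE (p)-[r:OBSERVED_IN]->(e) "
        "SET r += $properties"
    )
    params = {"sequence": peptide_seq, "name": experiment_name, "properties": properties}
    return query, params


# A read query: every experiment that has touched one peptide.
PEPTIDE_HISTORY = (
    "MATCH (p:Peptide {sequence: $sequence})-[r:OBSERVED_IN]->(e) "
    "RETURN labels(e) AS experiment_type, e.name AS experiment, properties(r) AS result"
)
```

With the official Neo4j Python driver, each `(query, params)` pair could be passed to `session.run(query, params)`. Keeping values in parameters rather than in the query text lets the server reuse the query plan and avoids injection through data values.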
11. CONFIDENTIAL
Helios is the largest known database of interactions
● Connects high-throughput past observations to accelerate future discovery
○ Between microbial peptides and host cells
○ Between microbial taxa and disease states
○ Between microbial functional genes and disease states
● Enables discovery of common peptide features which predict a desired functionality.
12. CONFIDENTIAL
What we built in Helios2
• DevSecOps - Data Confidentiality Controls
• Partial Updates
• Constraint System
• Two Phase Commits
• Automatic Backups: daily, weekly, and monthly, to a remote region
• Security/SSL, Logs to Fluentd
• Domain Name & AWS Security Group Via CloudFormation
• DevOps - Alerts
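The slides do not say how Helios2 implements partial updates, the constraint system, or two-phase commits. One possible reading is a validate-then-apply batch update: phase 1 stages the changes on a copy and checks every constraint, and phase 2 swaps the copy in only if nothing failed. The toy in-memory sketch below illustrates that pattern. It is not a distributed two-phase commit protocol, and all class, function, and property names are hypothetical.

```python
import copy


class ConstraintError(Exception):
    """Raised when a staged batch violates at least one constraint."""


def require_property(label, prop):
    """Constraint factory: nodes with `label` must carry a non-empty `prop`."""
    def check(node):
        if node["label"] == label and not node["props"].get(prop):
            return f"{label} {node['id']} is missing '{prop}'"
        return None
    return check


class StagedGraph:
    """Toy in-memory node store with validate-then-apply updates."""

    def __init__(self, constraints):
        self.nodes = {}  # node id -> {"id", "label", "props"}
        self.constraints = constraints

    def _prepare(self, changes):
        """Phase 1: apply changes to a copy and collect constraint violations."""
        staged = copy.deepcopy(self.nodes)
        for change in changes:
            node = staged.setdefault(
                change["id"],
                {"id": change["id"], "label": change["label"], "props": {}},
            )
            # Partial update: only the supplied properties are overwritten.
            node["props"].update(change.get("props", {}))
        violations = [
            msg
            for node in staged.values()
            for check in self.constraints
            if (msg := check(node)) is not None
        ]
        return staged, violations

    def commit(self, changes):
        """Phase 2: swap in the staged copy only if phase 1 found no violations."""
        staged, violations = self._prepare(changes)
        if violations:
            raise ConstraintError("; ".join(violations))
        self.nodes = staged
```

Because the live store is replaced only after every check passes, a batch with one bad record leaves the graph exactly as it was.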
13. CONFIDENTIAL
Future
The design of Helios and the underlying Neo4j graph database allows for the easy integration of additional layers of biomedical data, such as
• pharmacological action of drugs
• non-small molecule drugs
• disease information
• target development categories
• Schema optimizations!
• Labels vs properties, Super nodes
We also intend to integrate more cheminformatics and network analysis features into the platform in the future.
14. CONFIDENTIAL
References
● We also want to give a shout-out to the CKG project (Clinical Knowledge Graph) for uploading a dump of their database, which can be used to easily create a Neo4j graph database harmonizing 9 ontologies and 26 relevant biomedical databases. Experimental studies included in the publication are also included as CKG reports.
○ https://ckg.readthedocs.io/en/latest/project_report/project-report.html
● https://reactome.org/dev/graph-database/extract-participating-molecules
● https://neo4j.com/blog/integrating-biology-public-neo4j-database/
● https://cytoscape.org/what_is_cytoscape.html
● https://www.researchgate.net/publication/304407871_Using_Neo4j_for_Mining_Protein_Graphs_A_Case_Study (PPI paper)
● https://link.springer.com/article/10.1186/s13321-020-0409-9
15. CONFIDENTIAL
Q&A
● Thanks for your time.
○ https://www.secondgenome.com/development-platform
○ Our turnkey platform - partnering@secondgenome.com
● Please email if you want to continue the conversation: satish@secondgenome.com
● Second Genome is proud to be named one of the Top 10 Best Places to Work in Biopharma
● We are hiring!
○ https://www.secondgenome.com/culture-careers/careers