# Automating Data Science over a Human Genomics Knowledge Base
Radouane Oudrhiri, the CTO of Eagle Genomics, will talk about how Eagle Genomics is building a platform for automating data science over a human genomics knowledge base. Rad will dive into the architecture of the Eagle Genomics platform and discuss how Grakn serves as the knowledge base foundation of the system. Rad will also give a brief history of databases and semantic expressiveness, and explain how Grakn fits into the big picture.
# Radouane Oudrhiri, CTO, Eagle Genomics
Radouane has extensive experience leading world-class software and data-intensive system development across industries ranging from telecoms to healthcare, nuclear, automotive, and financial services. He is a Lean/Six Sigma Master Black Belt specialising in high-tech, IT, and software engineering, and is recognised as a leader and early adopter of Lean/Six Sigma and DFSS for IT and software. He is a Fellow of the Royal Statistical Society (RSS) and a member of ISO Technical Committee TC69 (Applications of Statistical Methods), where he co-authored the Lean & Six Sigma standard (ISO 18404) as well as the new Design for Six Sigma standard under development. He is also part of the newly formed international group on Big Data, nominated by BSI as the UK representative/expert. Radouane has also chaired the working group on Measurement Systems for Automated Processes/Systems within the ISPE (International Society for Pharmaceutical Engineering).
Automating Data Science over a Human Genomics Knowledge Base
1. Towards a framework for automating the Data Scientist – application to life science and bio data
Radouane Oudrhiri, Chief Data Scientist
Monday 27th February 2017
radouane.oudrhiri@eaglegenomics.com
Cognitive & AI Data Infrastructure Meetup
2. Table of contents
• Eagle Genomics - introduction
• BioPharma industry – data-driven innovation
• Challenges and bottleneck
• The manual data process
• Principles and concepts
• Data linkage & associated models
• Value of data and information
• The (Machine) Learning approach and mechanism
• Functional Architecture
• The Data layer
• Summary
3. About Eagle Genomics
Based in Cambridge, UK since 2008, on the Wellcome Genome Campus
Smart data management for Life Sciences - software & services
• Human & animal health
• Personal care and cosmeceuticals
• Food and nutraceuticals
Delivering the innovation platform for the genomics era:
e[automateddatascientist]
• to increase the success rate of innovation
• to enable data-driven decisions
• to enable customers to become insight driven
7. The biopharmaceutical industry is evolving in pockets
Driven by precision medicine and high throughput technologies
Data-driven innovation is a must
• Must be designed, aligned with strategy and continuously
adapted
• Requires a deep cultural change to liberate the business
opportunity
Data-intensive systems and processes are the business!
• this goes way beyond digitisation
• data is the currency
• The technical challenges of data-intensive systems are
stretching classical system engineering approaches
Urgent need for comprehensive strategy to manage data assets!
“~~Software~~ Data is eating the world” *
(*) Andreessen, M., “Software Is Eating the World”, The Wall Street Journal, 20 August 2011.
http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html
10. The bottleneck to data-driven innovation and data governance
[Diagram: effort versus value, with AI, Machine Learning and Modelling at one end of the bridge]
The industry's focus for AI + ML applications is necessary but not sufficient.
11. Accelerating data-driven innovation
[Diagram: effort distributed, value increased]
Eagle Genomics' focus: across the entire bridge.
12. Manual biocuration and data tagging are complex, unstructured and time-consuming
(example: microbiome + clinical data)
15. Solving the crucial data linkage problem:
Questions, Knowledge, and Experimental Process Models
• Process modelling: how was the data collected?
• Knowledge modelling: what does the data represent?
• Questions modelling and mapping: why was the data generated? Why will the data be generated?
Semantic enrichment: data sources are mapped to a process graph and to entities of interest, with valuation hierarchies (eaglediscover, eaglecurate).
16. Value of data and information - the missing link
Data → Information → Insight → Value
E[V(C_{p1 p2 … pn}) | e] = E[V | C_{p1 p2 … pn}, e] − E[V | S_1 S_2 … S_n, e]
Howard, R.A. (1966). Information value theory. IEEE Transactions on Systems Science and Cybernetics, SSC-2, 22-26.
• Semantic mapping processes, from one level of abstraction to another, are surrounded by ambiguity and uncertainty
• In information theory, information is defined as a “reduction of uncertainty”
• The value of information is the price one would pay a “clairvoyant” for additional information to reduce risks and uncertainties at each stage of a study, so as to increase profit
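The cited formula can be made concrete with a small worked example. The sketch below is illustrative only (the decision, outcomes and payoffs are invented, not Eagle's model): it computes a Howard-style value of clairvoyance as the expected value with perfect information minus the expected value acting only on the prior.

```python
# Hedged sketch of Howard's value-of-clairvoyance idea (illustrative numbers).

# Prior belief over an uncertain outcome, e.g. "is the biomarker predictive?"
prior = {"predictive": 0.3, "not_predictive": 0.7}

# Payoff of each decision under each outcome
payoff = {
    ("run_study", "predictive"): 100.0,
    ("run_study", "not_predictive"): -40.0,
    ("skip_study", "predictive"): 0.0,
    ("skip_study", "not_predictive"): 0.0,
}
decisions = ["run_study", "skip_study"]

def expected_value_without_info():
    # Best single decision against the prior: E[V | S, e]
    return max(sum(prior[s] * payoff[(d, s)] for s in prior) for d in decisions)

def expected_value_with_clairvoyance():
    # The clairvoyant reveals the outcome first, so we optimise per state:
    # E[V | C, e] = sum_s p(s) * max_d payoff(d, s)
    return sum(prior[s] * max(payoff[(d, s)] for d in decisions) for s in prior)

# Value of clairvoyance: the most one should rationally pay for the information
value_of_clairvoyance = (expected_value_with_clairvoyance()
                         - expected_value_without_info())
# Here: 30.0 - 2.0 = 28.0
```

Note how the information only has value because it can change the decision, which is exactly the point the next slide makes.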
17. Value of data and information
Two factors determine the value of information:
1. whether the information is new to you;
2. whether the information causes you to change your decisions.
Consequences:
• The value of information is a subjective but quantitative utility that is realised at decision time.
• The value of data/information is defined by its use and/or intended use.
18. Data valuation is a conversational process among multiple stakeholders
Stakeholders: scientists, bioinformaticians, data scientists, marketing, business leaders.
• Measure concordance/discordance among stakeholders
• Use the metrics as a means to reduce ambiguity and reach consensus
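One way to quantify the concordance the slide calls for is Kendall's coefficient of concordance W over the stakeholders' value rankings of the same datasets. The deck does not name a specific metric, so W is an assumption here, sketched without tie handling:

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance over m rank lists of n items.

    rankings: list of rank lists (ranks 1..n), one per stakeholder,
    all over the same n items, no ties. Returns a value in [0, 1]:
    1.0 = full agreement, 0.0 = no agreement.
    """
    m, n = len(rankings), len(rankings[0])
    # Sum of the ranks each item received across stakeholders
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean = sum(rank_sums) / n
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    return 12.0 * s / (m * m * (n ** 3 - n))

# Three stakeholders ranking three datasets identically -> W == 1.0
print(kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))
```

A low W would flag exactly the ambiguity the slide wants surfaced and negotiated.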
19. The value of data/information is multidimensional
Data Valuation is a prioritisation process
Seeking the Pareto effect
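The "Pareto effect" in prioritisation can be sketched as: rank items by value and keep the smallest subset covering most of the total value. The dataset names, values and 80% target below are illustrative:

```python
def pareto_select(items, target=0.8):
    """Greedy Pareto cut: the smallest high-value subset of `items`
    (a list of (name, value) pairs) covering `target` of total value."""
    ranked = sorted(items, key=lambda kv: kv[1], reverse=True)
    total = sum(v for _, v in items)
    chosen, acc = [], 0.0
    for name, v in ranked:
        if acc >= target * total:
            break
        chosen.append(name)
        acc += v
    return chosen

datasets = [("cohort_A", 50), ("cohort_B", 30), ("legacy_1", 10), ("legacy_2", 10)]
print(pareto_select(datasets))  # -> ['cohort_A', 'cohort_B']
```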
20. Data quality versus data value

|                | Low quality          | High quality    |
|----------------|----------------------|-----------------|
| **High value** | Missed opportunities | Ideal situation |
| **Low value**  | ???                  | Over-engineered |

• Data value and data quality are correlated and follow a Pareto distribution.
• Most organisations curate what is easy rather than what is necessary.
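The quadrants above can be operationalised as a simple classifier. The 0.5 cut-offs are illustrative placeholders (real scores would come from valuation models), and the low-value/low-quality quadrant keeps the slide's own "???" label:

```python
def quadrant(value, quality, value_cut=0.5, quality_cut=0.5):
    """Classify a dataset into the value-versus-quality 2x2 matrix."""
    if value >= value_cut and quality >= quality_cut:
        return "ideal situation"
    if value >= value_cut:
        return "missed opportunity"   # high value, low quality: curate this!
    if quality >= quality_cut:
        return "over-engineered"      # curation effort spent where value is low
    return "???"                      # low value, low quality

print(quadrant(0.9, 0.2))  # -> missed opportunity
```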
21. 7 laws of data asset management
1. Information is (infinitely) shareable.
2. The value of information increases with use.
3. Information is perishable.
4. The value of information increases with quality.
5. The value of information increases when combined with other information.
6. More is not necessarily better: smart, not necessarily big.
7. Information is not depletable; it is self-generating: the more you use it, the more you have.
Moody D., Walsh P. (1999), “Measuring the Value of Information: An Asset Valuation Approach”, Seventh European Conference on Information Systems (ECIS'99).
22. Building valuation models
1. Questions definition & context
2. Definition of a multi-dimensional metadata value model
3. Mapping questions to the metadata value model (intended use)
4. Automated mapping of value to datasets (atomic level)
5. Data value exploitation
Techniques: pairwise & hierarchical comparisons, scholar citations, probabilistic & statistical models, multi-scale systems.
Model calibration feeds back into each stage.
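The "pairwise & hierarchical" technique can be sketched with an AHP-style priority computation; this is an assumption about the method, since the deck does not specify it. Dimension weights are derived from a matrix of pairwise value comparisons:

```python
def priority_weights(pairwise):
    """Approximate an AHP priority vector by normalised column averages.

    pairwise[i][j]: how much more valuable dimension i is than dimension j
    (a reciprocal matrix: pairwise[j][i] == 1 / pairwise[i][j]).
    Returns weights summing to 1.
    """
    n = len(pairwise)
    col_sums = [sum(pairwise[i][j] for i in range(n)) for j in range(n)]
    normalised = [[pairwise[i][j] / col_sums[j] for j in range(n)]
                  for i in range(n)]
    return [sum(row) / n for row in normalised]

# "Relevance" judged twice as valuable as "completeness" (illustrative names):
print(priority_weights([[1, 2], [0.5, 1]]))  # -> [0.666..., 0.333...]
```

Column-average normalisation is a standard approximation of the principal eigenvector for small, nearly consistent matrices.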
24. The Learning Journey…
1. Rule-based. Learnings: heuristics and constraints. Limitations: does not scale up, requires continual change.
2. Meta-data-, ontology- and template-based. Learnings: process structures and patterns. Limitations: not flexible.
3. AI, Machine Learning and Deep Learning. Limitations: requires large amounts of data and variation of experiments.
4. Experts in the loop and Adversarial Learning: provides more flexibility and scalability.
25. Value-driven automated data curation and tagging process
Inputs: primary experimental data-sets; questions & goals.
1. Represent data as an experimental process
2. Represent questions as experimental processes
3. Cross-map
4. Enrich
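The four steps can be sketched as a composable pipeline. Every function name and field below is hypothetical, standing in for much richer process models:

```python
def represent_data_as_process(dataset):        # step 1
    return {"steps": sorted(dataset["protocol"])}

def represent_question_as_process(question):   # step 2
    return {"steps": sorted(question["required_steps"])}

def cross_map(data_proc, question_proc):       # step 3: find the gaps
    return [s for s in question_proc["steps"] if s not in data_proc["steps"]]

def enrich(dataset, gaps):                     # step 4: flag what to fetch
    return {**dataset, "needs_enrichment": gaps}

dataset = {"protocol": ["sequencing", "sample_collection"]}
question = {"required_steps": ["sample_collection", "sequencing", "qc"]}

gaps = cross_map(represent_data_as_process(dataset),
                 represent_question_as_process(question))
curated = enrich(dataset, gaps)   # gaps == ["qc"]
```

The enrichment targets fall out of the cross-mapping, which is the point of representing both data and questions in the same process vocabulary.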
26. 1 - Representation of data as experimental process models
Primary experimental data-sets are processed against a meta-data model, a meta model, an experimental data process pattern, and tagging and categorisation principles.
• Experimental data is represented as a typed process graph (process elements and assets).
• Missing process components (present/absent) are identified from experimental process patterns and models.
• Graph theory and algorithms; topologically highly constrained.
• Learning for information representation.
Autocuration Engine: semantic enrichment and context mesh entailment.
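A minimal sketch of the typed-process-graph idea, including the present/absent check against a process pattern. The class, step types and pattern below are hypothetical illustrations, not Eagle's data model:

```python
from collections import defaultdict

class ProcessGraph:
    """Experimental data as a typed process graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}                  # node id -> step type
        self.edges = defaultdict(set)    # node id -> successor ids

    def add_step(self, node_id, step_type):
        self.nodes[node_id] = step_type

    def add_flow(self, src, dst):
        self.edges[src].add(dst)

    def missing_types(self, pattern_types):
        # Which step types required by the experimental pattern are absent?
        present = set(self.nodes.values())
        return sorted(set(pattern_types) - present)

g = ProcessGraph()
g.add_step("s1", "sample_collection")
g.add_step("s2", "sequencing")
g.add_flow("s1", "s2")

# A microbiome pattern might also require QC and taxonomic assignment:
pattern = ["sample_collection", "sequencing", "qc", "taxonomic_assignment"]
print(g.missing_types(pattern))  # -> ['qc', 'taxonomic_assignment']
```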
27. 2 - Mapping questions to experimental process representations
Inputs: questions & goals; the process-oriented representation of experimental data-sets.
a) Map the questions to the process-oriented graph
b) Map the questions to the data (the experimental process)
c) Identify gaps
28. 3 - Cross-mapping and identification of enrichment data sources based on value
Internal & external sources: publications and references; meta-data (ad hoc, nomenclatures, ontologies).
Use e[discover] to identify and select sources of data enrichment based on the value they add for the questions.
29. 4 - Semantic enrichment as inductive data weaving and context mesh entailment
Internal & external sources: publications and references; meta-data (ad hoc, nomenclatures, ontologies).
31. Requirements for data modelling and management
• Graph data structures and models are a natural fit (network science)
• Support for multi-layered graphs (“lasagne graphs”)
• Rich semantic expressiveness
• Flexible and dynamic (meta) modelling
• Multidimensional (n-ary) relationships, not just binary
• Support for integrity constraints
• A language for both graph traversal (navigation) and computation (optimisation)
• Verifiability, or at least ease of verification
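The n-ary requirement is the one most relational and property-graph stores struggle with. It can be approximated by reifying the relationship as a first-class object holding role/player pairs; the relation type and roles below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """An n-ary relationship reified as an object, since many graph
    stores only support binary edges. Roles are (role_name, player) pairs."""
    relation_type: str
    roles: tuple

measurement = Relation(
    relation_type="expression-measurement",
    roles=(
        ("gene", "BRCA1"),
        ("tissue", "breast"),
        ("assay", "RNA-seq"),
        ("cohort", "study-42"),
    ),
)
print(dict(measurement.roles)["assay"])  # -> RNA-seq
```

A knowledge base with native n-ary relations (as Grakn provides) makes this reification step unnecessary, which is part of why it fits the requirements list above.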
32. A brief history of databases and models – key concepts
• 1960s: hierarchical and network data models. C. Bachman (IDS); IBM IMS (1968)
• 1967: object orientation (class and subclass), O. Dahl
• 1970s: the relational model, E. F. Codd. Algebraic, normalisation, functional dependencies
• 1976-77: Entity-Relationship, P. Chen & H. Tardieu. n-ary & reflexive relationships, cardinalities
• 1980s: RDBMS and SQL. Data independence (logical, physical); IBM DB2
• Early 1990s: OODBMS (O2); ER+ (integrity constraints, generalisation/specialisation, genericity), AFCET group
• 1994: UML
• Functional data model: data mining, OLAP, functional dependencies
• Early 2000s: semi-structured data, XML and XQuery, complex data types
• Late 2000s: giant datastores (Google, Yahoo, Amazon)
• Graph databases
33. Eagle Genomics platform functional architecture
Components (from the architecture diagram):
• Data access layer: adaptors factory, extract, load, quality control & cleansing, staging
• e[catalog]: data catalog & model management, with a graph builder (transform) and model builder
• Questions and valuation models: relevance (sources + entities), risk quantification, data quality
• e[discover]: data sources discovery, data finding & selection, data storage
• e[curate]: enrich, weave/entail
• Cross-cutting concerns: AI and ML functionalities, a conversational learning interface, workflow manager (e[hive]), security management, change management and audit trail
34. Semantics Engine key modules
• Data access layer: adaptors factory
• e[catalog]: data catalog & model management, with the model builder
• Questions and valuation models: relevance (sources + entities), risk quantification, data quality
• e[discover] with AI and ML functionalities
• Metadata, ontologies and reference data
35. Data Access and Management Layer
• Provides a unification layer for data access and management
• Allows for distributed and federated databases
• Structured data is added, with its schema, through the Ingest API; structured queries are handled through the Search API
• Data is stored in data stores (RDBMS, GraphDB / ontology store, file store)
• High-level queries are mapped, using schemas and ontologies, to database-level queries, which translate to direct database requests
• Schema mapping enables data integration at this level
• The Materialization Engine persists the results of the most used queries, as well as expected queries, to the database, catering for efficiency and scalability
• Ensures efficient and scalable access to structured data
• Scalability is designed in rather than tested in
APIs and infrastructure: Ingest API, Bulk Data API, Metadata API, Search API; caching; Graql over Grakn DB.
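The materialisation idea can be sketched as a thin layer that counts query hits and persists results once a query is "hot". The threshold policy and the API below are placeholders, not the platform's actual interface:

```python
class MaterialisationEngine:
    """Persist results of frequently used queries (illustrative sketch)."""

    def __init__(self, run_query, threshold=3):
        self.run_query = run_query        # callable hitting the backing store
        self.threshold = threshold        # materialise after this many hits
        self.counts = {}
        self.materialised = {}

    def execute(self, query):
        if query in self.materialised:    # served without touching the store
            return self.materialised[query]
        self.counts[query] = self.counts.get(query, 0) + 1
        result = self.run_query(query)
        if self.counts[query] >= self.threshold:
            self.materialised[query] = result
        return result
```

A real engine would also invalidate materialised results when the underlying data changes; that bookkeeping is omitted here.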
37. The Future: Automating the Data Scientist, providing data-science-as-a-service
Personas: Raj (bioinformatician, R&D lead); Jennifer (biologist); Tony (director, scientific innovation).
Workflow steps (from the diagram): clarify goals; determine investigation/studies; conduct experiment; load raw data into e[curate]; generate source report; value with e[discover]; ask questions; generate map; analyze data; view/refine; analyze all data; generate study report; generate report data; think of new questions/studies; refine/ask new questions.
Benefits:
• Reduce inertia
• Increase speed to insight
• Reduce time & cost
• Leverage stranded data assets
• Data science at the fingertips of the biologist
External data sources: Ensembl, PubMed, UniProt, ClinicalTrials.gov, etc.
40. Screenshot from Eagle Platform Demo:
Semantic enrichment and contextual data tagging
41. Screenshot from Eagle Platform Demo:
Question Recommendations and Formulation, based on Previous Analyses/Studies
42. Screenshot from Eagle Platform Demo:
Key Entities, Relationships, and Associated Evidence
43. Unilever: Eagle's Platform accelerating project timelines across global teams
“Unilever's digital data program now processes genetic sequences twenty times faster - without incurring higher compute costs. In addition, its robust architecture supports ten times as many scientists, all working simultaneously.”
Pete Keeley, eScience Technical Lead - R&D IT at Unilever

|            | Past: stand-alone | Present: server/cluster | Emerging: secure scalable cloud |
|------------|-------------------|-------------------------|---------------------------------|
| Turnaround | Weeks / Months    | Days / Weeks            | Hours                           |
| Scale      | 100K reads        |                         | 5 billion reads                 |
44. Model Builder and Metamodel management
• Generic meta models are more evolvable, but lack semantic expressiveness and are hard to implement efficiently (dynamic ontologies)
• Specific models are semantically richer, but tend to be rigid (static ontologies)
The model builder solves this dilemma of performance versus genericity:
• Analogous to blending a compiler and an interpreter
• A virtual machine for models
• Bridges the generic meta model & ontologies to semantically rich specific models, via the connectors & wrappers factory
45. Ontologies and Ontology Management
Ontologies are necessary components, but not sufficient for semantic resolution and management:
• Multitude of ontologies
• Conflicts between ontologies (incoherence)
• Noisy ontologies (inconsistency)
From static to dynamic ontologies:
• Managing ontologies as assets
• Valuation models applied to ontologies, based on questions and context
• Use of machine learning to dynamically construct ontologies