# Automating Data Science over a Human Genomics Knowledge Base
Radouane Oudrhiri, the CTO of Eagle Genomics, will talk about how Eagle Genomics is building a platform for automating data science over a human genomics knowledge base. Rad will dive into the architecture of the Eagle Genomics platform and discuss how Grakn serves as the knowledge base foundation of the system. Rad will also give a brief history of databases and semantic expressiveness, and explain how Grakn fits into the big picture.
# Radouane Oudrhiri, CTO, Eagle Genomics
Radouane has extensive experience leading world-class software and data-intensive system development across industries ranging from telecoms to healthcare, nuclear, automotive, and financial services. He is a Lean/Six Sigma Master Black Belt specialising in high-tech, IT, and software engineering, and is recognised as a leader and early adopter of Lean/Six Sigma and DFSS for IT and software. He is a Fellow of the Royal Statistical Society (RSS) and a member of ISO Technical Committee TC69 (Applications of Statistical Methods), where he co-authored the Lean & Six Sigma standard (ISO 18404) as well as the new Design for Six Sigma standard under development. He is also part of the newly formed international group on Big Data, nominated by BSI as the UK representative/expert. Radouane has also chaired the working group on Measurement Systems for Automated Processes/Systems within the ISPE (International Society for Pharmaceutical Engineering).
Automating Data Science over a Human Genomics Knowledge Base
1. Towards a framework for automating the Data Scientist – application to life science and bio data
Radouane Oudrhiri, Chief Data Scientist
Monday 27th February 2017
radouane.oudrhiri@eaglegenomics.com
Cognitive & AI Data Infrastructure Meetup
2. Table of contents
• Eagle Genomics - introduction
• BioPharma industry – data-driven innovation
• Challenges and bottleneck
• The manual data process
• Principles and concepts
• Data linkage & associated models
• Value of data and information
• The (Machine) Learning approach and mechanism
• Functional Architecture
• The Data layer
• Summary
3. About Eagle Genomics
Based in Cambridge, UK since 2008, on the Wellcome Genome Campus
Smart data management for Life Sciences - software & services
• Human & animal health
• Personal care and cosmeceuticals
• Food and nutraceuticals
Delivering the innovation platform for the genomics era:
e[automateddatascientist]
• to increase the success rate of innovation
• to enable data-driven decisions
• to enable customers to become insight driven
7. The biopharmaceutical industry is evolving in pockets
Driven by precision medicine and high throughput technologies
Data-driven innovation is a must
• Must be designed, aligned with strategy and continuously
adapted
• Requires a deep cultural change to liberate the business
opportunity
Data-intensive systems and processes are the business!
• this goes way beyond digitisation
• data is the currency
• The technical challenges of data-intensive systems are
stretching classical system engineering approaches
Urgent need for comprehensive strategy to manage data assets!
“~~Software~~ Data is eating the world” *
(*) Andreessen, M., “Software Is Eating the World”, The Wall Street Journal, 20 August 2011.
http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html
10. The bottleneck to data-driven innovation and data governance
[Diagram: effort versus value, with AI, Machine Learning and Modelling at one end of the bridge]
The industry's focus for AI + ML applications is necessary but not sufficient.
11. Accelerating data-driven innovation
[Diagram: effort distributed, value increased]
Eagle Genomics' focus: across the entire bridge.
12. Manual biocuration and data tagging are complex, unstructured and time-consuming
(example: microbiome + clinical data)
15. Solving the crucial data linkage problem:
Questions, Knowledge, and Experimental Process Models
• Process modelling: how was the data collected?
• Knowledge modelling: what does the data represent?
• Questions modelling and mapping: why was the data generated? Why will the data be generated?
Semantic enrichment: data sources are mapped to a process graph and to entities of interest, with valuation hierarchies (eaglediscover, eaglecurate).
16. Value of data and information - the missing link
Data → Information → Insight → Value
E[V(C_{p1 p2 … pn}) | e] = E[V | C_{p1 p2 … pn}, e] − E[V | S_1 S_2 … S_n, e]
Howard, R.A. (1966). Information value theory. IEEE Transactions on Systems Science and Cybernetics, SSC-2, 22-26.
• Semantic mapping processes, from one level of abstraction to another, are surrounded by ambiguity and uncertainty
• In information theory, information is defined as a “reduction of uncertainty”
• The value of information is the price one would pay a “clairvoyant” for additional information to reduce risks and uncertainties at each stage of a study, so as to increase profit
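The cited formula can be made concrete with a small worked example. The sketch below is illustrative only (the decision, outcomes and payoffs are invented, not Eagle's model): it computes a Howard-style value of clairvoyance as the expected value with perfect information minus the expected value acting only on the prior.

```python
# Hedged sketch of Howard's value-of-clairvoyance idea (illustrative numbers).

# Prior belief over an uncertain outcome, e.g. "is the biomarker predictive?"
prior = {"predictive": 0.3, "not_predictive": 0.7}

# Payoff of each decision under each outcome
payoff = {
    ("run_study", "predictive"): 100.0,
    ("run_study", "not_predictive"): -40.0,
    ("skip_study", "predictive"): 0.0,
    ("skip_study", "not_predictive"): 0.0,
}
decisions = ["run_study", "skip_study"]

def expected_value_without_info():
    # Best single decision against the prior: E[V | S, e]
    return max(sum(prior[s] * payoff[(d, s)] for s in prior) for d in decisions)

def expected_value_with_clairvoyance():
    # The clairvoyant reveals the outcome first, so we optimise per state:
    # E[V | C, e] = sum_s p(s) * max_d payoff(d, s)
    return sum(prior[s] * max(payoff[(d, s)] for d in decisions) for s in prior)

# Value of clairvoyance: the most one should rationally pay for the information
value_of_clairvoyance = (expected_value_with_clairvoyance()
                         - expected_value_without_info())
# Here: 30.0 - 2.0 = 28.0
```

Note how the information only has value because it can change the decision, which is exactly the point the next slide makes.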
17. Value of data and information
Two factors determine the value of information:
1. whether the information is new to you;
2. whether the information causes you to change your decisions.
Consequences:
• The value of information is a subjective but quantitative utility that is realised at decision time.
• The value of data/information is defined by its use and/or intended use.
18. Data valuation is a conversational process among multiple stakeholders
Stakeholders: scientists, bioinformaticians, data scientists, marketing, business leaders.
• Measure concordance/discordance among stakeholders
• Use the metrics as a means to reduce ambiguity and reach consensus
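One way to quantify the concordance the slide calls for is Kendall's coefficient of concordance W over the stakeholders' value rankings of the same datasets. The deck does not name a specific metric, so W is an assumption here, sketched without tie handling:

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance over m rank lists of n items.

    rankings: list of rank lists (ranks 1..n), one per stakeholder,
    all over the same n items, no ties. Returns a value in [0, 1]:
    1.0 = full agreement, 0.0 = no agreement.
    """
    m, n = len(rankings), len(rankings[0])
    # Sum of the ranks each item received across stakeholders
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean = sum(rank_sums) / n
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    return 12.0 * s / (m * m * (n ** 3 - n))

# Three stakeholders ranking three datasets identically -> W == 1.0
print(kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))
```

A low W would flag exactly the ambiguity the slide wants surfaced and negotiated.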
19. The value of data/information is multidimensional
Data Valuation is a prioritisation process
Seeking the Pareto effect
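The "Pareto effect" in prioritisation can be sketched as: rank items by value and keep the smallest subset covering most of the total value. The dataset names, values and 80% target below are illustrative:

```python
def pareto_select(items, target=0.8):
    """Greedy Pareto cut: the smallest high-value subset of `items`
    (a list of (name, value) pairs) covering `target` of total value."""
    ranked = sorted(items, key=lambda kv: kv[1], reverse=True)
    total = sum(v for _, v in items)
    chosen, acc = [], 0.0
    for name, v in ranked:
        if acc >= target * total:
            break
        chosen.append(name)
        acc += v
    return chosen

datasets = [("cohort_A", 50), ("cohort_B", 30), ("legacy_1", 10), ("legacy_2", 10)]
print(pareto_select(datasets))  # -> ['cohort_A', 'cohort_B']
```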
20. Data quality versus data value

|                | Low quality          | High quality    |
|----------------|----------------------|-----------------|
| **High value** | Missed opportunities | Ideal situation |
| **Low value**  | ???                  | Over-engineered |

• Data value and data quality are correlated and follow a Pareto distribution.
• Most organisations curate what is easy rather than what is necessary.
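The quadrants above can be operationalised as a simple classifier. The 0.5 cut-offs are illustrative placeholders (real scores would come from valuation models), and the low-value/low-quality quadrant keeps the slide's own "???" label:

```python
def quadrant(value, quality, value_cut=0.5, quality_cut=0.5):
    """Classify a dataset into the value-versus-quality 2x2 matrix."""
    if value >= value_cut and quality >= quality_cut:
        return "ideal situation"
    if value >= value_cut:
        return "missed opportunity"   # high value, low quality: curate this!
    if quality >= quality_cut:
        return "over-engineered"      # curation effort spent where value is low
    return "???"                      # low value, low quality

print(quadrant(0.9, 0.2))  # -> missed opportunity
```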
21. 7 laws of data asset management
1. Information is (infinitely) shareable.
2. The value of information increases with use.
3. Information is perishable.
4. The value of information increases with quality.
5. The value of information increases when combined with other information.
6. More is not necessarily better: smart, not necessarily big.
7. Information is not depletable; it is self-generating: the more you use it, the more you have.
Moody D., Walsh P. (1999), “Measuring the Value of Information: An Asset Valuation Approach”, Seventh European Conference on Information Systems (ECIS'99).
22. Building valuation models
1. Questions definition & context
2. Definition of a multi-dimensional metadata value model
3. Mapping questions to the metadata value model (intended use)
4. Automated mapping of value to datasets (atomic level)
5. Data value exploitation
Techniques: pairwise & hierarchical comparisons, scholar citations, probabilistic & statistical models, multi-scale systems.
Model calibration feeds back into each stage.
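The "pairwise & hierarchical" technique can be sketched with an AHP-style priority computation; this is an assumption about the method, since the deck does not specify it. Dimension weights are derived from a matrix of pairwise value comparisons:

```python
def priority_weights(pairwise):
    """Approximate an AHP priority vector by normalised column averages.

    pairwise[i][j]: how much more valuable dimension i is than dimension j
    (a reciprocal matrix: pairwise[j][i] == 1 / pairwise[i][j]).
    Returns weights summing to 1.
    """
    n = len(pairwise)
    col_sums = [sum(pairwise[i][j] for i in range(n)) for j in range(n)]
    normalised = [[pairwise[i][j] / col_sums[j] for j in range(n)]
                  for i in range(n)]
    return [sum(row) / n for row in normalised]

# "Relevance" judged twice as valuable as "completeness" (illustrative names):
print(priority_weights([[1, 2], [0.5, 1]]))  # -> [0.666..., 0.333...]
```

Column-average normalisation is a standard approximation of the principal eigenvector for small, nearly consistent matrices.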
24. The Learning Journey…
1. Rule-based. Learnings: heuristics and constraints. Limitations: does not scale up, requires continual change.
2. Meta-data-, ontology- and template-based. Learnings: process structures and patterns. Limitations: not flexible.
3. AI, Machine Learning and Deep Learning. Limitations: requires large amounts of data and variation of experiments.
4. Experts in the loop and Adversarial Learning: provides more flexibility and scalability.
25. Value-driven automated data curation and tagging process
Inputs: primary experimental data-sets; questions & goals.
1. Represent data as an experimental process
2. Represent questions as experimental processes
3. Cross-map
4. Enrich
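The four steps can be sketched as a composable pipeline. Every function name and field below is hypothetical, standing in for much richer process models:

```python
def represent_data_as_process(dataset):        # step 1
    return {"steps": sorted(dataset["protocol"])}

def represent_question_as_process(question):   # step 2
    return {"steps": sorted(question["required_steps"])}

def cross_map(data_proc, question_proc):       # step 3: find the gaps
    return [s for s in question_proc["steps"] if s not in data_proc["steps"]]

def enrich(dataset, gaps):                     # step 4: flag what to fetch
    return {**dataset, "needs_enrichment": gaps}

dataset = {"protocol": ["sequencing", "sample_collection"]}
question = {"required_steps": ["sample_collection", "sequencing", "qc"]}

gaps = cross_map(represent_data_as_process(dataset),
                 represent_question_as_process(question))
curated = enrich(dataset, gaps)   # gaps == ["qc"]
```

The enrichment targets fall out of the cross-mapping, which is the point of representing both data and questions in the same process vocabulary.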
26. 1 - Representation of data as experimental process models
Primary experimental data-sets are processed against a meta-data model, a meta model, an experimental data process pattern, and tagging and categorisation principles.
• Experimental data is represented as a typed process graph (process elements and assets).
• Missing process components (present/absent) are identified from experimental process patterns and models.
• Graph theory and algorithms; topologically highly constrained.
• Learning for information representation.
Autocuration Engine: semantic enrichment and context mesh entailment.
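A minimal sketch of the typed-process-graph idea, including the present/absent check against a process pattern. The class, step types and pattern below are hypothetical illustrations, not Eagle's data model:

```python
from collections import defaultdict

class ProcessGraph:
    """Experimental data as a typed process graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}                  # node id -> step type
        self.edges = defaultdict(set)    # node id -> successor ids

    def add_step(self, node_id, step_type):
        self.nodes[node_id] = step_type

    def add_flow(self, src, dst):
        self.edges[src].add(dst)

    def missing_types(self, pattern_types):
        # Which step types required by the experimental pattern are absent?
        present = set(self.nodes.values())
        return sorted(set(pattern_types) - present)

g = ProcessGraph()
g.add_step("s1", "sample_collection")
g.add_step("s2", "sequencing")
g.add_flow("s1", "s2")

# A microbiome pattern might also require QC and taxonomic assignment:
pattern = ["sample_collection", "sequencing", "qc", "taxonomic_assignment"]
print(g.missing_types(pattern))  # -> ['qc', 'taxonomic_assignment']
```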
27. 2 - Mapping questions to experimental process representations
Inputs: questions & goals; the process-oriented representation of experimental data-sets.
a) Map the questions to the process-oriented graph
b) Map the questions to the data (the experimental process)
c) Identify gaps
28. 3 - Cross-mapping and identification of enrichment data sources based on value
Internal & external sources: publications and references; meta-data (ad hoc, nomenclatures, ontologies).
Use e[discover] to identify and select sources of data enrichment based on the value they add for the questions.
29. 4 - Semantic enrichment as inductive data weaving and context mesh entailment
Internal & external sources: publications and references; meta-data (ad hoc, nomenclatures, ontologies).
31. Requirements for data modelling and management
• Graph data structures and models are a natural fit (network science)
• Support for multi-layered graphs (“lasagne graphs”)
• Rich semantic expressiveness
• Flexible and dynamic (meta) modelling
• Multidimensional (n-ary) relationships, not just binary
• Support for integrity constraints
• A language for both graph traversal (navigation) and computation (optimisation)
• Verifiability, or at least ease of verification
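The n-ary requirement is the one most relational and property-graph stores struggle with. It can be approximated by reifying the relationship as a first-class object holding role/player pairs; the relation type and roles below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """An n-ary relationship reified as an object, since many graph
    stores only support binary edges. Roles are (role_name, player) pairs."""
    relation_type: str
    roles: tuple

measurement = Relation(
    relation_type="expression-measurement",
    roles=(
        ("gene", "BRCA1"),
        ("tissue", "breast"),
        ("assay", "RNA-seq"),
        ("cohort", "study-42"),
    ),
)
print(dict(measurement.roles)["assay"])  # -> RNA-seq
```

A knowledge base with native n-ary relations (as Grakn provides) makes this reification step unnecessary, which is part of why it fits the requirements list above.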
32. A brief history of databases and models – key concepts
• 1960s: hierarchical and network data models. C. Bachman (IDS); IBM IMS (1968)
• 1967: object orientation (class and subclass), O. Dahl
• 1970s: the relational model, E. F. Codd. Algebraic, normalisation, functional dependencies
• 1976-77: Entity-Relationship, P. Chen & H. Tardieu. n-ary & reflexive relationships, cardinalities
• 1980s: RDBMS and SQL. Data independence (logical, physical); IBM DB2
• Early 1990s: OODBMS (O2); ER+ (integrity constraints, generalisation/specialisation, genericity), AFCET group
• 1994: UML
• Functional data model: data mining, OLAP, functional dependencies
• Early 2000s: semi-structured data, XML and XQuery, complex data types
• Late 2000s: giant datastores (Google, Yahoo, Amazon)
• Graph databases
33. Eagle Genomics platform functional architecture
Components (from the architecture diagram):
• Data access layer: adaptors factory, extract, load, quality control & cleansing, staging
• e[catalog]: data catalog & model management, with a graph builder (transform) and model builder
• Questions and valuation models: relevance (sources + entities), risk quantification, data quality
• e[discover]: data sources discovery, data finding & selection, data storage
• e[curate]: enrich, weave/entail
• Cross-cutting concerns: AI and ML functionalities, a conversational learning interface, workflow manager (e[hive]), security management, change management and audit trail
34. Semantics Engine key modules
• Data access layer: adaptors factory
• e[catalog]: data catalog & model management, with the model builder
• Questions and valuation models: relevance (sources + entities), risk quantification, data quality
• e[discover] with AI and ML functionalities
• Metadata, ontologies and reference data
35. Data Access and Management Layer
• Provides a unification layer for data access and management
• Allows for distributed and federated databases
• Structured data is added, with its schema, through the Ingest API; structured queries are handled through the Search API
• Data is stored in data stores (RDBMS, GraphDB / ontology store, file store)
• High-level queries are mapped, using schemas and ontologies, to database-level queries, which translate to direct database requests
• Schema mapping enables data integration at this level
• The Materialization Engine persists the results of the most used queries, as well as expected queries, to the database, catering for efficiency and scalability
• Ensures efficient and scalable access to structured data
• Scalability is designed in rather than tested in
APIs and infrastructure: Ingest API, Bulk Data API, Metadata API, Search API; caching; Graql over Grakn DB.
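The materialisation idea can be sketched as a thin layer that counts query hits and persists results once a query is "hot". The threshold policy and the API below are placeholders, not the platform's actual interface:

```python
class MaterialisationEngine:
    """Persist results of frequently used queries (illustrative sketch)."""

    def __init__(self, run_query, threshold=3):
        self.run_query = run_query        # callable hitting the backing store
        self.threshold = threshold        # materialise after this many hits
        self.counts = {}
        self.materialised = {}

    def execute(self, query):
        if query in self.materialised:    # served without touching the store
            return self.materialised[query]
        self.counts[query] = self.counts.get(query, 0) + 1
        result = self.run_query(query)
        if self.counts[query] >= self.threshold:
            self.materialised[query] = result
        return result
```

A real engine would also invalidate materialised results when the underlying data changes; that bookkeeping is omitted here.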
37. The Future: Automating the Data Scientist, providing data-science-as-a-service
Personas: Raj (bioinformatician, R&D lead); Jennifer (biologist); Tony (director, scientific innovation).
Workflow steps (from the diagram): clarify goals; determine investigation/studies; conduct experiment; load raw data into e[curate]; generate source report; value with e[discover]; ask questions; generate map; analyze data; view/refine; analyze all data; generate study report; generate report data; think of new questions/studies; refine/ask new questions.
Benefits:
• Reduce inertia
• Increase speed to insight
• Reduce time & cost
• Leverage stranded data assets
• Data science at the fingertips of the biologist
External data sources: Ensembl, PubMed, UniProt, ClinicalTrials.gov, etc.
40. Screenshot from Eagle Platform Demo:
Semantic enrichment and contextual data tagging
41. Screenshot from Eagle Platform Demo:
Question Recommendations and Formulation, based on Previous Analyses/Studies
42. Screenshot from Eagle Platform Demo:
Key Entities, Relationships, and Associated Evidence
43. Unilever: Eagle's Platform accelerating project timelines across global teams
“Unilever's digital data program now processes genetic sequences twenty times faster - without incurring higher compute costs. In addition, its robust architecture supports ten times as many scientists, all working simultaneously.”
Pete Keeley, eScience Technical Lead - R&D IT at Unilever

|            | Past: stand-alone | Present: server/cluster | Emerging: secure scalable cloud |
|------------|-------------------|-------------------------|---------------------------------|
| Turnaround | Weeks / Months    | Days / Weeks            | Hours                           |
| Scale      | 100K reads        |                         | 5 billion reads                 |
44. Model Builder and Metamodel management
• Generic meta models are more evolvable, but lack semantic expressiveness and are hard to implement efficiently (dynamic ontologies)
• Specific models are semantically richer, but tend to be rigid (static ontologies)
The model builder solves this dilemma of performance versus genericity:
• Analogous to blending a compiler and an interpreter
• A virtual machine for models
• Bridges the generic meta model & ontologies to semantically rich specific models, via the connectors & wrappers factory
45. Ontologies and Ontology Management
Ontologies are necessary components, but not sufficient for semantic resolution and management:
• Multitude of ontologies
• Conflicts between ontologies (incoherence)
• Noisy ontologies (inconsistency)
From static to dynamic ontologies:
• Managing ontologies as assets
• Valuation models applied to ontologies, based on questions and context
• Use of machine learning to dynamically construct ontologies