Is one enough?
Data warehousing for biomedical research
Gregory Landrum¹, Matthias Wrobel², Nicholas Clare²
¹ KNIME.com AG
² Novartis Institutes for BioMedical Research, Basel
2016 Basel Life Sciences Week
21 September 2016
Overview
§ Motivation: why is this both important and hard?
§ Three data warehouse case studies
§ Is one enough?
Challenges for real world data management and
analysis
§ Lots of heterogeneous data from multiple sources, both
internal and external
§ Source data are frequently messy and unstructured
§ Constant flow of new data into the system
§ Diverse stakeholders and users
§ Highly diverse and complex questions to ask of the data
§ Serious performance requirements
Storing and managing real-world data
§ Warehouse vs mart vs federation vs “data lake” vs linked
data/triple store vs …
§ Many, many different approaches, technologies, and
architectures.
§ Most are applicable in some scenarios but there is no
silver bullet.
Stonebraker, Michael, and Uğur Çetintemel. “‘One size fits all’: an idea whose
time has come and gone.” Proceedings of the 21st International Conference on
Data Engineering (ICDE 2005). IEEE, 2005.
https://cs.brown.edu/~ugur/fits_all.pdf
Storing the data isn’t the end of the story
You probably want to be able to get the data back out.
http://flickr.com/photos/35703177@N00/1063555182
Extracting insights from a data lake
Jokes aside, allowing the data to be queried
and retrieved efficiently is essential
Nature of the data
Shape of the data generated for a project
Hit finding
10^6 rows, 1-2 columns
Hit-to-lead
10^3 rows, 5-10 columns
Lead optimization
10^2 rows, 10^2 columns
Clinic
1 row, 10^4 columns
“omics” data (can appear at multiple stages) is different still
Query/report type 1:
Specialized search
Many rows
Few columns
Query/report type 2:
Basic search
Few rows
Many columns
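The two query shapes can be sketched as follows. This is a minimal illustration, assuming a hypothetical long-format `results` table; the table, column, and assay names are invented for the example and are not taken from any of the warehouses discussed here.

```python
import sqlite3

# Hypothetical long-format results table; all names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (compound_id TEXT, assay TEXT, value REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("C1", "assay_A", 0.12), ("C2", "assay_A", 3.4), ("C3", "assay_A", 7.8),
        ("C1", "assay_B", 55.0), ("C1", "assay_C", 1.9),
    ],
)

def specialized_search(assay):
    """Type 1: many rows, few columns (every compound measured in one assay)."""
    cur = conn.execute(
        "SELECT compound_id, value FROM results WHERE assay = ? ORDER BY compound_id",
        (assay,),
    )
    return cur.fetchall()

def basic_search(compound_id):
    """Type 2: few rows, many columns (one compound, one column per assay)."""
    cur = conn.execute(
        "SELECT assay, value FROM results WHERE compound_id = ? ORDER BY assay",
        (compound_id,),
    )
    row = {"compound_id": compound_id}
    row.update(dict(cur.fetchall()))
    return row

tall = specialized_search("assay_A")   # three rows, two columns
wide = basic_search("C1")              # one row, four columns
```

The same stored data serves both reports; what differs is whether the result is filtered down to one assay or pivoted around one compound.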
It’s more than just standard queries and reports
§ We also want to enable data scientists (informaticians)
§ They will generally want to ask more complex and varied
questions
§ They will likely want to retrieve larger quantities of data
§ It would be great to help them with their 80% problem
The 80% problem
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
“Data scientists, according to interviews and expert estimates, spend from 50
percent to 80 percent of their time mired in this more mundane labor of
collecting and preparing unruly digital data, before it can be explored for
useful nuggets.”
Real-world case studies
§ Avalon:
• Productive, maintained, and in active use for 15 years.
§ MAGMA:
• Productive, maintained, and in active use for >5 years.
§ Entity Warehouse (EW):
• In active development
Avalon
§ One table/view per “fact_type” (maps roughly to assay)
• Typical table has about 10 columns
• Big table has about 100 columns
§ One row per measurement
• 10s of rows for short-lived assays
• Typically hundreds to thousands of rows
• More than a million rows for HTS
§ ~30K tables/views
§ Additional tables defining structure of the fact tables
§ Little metadata
§ Tightly coupled to a UI
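A layout of this kind (one table per fact type, plus tables describing the fact tables) might look roughly like the sketch below. It is an assumption-laden illustration, not Avalon's actual schema: the `fact_type_catalog` table, the `create_fact_table` helper, and the assay and column names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical catalog describing each fact table (stands in for the
# "additional tables defining structure of the fact tables").
conn.execute(
    "CREATE TABLE fact_type_catalog (fact_type TEXT, table_name TEXT, column_name TEXT)"
)

def create_fact_table(fact_type, columns):
    """Create one table for a fact type and register its columns in the catalog."""
    table = f"fact_{fact_type}"
    # Identifiers cannot be bound as SQL parameters, so validate before formatting.
    for name in [table, *columns]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    col_sql = ", ".join(f"{c} REAL" for c in columns)
    conn.execute(f"CREATE TABLE {table} (compound_id TEXT, {col_sql})")
    conn.executemany(
        "INSERT INTO fact_type_catalog VALUES (?, ?, ?)",
        [(fact_type, table, c) for c in columns],
    )
    return table

kinase = create_fact_table("kinase_ic50", ["ic50_um", "hill_slope"])
sol = create_fact_table("solubility", ["solubility_mg_ml"])

# One row per measurement in the table for that fact type.
conn.execute(f"INSERT INTO {kinase} VALUES ('C1', 0.25, 1.1)")

tables = [r[0] for r in conn.execute(
    "SELECT DISTINCT table_name FROM fact_type_catalog ORDER BY table_name")]
```

In a layout like this, a question that spans assays first has to consult the catalog to find out which tables and columns are relevant.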
MAGMA
§ Intended to be “the” warehouse
§ Similar type of schema as ChEMBL
§ Results stored in a tall and skinny table
§ Columns for all primitive data types (string, float, int, etc)
§ ~2 billion rows
§ Tables with metadata
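A tall and skinny results table with one value column per primitive type could be sketched like this. The table and column names are assumptions made for illustration; the actual MAGMA schema is not shown in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per result; exactly one of the typed value columns is populated.
conn.execute("""
    CREATE TABLE result (
        compound_id TEXT,
        result_type TEXT,
        value_string TEXT,
        value_float REAL,
        value_int INTEGER
    )
""")
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?, ?)",
    [
        ("C1", "ic50_um", None, 0.25, None),
        ("C1", "n_replicates", None, None, 3),
        ("C1", "comment", "precipitated", None, None),
        ("C2", "ic50_um", None, 1.7, None),
    ],
)

def pivot(compound_id):
    """Turn the tall rows for one compound into a single wide record."""
    cur = conn.execute(
        "SELECT result_type, COALESCE(value_string, value_float, value_int) "
        "FROM result WHERE compound_id = ?",
        (compound_id,),
    )
    return dict(cur.fetchall())

wide_c1 = pivot("C1")
```

The schema never changes when a new result type appears, but any "few rows, many columns" report has to be assembled by pivoting at query time.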
Entity Warehouse
§ Designed to accommodate both internal and external data
§ Central concepts are the entity and entity-entity linkage
§ “Assays” stored as entities with info about their result types and
associated metadata
§ One table per result type (e.g. Activity-Concentration, Activity-Percent)
• About a dozen result types
§ One row per measurement
Current size:
• 10s of millions of rows for Activity-Concentration
• ~100 million rows for Activity-Percent
§ Links/drilldown to original data/systems.
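The "one table per result type, one row per measurement" idea can be sketched in a few lines. Everything below is hypothetical: the `Measurement` fields, the identifier formats, and the `source_ref` strings are placeholders chosen for the example, and plain lists stand in for database tables.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    assay_id: str      # the assay is itself stored as an entity
    compound_id: str
    result_type: str   # e.g. "Activity-Concentration"
    value: float
    unit: str
    source_ref: str    # drilldown link to the original system (made-up format)

# One "table" per result type; a list stands in for a database table here.
tables = defaultdict(list)

def load(measurements):
    """Append each measurement to the table for its result type."""
    for m in measurements:
        tables[m.result_type].append(m)

load([
    Measurement("assay:1", "cmpd:1", "Activity-Concentration", 0.25, "uM", "src://a/1"),
    Measurement("assay:1", "cmpd:2", "Activity-Concentration", 1.7, "uM", "src://a/2"),
    Measurement("assay:2", "cmpd:1", "Activity-Percent", 83.0, "%", "src://b/9"),
])

row_counts = {name: len(rows) for name, rows in tables.items()}
```

Compared with one table per assay, the number of tables stays small (about a dozen) while each table grows to many millions of rows.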
Entity Warehouse: the entity
§ Used to represent the business objects (scientific or otherwise) of
interest
• Compounds
• Samples
• “Assays”
• Proteins
• Projects
• People
• Assay results
• Documents
• etc…
§ Model for entity type stored in a central location
§ Entities can be linked and grouped
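As a rough illustration of entities and entity-entity links, here is a toy in-memory store. The class names, identifier formats, and link types are assumptions for the sketch and do not describe the Entity Warehouse implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str
    entity_type: str                      # e.g. "Compound", "Sample", "Project"
    properties: dict = field(default_factory=dict)

class EntityStore:
    """Toy store: entities plus typed, directed links between them."""

    def __init__(self):
        self.entities = {}
        self.links = []                   # (source_id, link_type, target_id)

    def add(self, entity):
        self.entities[entity.entity_id] = entity

    def link(self, source_id, link_type, target_id):
        # Refuse dangling links so that every link resolves to a known entity.
        for eid in (source_id, target_id):
            if eid not in self.entities:
                raise KeyError(f"unknown entity: {eid}")
        self.links.append((source_id, link_type, target_id))

    def linked(self, source_id, link_type):
        """Return the entities reachable from source_id via links of one type."""
        return [self.entities[t] for s, lt, t in self.links
                if s == source_id and lt == link_type]

store = EntityStore()
store.add(Entity("cmpd:1", "Compound", {"name": "example compound"}))
store.add(Entity("smpl:7", "Sample", {"batch": "A"}))
store.add(Entity("proj:3", "Project"))
store.link("smpl:7", "sample_of", "cmpd:1")
store.link("cmpd:1", "member_of", "proj:3")

parents = store.linked("smpl:7", "sample_of")
```

Because compounds, samples, projects, and people all share one representation, a single linking mechanism covers every relationship between them.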
Example entity: the small molecule concept
Example Composite Field Type: Activity-Concentration
Loading the data¹
§ The Entity Warehouse is only one part of a large, multi-year data
integration project.
§ The majority of the thought and effort has gone into how to properly
integrate heterogeneous internal and external data sources
§ Conversion to entities, link resolution, some normalization
§ Preservation of links to original data systems
§ Strong focus on performance/timeliness of the load
§ Once the data are loaded: make it broadly accessible (helping with
that 80% rule for data scientists)
¹ The 80% rule affects us too.
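The load steps listed above (conversion to entities, link resolution, some normalization, preserving links to the source) can be sketched as below. The record layouts, the `CMPD-` identifiers, and the unit table are hypothetical and exist only to make the example runnable.

```python
# Hypothetical records from two source systems that spell the same ID differently.
registration = [{"reg_no": "CMPD-0001", "smiles": "CCO"}]
assay_data = [
    {"compound": "cmpd-0001 ", "ic50": "250", "unit": "nM", "row_id": 17},
    {"compound": "CMPD-9999", "ic50": "1.2", "unit": "uM", "row_id": 18},
]

TO_MICROMOLAR = {"nM": 1e-3, "uM": 1.0, "mM": 1e3}

def normalize_key(raw):
    """Normalize an identifier so that records from different sources match."""
    return raw.strip().upper()

def load(registration, assay_data):
    # Conversion to entities: one compound entity per registration record.
    compounds = {normalize_key(r["reg_no"]): {"type": "Compound", **r}
                 for r in registration}
    facts, unresolved = [], []
    for rec in assay_data:
        key = normalize_key(rec["compound"])
        if key not in compounds:
            # Link resolution failed; keep the row id so it can be curated later.
            unresolved.append(rec["row_id"])
            continue
        facts.append({
            "compound": key,
            # Normalization: store every concentration in micromolar.
            "ic50_uM": float(rec["ic50"]) * TO_MICROMOLAR[rec["unit"]],
            # Preserve a link back to the original row in the source system.
            "source": ("assay_data", rec["row_id"]),
        })
    return compounds, facts, unresolved

compounds, facts, unresolved = load(registration, assay_data)
```

Rows that fail link resolution are set aside rather than dropped, so they can be fixed in a curation step and picked up by a later load.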
CDF architecture
[Figure: layered architecture diagram. Layers: Source Layer, Consolidation Layer, Integration Layer, Access Layer. Labeled components: Update Services; Entity Services; Entity Warehouse; Search Indexes; Custom Datamart; …; Entities, Assays, Facts, Workflow; Registration systems; Assay metadata systems; Assay data systems; Logistics systems; Curation Framework; Entity Instance Reference; Entity & Property Definitions; Fact Instance Reference; Visualization/reporting tools and user interfaces.]
One size really doesn’t fit all
§ Just as there is no perfect database
technology for all situations, we don't
think that there's a perfect research
data warehouse for all use cases.
§ The Entity Warehouse will contain
most of the data and meet 90% of the
needs,¹ but there will still end up
being multiple “warehouses”
§ We will encourage and support the
building and use of data marts by data
scientists and will make it easy to keep
them up to date
§ The warehouse(s) is/are just one piece
of the full data ecosystem
https://pixabay.com/en/dyke-road-hamburg-port-homes-41832/
¹ At least we hope so. When it comes to
enabling broad usage of the various types
of 'omics data, we'll need to see.
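One way to keep a project-specific data mart up to date is an incremental refresh driven by a load sequence number. The sketch below assumes hypothetical `warehouse` and `mart` tables with a `load_seq` column; it is one possible approach, not the mechanism used in the project described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse (id INTEGER PRIMARY KEY, project TEXT, value REAL, load_seq INTEGER);
    CREATE TABLE mart (id INTEGER PRIMARY KEY, value REAL, load_seq INTEGER);
""")

def refresh_mart(project):
    """Copy this project's warehouse rows that are newer than the mart's high-water mark.

    Returns the number of rows added to the mart.
    """
    (mark,) = conn.execute("SELECT COALESCE(MAX(load_seq), 0) FROM mart").fetchone()
    (before,) = conn.execute("SELECT COUNT(*) FROM mart").fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO mart (id, value, load_seq) "
        "SELECT id, value, load_seq FROM warehouse "
        "WHERE project = ? AND load_seq > ?",
        (project, mark),
    )
    (after,) = conn.execute("SELECT COUNT(*) FROM mart").fetchone()
    return after - before

conn.executemany(
    "INSERT INTO warehouse VALUES (?, ?, ?, ?)",
    [(1, "P1", 0.5, 1), (2, "P2", 9.9, 1), (3, "P1", 0.7, 2)],
)
first = refresh_mart("P1")     # copies rows 1 and 3
conn.execute("INSERT INTO warehouse VALUES (4, 'P1', 0.9, 3)")
second = refresh_mart("P1")    # copies only the new row 4
mart_ids = [r[0] for r in conn.execute("SELECT id FROM mart ORDER BY id")]
```

Only rows loaded since the previous refresh are copied, so the mart can be refreshed often without re-reading the whole warehouse.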
One size really doesn’t fit all
§ Maybe this is actually a hopeful
message from the point of view of
a possible standardized
warehouse
§ If there’s only one warehouse, it’s
probably going to be *mine*
§ If I’m using more than one
“warehouse”, then I’m much more
willing to talk about using
something standardized for one of
them
Acknowledgements
Past and present members of the
Avalon, MAGMA, and CDF teams:
Bernd Rohde
Joe Ringgenberg
Mathias Asp
Andre Zelenkovas
Ryan Muller
Sandra Mueller
Artem Mitrokhin
Recca Chatterjee
Nabil Hachem
Andreas Koeller
Mark Schreiber
Barry Frishberg
Thomas Mueller
Alberto Gobbi
Peter Ertl
Paul Selzer
Werner Braun
and many more
…
Past and present members of
NIBR NX leadership:
Remy Evard
Steve Cleaver
Ken Robbins
Patrick Warren
Andy Palmer

How Do You Build and Validate 1500 Models and What Can You Learn from Them?
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...
 
Processing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorialProcessing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorial
 
Some "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontSome "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data front
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
 

Recently uploaded

THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptxGood agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptxSimeonChristian
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 

Recently uploaded (20)

THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptxGood agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 

Is one enough? Data warehousing for biomedical research

  • 1. Is one enough? Data warehousing for biomedical research
      Gregory Landrum¹, Matthias Wrobel², Nicholas Clare²
      ¹ KNIME.com AG
      ² Novartis Institutes for BioMedical Research, Basel
      2016 Basel Life Sciences Week, 21 September 2016
  • 2. Overview
      § Motivation: why is this both important and hard?
      § Three data warehouse case studies
      § Is one enough?
  • 3. Challenges for real-world data management and analysis
      § Lots of heterogeneous data from multiple sources, both internal and external
      § Source data are frequently messy and unstructured
      § Constant flow of new data into the system
      § Diverse stakeholders and users
      § Highly diverse and complex questions to ask of the data
      § Serious performance requirements
  • 4. Storing and managing real-world data
      § Warehouse vs mart vs federation vs “data lake” vs linked data/triple store vs …
      § Many, many different approaches, technologies, and architectures.
      § Most are applicable in some scenarios but there is no silver bullet.
      Stonebraker, Michael, and Uğur Çetintemel. “‘One size fits all’: an idea whose time has come and gone.” Proceedings of the 21st International Conference on Data Engineering (ICDE 2005). IEEE, 2005. https://cs.brown.edu/~ugur/fits_all.pdf
  • 5. Storing the data isn’t the end of the story
      You probably want to be able to get the data back out.
  • 6. Storing the data isn’t the end of the story
      You probably want to be able to get the data back out.
      Image (captioned “Extracting insights from a data lake”): http://flickr.com/photos/35703177@N00/1063555182
      Jokes aside, allowing the data to be queried and retrieved efficiently is essential.
  • 7. Nature of the data
      Shape of the data generated for a project:
      § Hit finding: 10⁶ rows, 1-2 columns
      § Hit-to-lead: 10³ rows, 5-10 columns
      § Lead optimization: 10² rows, 10² columns
      § Clinic: 1 row, 10⁴ columns
      “omics” data (which can appear at multiple stages) is different still.
  • 8. Query/report type 1: Specialized search
      Many rows, few columns
  • 9. Query/report type 2: Basic search
      Few rows, many columns
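The two query shapes on slides 8 and 9 can be illustrated with a small sketch. The table name, column names, compound IDs, and values below are all invented for illustration and are not taken from any of the warehouses discussed in the talk; the point is only that the same stored results are read back in two very different shapes.

```python
import sqlite3

# Hypothetical toy data: one row per (compound, assay) measurement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (compound_id TEXT, assay TEXT, value REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("CPD-1", "assay_a", 0.5),
        ("CPD-1", "assay_b", 12.0),
        ("CPD-1", "assay_c", 3.1),
        ("CPD-2", "assay_a", 7.2),
        ("CPD-3", "assay_a", 0.1),
    ],
)


def specialized_search(conn, assay, max_value):
    """Many rows, few columns: all compounds at or below a threshold in one assay."""
    cur = conn.execute(
        "SELECT compound_id, value FROM results "
        "WHERE assay = ? AND value <= ? ORDER BY value",
        (assay, max_value),
    )
    return cur.fetchall()


def basic_search(conn, compound_id):
    """Few rows, many columns: everything known about one compound as one wide record."""
    cur = conn.execute(
        "SELECT assay, value FROM results WHERE compound_id = ? ORDER BY assay",
        (compound_id,),
    )
    record = {"compound_id": compound_id}
    record.update(dict(cur.fetchall()))
    return record
```

In this toy setting, `specialized_search(conn, "assay_a", 1.0)` returns one narrow row per matching compound, while `basic_search(conn, "CPD-1")` returns a single record whose width grows with the number of assays run on that compound.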
  • 10. It’s more than just standard queries and reports
      § We also want to enable data scientists (informaticians)
      § They will generally want to ask more complex and varied questions
      § They will likely want to retrieve larger quantities of data
      § It would be great to help them with their 80% problem
  • 11. The 80% problem
      “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
      http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
  • 12. Real-world case studies
      § Avalon: in production, maintained, and in active use for 15 years
      § MAGMA: in production, maintained, and in active use for >5 years
      § Entity Warehouse (EW): in active development
  • 13. Avalon
      § One table/view per “fact_type” (maps roughly to assay)
          • Typical table has about 10 columns
          • Big table has about 100 columns
      § One row per measurement
          • 10s of rows for short-lived assays
          • Typically hundreds to thousands of rows
          • More than a million rows for HTS
      § ~30K tables/views
      § Additional tables defining the structure of the fact tables
      § Little metadata
      § Tightly coupled to a UI
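As a rough illustration of the "one table per fact type, plus tables describing the fact tables" pattern, the sketch below generates fact tables from a small catalog. This is not the actual Avalon schema: the catalog table `fact_type_columns`, the `fact_` prefix, and the two example fact types are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical catalog: one row per column of each fact table.
conn.execute(
    "CREATE TABLE fact_type_columns (fact_type TEXT, column_name TEXT, sql_type TEXT)"
)
conn.executemany(
    "INSERT INTO fact_type_columns VALUES (?, ?, ?)",
    [
        ("ic50_kinase_x", "compound_id", "TEXT"),
        ("ic50_kinase_x", "ic50_um", "REAL"),
        ("solubility", "compound_id", "TEXT"),
        ("solubility", "ph", "REAL"),
        ("solubility", "solubility_mg_l", "REAL"),
    ],
)


def create_fact_tables(conn):
    """Create one table per fact type, with columns taken from the catalog.

    Identifiers are interpolated into the SQL, so the catalog must be trusted.
    """
    layout = {}
    cur = conn.execute(
        "SELECT fact_type, column_name, sql_type FROM fact_type_columns ORDER BY rowid"
    )
    for fact_type, column_name, sql_type in cur:
        layout.setdefault(fact_type, []).append(f'"{column_name}" {sql_type}')
    for fact_type, columns in layout.items():
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "fact_{fact_type}" ({", ".join(columns)})'
        )
    return sorted(layout)
```

A design like this keeps each assay's table narrow and easy to read, at the cost of a table count that grows with the number of assays, which is consistent with the ~30K tables/views mentioned on the slide.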
  • 14. MAGMA
      § Intended to be “the” warehouse
      § Same type of schema as ChEMBL
      § Results stored in a tall and skinny table
      § Columns for all primitive data types (string, float, int, etc.)
      § ~2 billion rows
      § Tables with metadata
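A tall and skinny results table with one value column per primitive type might look roughly like the sketch below. The table and column names are hypothetical and are not the MAGMA schema; the example only shows how each value lands in the column matching its type and how a wide per-compound record can be pivoted back out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tall/skinny layout: one row per result, one value column per type.
conn.execute(
    """
    CREATE TABLE result (
        compound_id  TEXT NOT NULL,
        result_name  TEXT NOT NULL,
        value_string TEXT,
        value_float  REAL,
        value_int    INTEGER
    )
    """
)


def store(conn, compound_id, result_name, value):
    """Insert one result, placing the value in the column matching its type."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    v_string = v_float = v_int = None
    if isinstance(value, int):
        v_int = value
    elif isinstance(value, float):
        v_float = value
    else:
        v_string = str(value)
    conn.execute(
        "INSERT INTO result VALUES (?, ?, ?, ?, ?)",
        (compound_id, result_name, v_string, v_float, v_int),
    )


def pivot(conn, compound_id):
    """Collapse the tall rows for one compound into a single wide record."""
    cur = conn.execute(
        "SELECT result_name, COALESCE(value_float, value_int, value_string) "
        "FROM result WHERE compound_id = ?",
        (compound_id,),
    )
    return dict(cur.fetchall())
```

Adding a new kind of result needs no schema change here, which is the appeal of the layout; the trade-off is that every wide report requires a pivot like the one above over a very large table.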
  • 15. Entity Warehouse
      § Designed to accommodate both internal and external data
      § Central concept is the entity and entity-entity linkage
      § “Assays” stored as entities with info about their result types and associated metadata
      § One table per result type (e.g. Activity-Concentration, Activity-Percent)
          • About a dozen result types
      § One row per measurement; current size:
          • 10s of millions of rows for Activity-Concentration
          • ~100 million rows for Activity-Percent
      § Links/drilldown to original data/systems
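The entity-centric design can be pictured with a small in-memory model: typed entities, links between them, and one collection of rows per result type, each row keeping a pointer back to its source. All class, field, and identifier names below are invented for this sketch and do not describe the real Entity Warehouse implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A business object such as a compound or an assay (hypothetical model)."""
    entity_type: str
    entity_id: str


@dataclass
class ToyWarehouse:
    links: set = field(default_factory=set)            # undirected entity-entity links
    result_tables: dict = field(default_factory=dict)  # one list of rows per result type

    def link(self, a, b):
        """Record an undirected link between two entities."""
        self.links.add(frozenset((a, b)))

    def linked_to(self, entity):
        """Return every entity directly linked to the given one."""
        return {
            other
            for pair in self.links
            if entity in pair
            for other in pair
            if other != entity
        }

    def add_result(self, result_type, compound, assay, value, source_ref):
        """Append one measurement row to the table for its result type."""
        self.result_tables.setdefault(result_type, []).append(
            {"compound": compound, "assay": assay, "value": value,
             "source_ref": source_ref}  # drilldown pointer to the original system
        )
```

For example, linking a compound entity to a project entity and then calling `linked_to` on the project returns the compound, while `add_result("Activity-Concentration", ...)` and `add_result("Activity-Percent", ...)` fill two separate result tables.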
  • 16. Entity Warehouse: the entity
      § Used to represent the business objects (scientific or otherwise) of interest
          • Compounds
          • Samples
          • “Assays”
          • Proteins
          • Projects
          • People
          • Assay results
          • Documents
          • etc.
      § Model for each entity type stored in a central location
      § Entities can be linked and grouped
  • 17. Example entity: the small molecule concept
  • 18. Entity Warehouse
      § Designed to accommodate both internal and external data
      § Central concept is the entity and entity-entity linkage
      § “Assays” stored as entities with info about their result types and associated metadata
      § One table per result type (e.g. Activity-Concentration, Activity-Percent)
          • About a dozen result types
      § One row per measurement; current size:
          • 10s of millions of rows for Activity-Concentration
          • ~100 million rows for Activity-Percent
      § Links/drilldown to original data/systems
  • 19. Example Composite Field Type: Activity Concentration
  • 20. Loading the data¹
      § The Entity Warehouse is only one part of a large, multi-year data integration project.
      § The majority of the thought and effort has gone into how to properly integrate heterogeneous internal and external data sources
      § Conversion to entities, link resolution, some normalization
      § Preservation of links to original data systems
      § Strong focus on performance/timeliness of the load
      § Once the data are loaded: make them broadly accessible (helping with that 80% rule for data scientists)
      ¹ The 80% rule affects us too.
  • 21. CDF architecture (diagram)
      Layers: Source Layer, Integration Layer, Consolidation Layer, Access Layer
      Labeled components: Update Services; Entity Services; Entity Warehouse; Search Indexes; Custom Datamart; …; Entities; Assays; Facts; Workflow; Registration systems; Assay metadata systems; Assay data systems; Logistics systems; Curation Framework; Entity Instance Reference; Entity & Property Definitions; Fact Instance Reference; Visualization/reporting tools and user interfaces
  • 22. One size really doesn’t fit all
      § Just as there is no perfect database technology for all situations, we don't think that there's a perfect research data warehouse for all use cases.
      § The Entity Warehouse will contain most of the data and meet 90% of the needs,¹ but there will still end up being multiple “warehouses”
      § We will encourage and support the building and use of data marts by data scientists and will make it easy to keep them up to date
      § The warehouse(s) is/are just one piece of the full data ecosystem
      Image: https://pixabay.com/en/dyke-road-hamburg-port-homes-41832/
      ¹ At least we hope so. When it comes to enabling broad usage of the various types of 'omics data we'll need to see
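One simple way to keep a data mart up to date is an incremental refresh that copies only the warehouse rows loaded since the last refresh. The sketch below assumes a monotonically increasing `load_id` on the warehouse side; that column, the table names, and the project filter are assumptions for illustration, not a description of how the marts in the talk are maintained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical warehouse table: load_id increases with every loaded row.
    CREATE TABLE warehouse_result (
        load_id INTEGER PRIMARY KEY, project TEXT, compound_id TEXT, value REAL
    );
    -- Hypothetical project-specific data mart holding a subset of those rows.
    CREATE TABLE mart_result (
        load_id INTEGER PRIMARY KEY, compound_id TEXT, value REAL
    );
    """
)


def refresh_mart(conn, project):
    """Copy rows for one project that are newer than anything already in the mart.

    Handles appended rows only; updates and deletes would need extra bookkeeping.
    Returns the number of rows copied.
    """
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(load_id), 0) FROM mart_result"
    ).fetchone()
    before = conn.total_changes
    conn.execute(
        "INSERT INTO mart_result (load_id, compound_id, value) "
        "SELECT load_id, compound_id, value FROM warehouse_result "
        "WHERE project = ? AND load_id > ?",
        (project, watermark),
    )
    conn.commit()
    return conn.total_changes - before
```

Calling `refresh_mart` repeatedly is cheap when nothing has changed, since the second call finds no rows above the watermark and copies nothing.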
  • 23. One size really doesn’t fit all
      § Maybe this is actually a hopeful message from the point of view of a possible standardized warehouse
      § If there’s only one warehouse, it’s probably going to be *mine*
      § If I’m using more than one “warehouse”, then I’m much more willing to talk about using something standardized for one of them
      Image: https://pixabay.com/en/dyke-road-hamburg-port-homes-41832/
  • 24. Acknowledgements
      Past and present members of the Avalon, MAGMA, and CDF teams: Bernd Rohde, Joe Ringgenberg, Mathias Asp, Andre Zelenkovas, Ryan Muller, Sandra Mueller, Artem Mitrokhin, Recca Chatterjee, Nabil Hachem, Andreas Koeller, Mark Schreiber, Barry Frishberg, Thomas Mueller, Alberto Gobbi, Peter Ertl, Paul Selzer, Werner Braun, and many more …
      Past and present members of NIBR NX leadership: Remy Evard, Steve Cleaver, Ken Robbins, Patrick Warren, Andy Palmer