1
Data Café — A Platform For Creating
Biomedical Data Lakes
Pradeeban Kathiravelu1,2, Ameen Kazerouni2, Ashish Sharma2
1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA
www.sharmalab.info
2
Data Landscape
for Precision Medicine
DATA
CHARACTERISTICS
• Large number of small datasets
• Structured…Semi-structured
…Unstructured…Ill formed
• Noisy and Fuzzy/Uncertain
• Spatial, Temporal relationships
DATA MANAGEMENT
• Variety in storage and messaging
protocols
• No shared interface
3
Illustrative Use Case
Execute a Radiogenomics workflow on the diffusion images of GBM
patients who received a TMZ + experimental regimen with an overall
survival of 18 months or more.
PACS + EMR + AIM + RT + Molecular
4
Motivation
• Most current solutions require a DBA to initiate the migration of data into
a data warehousing environment
• in order to query and explore all the data at once.
• Costly to set up such warehouses.
• Unified warehouse with access to query and explore the data.
• Limitations
• Limited scalability and extensibility to incorporate new data sources.
• Requires a priori knowledge of the data models of the different data sources.
BIOMEDICAL DATA LAKES
• Cohort Discovery and Creation — Assembled per-study
• Heterogeneous data collected in a loosely structured fashion.
• Agile and easy to create.
• Integrate with data exploration/visualization via REST APIs.
• Problem or hypothesis specific virtual data set.
• Powered by Drill + HDFS, Data Sources via APIs.
6
Data Café
• An agile approach to creating and extending the concept of a star
schema
• to model a problem/hypothesis specific dataset.
• by leveraging Apache Drill to easily query the data.
• Tackles the limitations in the existing approaches.
• Provides researchers the ability to add new data models and sources.
7
Core Concepts
Step 1. Given a set of data sources,
create a graphical representation of
the join attributes.
This graph represents how data is
connected across the various data
sources
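A minimal sketch of Step 1, assuming that each data source can list its attribute names and that two sources are joinable when they share an attribute name. The source names, attribute names, and the `build_join_graph` helper are hypothetical and only illustrate the idea; they are not taken from the Data Café code base.

```python
from itertools import combinations

# Hypothetical sources and the attributes each one exposes.
sources = {
    "A": {"id1", "A1"},
    "B": {"id1", "id2", "B2"},
    "C": {"id2", "C1"},
}


def build_join_graph(sources):
    """Return {(source, source): shared attributes} for joinable pairs.

    Assumption: an attribute present in two sources under the same
    name is a join attribute between them.
    """
    edges = {}
    for left, right in combinations(sorted(sources), 2):
        shared = sources[left] & sources[right]
        if shared:
            edges[(left, right)] = shared
    return edges


join_graph = build_join_graph(sources)
for (left, right), attrs in join_graph.items():
    print(left, "<->", right, "via", sorted(attrs))
```

Each edge of the resulting graph records which join attributes connect two sources; sources with no shared attribute (here A and C) are simply not connected.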
8
Core Concepts
Step 2. Run a set of parallel queries on
the data sources that include the
attributes that are present in the
query graph.
In the top figure, our query is of type:
{id1: A1 > x and B2 == y}
We run similar queries across C, D and
E and retrieve the set of relevant ids
(join attributes).
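Step 2 could be sketched as follows, assuming in-memory lists stand in for the real data sources and a thread pool stands in for the parallel query execution. The records, the thresholds `x` and `y`, and the helper names are hypothetical; a real deployment would send these filters to the underlying stores instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two data sources that share the join attribute id1.
source_a = [{"id1": 1, "A1": 5}, {"id1": 2, "A1": 12}, {"id1": 3, "A1": 20}]
source_b = [{"id1": 2, "B2": "y"}, {"id1": 3, "B2": "n"}, {"id1": 4, "B2": "y"}]

x, y = 10, "y"  # hypothetical query constants

# One filter per source, mirroring {id1: A1 > x and B2 == y}.
queries = {
    "A": (source_a, lambda record: record["A1"] > x),
    "B": (source_b, lambda record: record["B2"] == y),
}


def matching_ids(records, join_attr, predicate):
    """Return the join-attribute values of the records that satisfy the filter."""
    return {record[join_attr] for record in records if predicate(record)}


def run_parallel(queries, join_attr="id1"):
    """Run every per-source query concurrently and collect the id sets."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(matching_ids, records, join_attr, predicate)
            for name, (records, predicate) in queries.items()
        }
        return {name: future.result() for name, future in futures.items()}


ids_per_source = run_parallel(queries)
print(ids_per_source)
```

Only the join-attribute values travel back from each source at this stage, which keeps the intermediate results small.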
9
Core Concepts
Step 3. Compute the intersection across
the various ids (join attributes). The
data of interest can now be obtained
using the ids in this intersection.
A subsequent query will allow us to
stream, in parallel, data from
individual sources, given the relevant
ids (join attributes)
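A minimal sketch of Step 3, assuming the per-source id sets have already been retrieved. The example id sets, the records, and the `intersect_ids` and `stream_records` helpers are hypothetical illustrations rather than the platform's actual API.

```python
# Hypothetical id sets returned by the per-source queries.
ids_per_source = {"A": {2, 3}, "B": {2, 4}, "C": {1, 2, 4}}

# Hypothetical stand-in for one of the data sources.
source_a = [{"id1": 1, "A1": 5}, {"id1": 2, "A1": 12}, {"id1": 3, "A1": 20}]


def intersect_ids(ids_per_source):
    """Return the ids that satisfy the query in every source."""
    id_sets = list(ids_per_source.values())
    return set.intersection(*id_sets) if id_sets else set()


def stream_records(records, join_attr, wanted_ids):
    """Lazily yield only the records whose join attribute is in the intersection."""
    for record in records:
        if record[join_attr] in wanted_ids:
            yield record


common_ids = intersect_ids(ids_per_source)
cohort_a = list(stream_records(source_a, "id1", common_ids))
print(common_ids, cohort_a)
```

Because each source is filtered independently by the same id set, the final retrieval can run once per source in parallel.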
10
Data Café Architecture
11
Apache Drill
• Variety – Query a range of non-relational data sources.
• Flexibility.
• Agility – Faster Insights.
• Scalability.
12
Evaluation Environment
• Data Café was deployed along with the data sources and Drill in Amazon
EC2.
• MongoDB instantiated in EC2 instances.
• Hive on Amazon EMR (Elastic MapReduce).
• EMR HDFS was configured with 3 nodes.
• Various datasets for evaluation
• Two synthetic datasets.
• Clinical Data from the TCGA BRCA collection
13
Results
• Quick creation of data lakes
• without prior knowledge of the data schema.
• Very fast execution of large queries
• with Apache Drill.
• Data Café can be an efficient platform for exploring an integrated data
source.
• Constructing the integrated data source may be time-consuming.
• But this step is less critical to overall performance:
• it is done less frequently than the data queries from HDFS/Hive using Drill.
14
Conclusion
• A novel platform for integrating multiple data sources.
• Without a priori knowledge of the data models of the sources that are being
integrated.
• Indices to do the actual integration
• Enables parallelizing the push of the actual data into HDFS.
• Apache Drill as a fast query execution engine that supports SQL.
• Currently ingesting data from TCGA.
15
Current State and Future Plans
• Ongoing efforts to evaluate the platform with diverse and heterogeneous data
sources.
• Expanding to a larger multi-node distributed cluster.
• Integration with DataScope.
• Multiple data stores and larger data sets.
• Integration with imaging clients such as caMicroscope, as well as archives such
as The Cancer Imaging Archive (TCIA).
Acknowledgements
Google Summer of Code 2015
NCIP/Leidos 14X138, caMicroscope
— A Digital Pathology Integrative
Query System; Ashish Sharma PI
Emory/WUSTL/Stony Brook
NCI U01 [1U01CA187013-01],
Resources for development and
validation of Radiomic Analyses &
Adaptive Therapy, Fred Prior, Ashish
Sharma (UAMS, Emory)
The results published here are in part
based upon data generated by the
TCGA Research Network:
http://cancergenome.nih.gov/
For more information
including recent updates
please visit:
www.sharmalab.info
ashish.sharma@emory.edu

Mais conteúdo relacionado

Mais procurados

From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...Databricks
 
Data cloud lab version v.001.2020
Data cloud lab version v.001.2020Data cloud lab version v.001.2020
Data cloud lab version v.001.2020mdcdwh
 
Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Matteo Manca
 
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...OpenAIRE
 
EDI Training Module 12: An Introduction to Metadata and Data Repositories
EDI Training Module 12:  An Introduction to Metadata and Data RepositoriesEDI Training Module 12:  An Introduction to Metadata and Data Repositories
EDI Training Module 12: An Introduction to Metadata and Data RepositoriesEnvironmental Data Initiative
 
Role of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksRole of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksOpenAIRE
 
New PID developments
New PID developmentsNew PID developments
New PID developmentsOpenAIRE
 
Record matching over query results from Web Databases
Record matching over query results from Web DatabasesRecord matching over query results from Web Databases
Record matching over query results from Web Databasestusharjadhav2611
 
9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solution9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solutionStatice
 
data warehousing and data mining
data warehousing and data mining data warehousing and data mining
data warehousing and data mining E2MATRIX
 
CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.Vishwas Sankhe
 
Lambda Architecture The Hive
Lambda Architecture The HiveLambda Architecture The Hive
Lambda Architecture The HiveAltan Khendup
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technologyDataminingTools Inc
 
It Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsIt Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsOntotext
 
5 data preparation and processing2
5 data preparation and processing25 data preparation and processing2
5 data preparation and processing2Mahmoud Alfarra
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRamakant Soni
 
ORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaCrossref
 

Mais procurados (20)

From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
 
Data mining
Data miningData mining
Data mining
 
Data cloud lab version v.001.2020
Data cloud lab version v.001.2020Data cloud lab version v.001.2020
Data cloud lab version v.001.2020
 
Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning
 
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
 
EDI Training Module 12: An Introduction to Metadata and Data Repositories
EDI Training Module 12:  An Introduction to Metadata and Data RepositoriesEDI Training Module 12:  An Introduction to Metadata and Data Repositories
EDI Training Module 12: An Introduction to Metadata and Data Repositories
 
Role of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksRole of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly works
 
New PID developments
New PID developmentsNew PID developments
New PID developments
 
Record matching over query results from Web Databases
Record matching over query results from Web DatabasesRecord matching over query results from Web Databases
Record matching over query results from Web Databases
 
9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solution9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solution
 
Data mining
Data miningData mining
Data mining
 
data warehousing and data mining
data warehousing and data mining data warehousing and data mining
data warehousing and data mining
 
CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.
 
The Big Metadata
The Big MetadataThe Big Metadata
The Big Metadata
 
Lambda Architecture The Hive
Lambda Architecture The HiveLambda Architecture The Hive
Lambda Architecture The Hive
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
 
It Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsIt Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got Semantics
 
5 data preparation and processing2
5 data preparation and processing25 data preparation and processing2
5 data preparation and processing2
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data Warehouse
 
ORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE Indonesia
 

Destaque

EMBL Australian Bioinformatics Resource AHM - Data Commons
EMBL Australian Bioinformatics Resource AHM   - Data CommonsEMBL Australian Bioinformatics Resource AHM   - Data Commons
EMBL Australian Bioinformatics Resource AHM - Data CommonsVivien Bonazzi
 
Entrance test for teacher 2013...
Entrance test for teacher 2013...Entrance test for teacher 2013...
Entrance test for teacher 2013...Ashish Sharma
 
From protein interaction networks to human phenotypes
From protein  interaction networks to human phenotypesFrom protein  interaction networks to human phenotypes
From protein interaction networks to human phenotypesbiocs
 
Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...Neil Saunders
 
Leveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and RecommendationsLeveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and RecommendationsNitish Aggarwal
 
Towards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of CognitionTowards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of CognitionAmrapali Zaveri, PhD
 
Towards Social semantic journalism
Towards Social semantic journalismTowards Social semantic journalism
Towards Social semantic journalismBahareh Heravi
 
Linked data in the digital humanities skills workshop for realising the oppo...
Linked data in the digital humanities  skills workshop for realising the oppo...Linked data in the digital humanities  skills workshop for realising the oppo...
Linked data in the digital humanities skills workshop for realising the oppo...jodischneider
 
Beyond Journalism Chicago
Beyond Journalism ChicagoBeyond Journalism Chicago
Beyond Journalism ChicagoMark Deuze
 
Harrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social mediaHarrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social mediadri_ireland
 
Specificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction NetworksSpecificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction Networkspedrobeltrao
 
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...Ronak Shah
 
Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...jodischneider
 
Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...Lars Juhl Jensen
 
PhD viva - 11th November 2015
PhD viva - 11th November 2015PhD viva - 11th November 2015
PhD viva - 11th November 2015Kevin Keraudren
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD VivaAidan Hogan
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...Pradeeban Kathiravelu, Ph.D.
 
2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavisSean Davis
 
Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation Sabrina Kirrane
 
Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013Scribe Software Corp.
 

Destaque (20)

EMBL Australian Bioinformatics Resource AHM - Data Commons
EMBL Australian Bioinformatics Resource AHM   - Data CommonsEMBL Australian Bioinformatics Resource AHM   - Data Commons
EMBL Australian Bioinformatics Resource AHM - Data Commons
 
Entrance test for teacher 2013...
Entrance test for teacher 2013...Entrance test for teacher 2013...
Entrance test for teacher 2013...
 
From protein interaction networks to human phenotypes
From protein  interaction networks to human phenotypesFrom protein  interaction networks to human phenotypes
From protein interaction networks to human phenotypes
 
Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...
 
Leveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and RecommendationsLeveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and Recommendations
 
Towards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of CognitionTowards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of Cognition
 
Towards Social semantic journalism
Towards Social semantic journalismTowards Social semantic journalism
Towards Social semantic journalism
 
Linked data in the digital humanities skills workshop for realising the oppo...
Linked data in the digital humanities  skills workshop for realising the oppo...Linked data in the digital humanities  skills workshop for realising the oppo...
Linked data in the digital humanities skills workshop for realising the oppo...
 
Beyond Journalism Chicago
Beyond Journalism ChicagoBeyond Journalism Chicago
Beyond Journalism Chicago
 
Harrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social mediaHarrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social media
 
Specificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction NetworksSpecificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction Networks
 
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
 
Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...
 
Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...
 
PhD viva - 11th November 2015
PhD viva - 11th November 2015PhD viva - 11th November 2015
PhD viva - 11th November 2015
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD Viva
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
 
2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis
 
Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation
 
Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013
 

Semelhante a Data Café — A Platform For Creating Biomedical Data Lakes

Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesUri Laserson
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
 
Dataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTagsDataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTagsMerce Crosas
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachSoftServe
 
Green Shoots: Research Data Management Pilot at Imperial College London
Green Shoots:Research Data Management Pilot at Imperial College LondonGreen Shoots:Research Data Management Pilot at Imperial College London
Green Shoots: Research Data Management Pilot at Imperial College LondonTorsten Reimer
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
data analytics lecture3.ppt
data analytics lecture3.pptdata analytics lecture3.ppt
data analytics lecture3.pptNamrataBhatt8
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoalarsgeorge
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...AKSHAY BHAGAT
 
Using The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare InnovationUsing The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare InnovationDan Wellisch
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceGlobus
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 

Semelhante a Data Café — A Platform For Creating Biomedical Data Lakes (20)

Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 
Dataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTagsDataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTags
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
Green Shoots: Research Data Management Pilot at Imperial College London
Green Shoots:Research Data Management Pilot at Imperial College LondonGreen Shoots:Research Data Management Pilot at Imperial College London
Green Shoots: Research Data Management Pilot at Imperial College London
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
data analytics lecture3.ppt
data analytics lecture3.pptdata analytics lecture3.ppt
data analytics lecture3.ppt
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
 
Using The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare InnovationUsing The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare Innovation
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Big Data
Big Data Big Data
Big Data
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 

Data Café — A Platform For Creating Biomedical Data Lakes

  • 1. Data Café — A Platform For Creating Biomedical Data Lakes
      Pradeeban Kathiravelu (1,2), Ameen Kazerouni (2), Ashish Sharma (2)
      (1) Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
      (2) Department of Biomedical Informatics, Emory University, Atlanta, USA
      www.sharmalab.info
  • 2. Data Landscape for Precision Medicine
      Data characteristics:
      • Large number of small datasets
      • Structured, semi-structured, unstructured, and ill-formed data
      • Noisy and fuzzy/uncertain
      • Spatial and temporal relationships
      Data management:
      • Variety in storage and messaging protocols
      • No shared interface
  • 3. Illustrative Use Case
      Execute a Radiogenomics workflow on the diffusion images of GBM patients who received a TMZ + experimental regimen with an overall survival of 18 months or more.
      PACS + EMR + AIM + RT + Molecular
  • 4. Motivation
      • Most current solutions require a DBA to initiate the migration of data into a data warehousing environment in order to query and explore all the data at once.
      • Such warehouses are costly to set up.
      • The result is a unified warehouse with access to query and explore the data.
      • Limitations:
          • Limited scalability and extensibility when incorporating new data sources.
          • Requires a priori knowledge of the data models of the different data sources.
  • 5. Biomedical Data Lakes
      • Cohort discovery and creation, assembled per study.
      • Heterogeneous data collected in a loosely structured fashion.
      • Agile and easy to create.
      • Integrate with data exploration/visualization via REST APIs.
      • A problem- or hypothesis-specific virtual data set.
      • Powered by Drill + HDFS, with data sources accessed via APIs.
  • 6. Data Café
      • An agile approach to creating and extending the concept of a star schema to model a problem- or hypothesis-specific dataset, leveraging Apache Drill to query the data easily.
      • Tackles the limitations of the existing approaches.
      • Gives researchers the ability to add new data models and sources.
  • 7. Core Concepts, Step 1. Given a set of data sources, create a graphical representation of the join attributes. This graph represents how data is connected across the various data sources.
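Step 1 can be illustrated with a minimal sketch, assuming each data source simply advertises a set of attribute names. The sources A to E and their attributes below are hypothetical placeholders, not taken from the slides, and the function `build_join_graph` is an illustrative name rather than part of Data Café.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sources and the attributes each one exposes (illustrative only).
SOURCE_ATTRIBUTES = {
    "A": {"id1", "A1"},
    "B": {"id1", "id2", "B2"},
    "C": {"id2", "C1"},
    "D": {"id2", "id3"},
    "E": {"id3", "E1"},
}


def build_join_graph(source_attributes):
    """Return {source: {neighbour: shared_attributes}} for every pair of
    sources that has at least one attribute name in common."""
    graph = defaultdict(dict)
    for left, right in combinations(sorted(source_attributes), 2):
        shared = source_attributes[left] & source_attributes[right]
        if shared:
            graph[left][right] = shared
            graph[right][left] = shared
    return dict(graph)


join_graph = build_join_graph(SOURCE_ATTRIBUTES)

# Print each edge once, with the join attributes that label it.
for source in sorted(join_graph):
    for neighbour in sorted(join_graph[source]):
        if source < neighbour:
            print(source, "--", neighbour, "via", sorted(join_graph[source][neighbour]))
```

Matching attributes by name is the simplest possible rule; a real deployment would more likely rely on explicitly declared join attributes.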
  • 8. Core Concepts, Step 2. Run a set of parallel queries on the data sources that include the attributes that are present in the query graph. In the top figure, our query is of type: {id1: A1 > x and B2 == y}. We run similar queries across C, D and E and retrieve the set of relevant ids (join attributes).
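A rough sketch of Step 2, assuming two in-memory lists stand in for sources A and B and that both are keyed by `id1`. The records, the thresholds `x` and `y`, and the helper `matching_ids` are all hypothetical; real sources would be queried through their own APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for two data sources keyed by id1.
SOURCE_A = [{"id1": 1, "A1": 5}, {"id1": 2, "A1": 12}, {"id1": 3, "A1": 20}]
SOURCE_B = [{"id1": 1, "B2": "y"}, {"id1": 2, "B2": "y"}, {"id1": 3, "B2": "n"}]


def matching_ids(records, join_attribute, predicate):
    """Return the join-attribute values of the records that satisfy predicate."""
    return {record[join_attribute] for record in records if predicate(record)}


x, y = 10, "y"  # illustrative query parameters

# One sub-query per source, mirroring {id1: A1 > x and B2 == y}.
queries = {
    "A": (SOURCE_A, lambda record: record["A1"] > x),
    "B": (SOURCE_B, lambda record: record["B2"] == y),
}

# Submit every sub-query at once so the sources are queried in parallel.
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    futures = {
        name: pool.submit(matching_ids, records, "id1", predicate)
        for name, (records, predicate) in queries.items()
    }
    ids_per_source = {name: future.result() for name, future in futures.items()}

print(ids_per_source)
```

Each sub-query returns only ids, so the payload moved at this stage stays small regardless of how large the underlying records are.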
  • 9. Core Concepts, Step 3. Compute the intersection across the various ids (join attributes). The data of interest can now be obtained using the ids in this intersection. A subsequent query will allow us to stream, in parallel, data from individual sources, given the relevant ids (join attributes).
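Step 3 might look like the following sketch, assuming the per-source id sets have already been retrieved. The id sets, the `RECORDS` store, and the helper names are hypothetical placeholders for real source lookups.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical id sets returned by the per-source queries.
ids_per_source = {"A": {2, 3, 7}, "B": {1, 2, 7}, "C": {2, 7, 9}}

# Only ids that satisfy the sub-query of every source are of interest.
common_ids = set.intersection(*ids_per_source.values())

# Hypothetical record stores, indexed by id, standing in for the real sources.
RECORDS = {
    "A": {2: "a2", 3: "a3", 7: "a7"},
    "B": {1: "b1", 2: "b2", 7: "b7"},
    "C": {2: "c2", 7: "c7", 9: "c9"},
}


def fetch_records(source, ids):
    """Yield (source, id, record) for each requested id, in id order."""
    for record_id in sorted(ids):
        yield source, record_id, RECORDS[source][record_id]


def fetch_all(source):
    """Collect the records of one source for the ids in the intersection."""
    return list(fetch_records(source, common_ids))


# Fetch from every source in parallel; pool.map preserves the input order.
sources = sorted(RECORDS)
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    streamed = dict(zip(sources, pool.map(fetch_all, sources)))

print(sorted(common_ids))
print(streamed["A"])
```

Because the intersection is computed on ids alone, the full records are fetched only for entities that satisfy every part of the query.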
  • 11. Apache Drill
      • Variety: query a range of non-relational data sources.
      • Flexibility.
      • Agility: faster insights.
      • Scalability.
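To make the Drill bullet points concrete, here is a hedged sketch of how one SQL statement could be submitted to Drill's REST endpoint (`/query.json`) and span two storage plugins. The host and port, the plugin names `hive` and `mongo`, and every table and column name are assumptions for illustration; none of them come from the slides. The request is only built, not sent, so the snippet runs without a Drill instance.

```python
import json
from urllib import request

# Assumption: a Drillbit reachable locally on its default web port.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying a SQL query for Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical tables: clinical data in Hive, imaging metadata in MongoDB.
sql = (
    "SELECT c.patient_id, c.overall_survival_months "
    "FROM hive.clinical c "
    "JOIN mongo.imaging.series s ON c.patient_id = s.patient_id "
    "WHERE c.overall_survival_months >= 18"
)

drill_request = build_drill_request(sql)
print(drill_request.get_method(), drill_request.full_url)

# Against a live Drill instance, the request could then be sent with:
#     with request.urlopen(drill_request) as response:
#         result = json.load(response)
```

The point of the sketch is that a single SQL statement can reference tables from different kinds of stores, which is what the "Variety" bullet refers to.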
  • 12. Evaluation Environment
      • Data Café was deployed along with the data sources and Drill in Amazon EC2.
          • MongoDB instantiated in EC2 instances.
          • Hive on Amazon EMR (Elastic MapReduce).
          • EMR HDFS was configured with 3 nodes.
      • Various datasets for evaluation:
          • Two synthetic datasets.
          • Clinical data from the TCGA BRCA collection.
  • 13. Results
      • Quick creation of data lakes without prior knowledge of the data schema.
      • Very fast execution of large queries with Apache Drill.
      • Data Café can be an efficient platform for exploring an integrated data source.
      • The integrated data source construction process may be time consuming. This step is less of a critical path, since it is done less frequently than the data queries from HDFS/Hive using Drill.
  • 14. Conclusion
      • A novel platform for integrating multiple data sources without a priori knowledge of the data models of the sources that are being integrated.
      • Indices are used to do the actual integration, which enables parallelizing the push of the actual data into HDFS.
      • Apache Drill serves as a fast query execution engine that supports SQL.
      • Currently ingesting data from TCGA.
  • 15. Current State and Future Plans
      • Ongoing efforts to evaluate the platform with diverse and heterogeneous data sources.
      • Expanding to a larger multi-node distributed cluster.
      • Integration with DataScope.
      • Multiple data stores and larger data sets.
      • Integration with imaging clients such as caMicroscope, as well as archives such as The Cancer Imaging Archive (TCIA).
  • 16. Acknowledgements
      • Google Summer of Code 2015
      • NCIP/Leidos 14X138, caMicroscope — A Digital Pathology Integrative Query System; Ashish Sharma PI Emory/WUSTL/Stony Brook
      • NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory)
      • The results published here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/
  • 17. For more information including recent updates please visit: www.sharmalab.info ashish.sharma@emory.edu