A podium abstract presented at the AMIA 2016 Joint Summits on Translational Science, discussing Data Café, a platform for creating biomedical data lakes.
Data Café — A Platform For Creating Biomedical Data Lakes
Pradeeban Kathiravelu1,2, Ameen Kazerouni2, Ashish Sharma2
1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA
www.sharmalab.info
Data Landscape for Precision Medicine
DATA CHARACTERISTICS
• Large number of small datasets
• Structured, semi-structured, unstructured, and ill-formed
• Noisy and fuzzy/uncertain
• Spatial and temporal relationships

DATA MANAGEMENT
• Variety in storage and messaging protocols
• No shared interface
Illustrative Use Case
Execute a radiogenomics workflow on the diffusion images of GBM patients who received a TMZ + experimental regimen with an overall survival of 18 months or more.
Data sources: PACS + EMR + AIM + RT + Molecular
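As an illustration only (not Data Café's actual API), such a cohort request can be thought of as a set of per-source filter predicates; all source and attribute names below are hypothetical:

```python
# Hypothetical decomposition of the cohort request into per-source
# predicates; source and attribute names are illustrative only.
cohort_query = {
    "EMR": lambda r: (r["diagnosis"] == "GBM"
                      and r["regimen"] == "TMZ + experimental"
                      and r["overall_survival_months"] >= 18),
    "PACS": lambda r: r["modality"] == "MR diffusion",
}
```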
Motivation
• Most current solutions require a DBA to initiate the migration of data into a data warehousing environment
  • to query and explore all the data at once.
  • Setting up such warehouses is costly.
• A unified warehouse provides access to query and explore the data.
• Limitations:
  • Scalability and extensibility to incorporate new data sources.
  • A priori knowledge of the data models of the different data sources is required.
Biomedical Data Lakes
• Cohort discovery and creation — assembled per study.
• Heterogeneous data collected in a loosely structured fashion.
• Agile and easy to create.
• Integrates with data exploration/visualization via REST APIs.
• A problem- or hypothesis-specific virtual data set.
• Powered by Apache Drill + HDFS; data sources accessed via APIs.
Data Café
• An agile approach to creating and extending the concept of a star schema
  • to model a problem/hypothesis-specific dataset,
  • leveraging Apache Drill to easily query the data.
• Tackles the limitations of the existing approaches.
• Provides researchers the ability to add new data models and sources.
Core Concepts
Step 1. Given a set of data sources, create a graphical representation of the join attributes. This graph represents how data is connected across the various data sources.
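A minimal sketch of this step, assuming each source can report its attribute names; the graph is kept as a plain adjacency map keyed by source pairs, and all names are illustrative:

```python
# Build a join-attribute graph: nodes are data sources, and an edge
# between two sources carries the attributes they share.
from itertools import combinations

# Each source maps to the set of attributes it exposes (illustrative).
sources = {
    "PACS":      {"patient_id", "study_id", "series_id"},
    "EMR":       {"patient_id", "encounter_id"},
    "Molecular": {"patient_id", "sample_id"},
}

join_graph = {}
for (a, attrs_a), (b, attrs_b) in combinations(sources.items(), 2):
    shared = attrs_a & attrs_b          # attributes usable for a join
    if shared:
        join_graph[(a, b)] = shared

print(join_graph)
# {('PACS', 'EMR'): {'patient_id'},
#  ('PACS', 'Molecular'): {'patient_id'},
#  ('EMR', 'Molecular'): {'patient_id'}}
```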
Core Concepts
Step 2. Run a set of parallel queries on the data sources that include the attributes present in the query graph. In the top figure, our query is of the type {id1: A1 > x and B2 == y}. We run similar queries across C, D, and E and retrieve the set of relevant IDs (join attributes).
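A sketch of this step under the same assumptions, with toy in-memory rows standing in for real per-source queries (e.g., against MongoDB or Hive):

```python
# Run per-source queries in parallel and collect the matching join
# attribute values (IDs). The in-memory rows are stand-ins for real
# data sources.
from concurrent.futures import ThreadPoolExecutor

def query_source(predicate, rows):
    """Return the set of IDs of rows satisfying the predicate."""
    return {row["id1"] for row in rows if predicate(row)}

data = {
    "A": [{"id1": 1, "A1": 5}, {"id1": 2, "A1": 1}],
    "B": [{"id1": 1, "B2": "y"}, {"id1": 3, "B2": "y"}],
}
predicates = {
    "A": lambda r: r["A1"] > 3,       # A1 > x, with x = 3
    "B": lambda r: r["B2"] == "y",    # B2 == y
}

with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(query_source, predicates[name], rows)
               for name, rows in data.items()}
    id_sets = {name: f.result() for name, f in futures.items()}

print(id_sets)  # {'A': {1}, 'B': {1, 3}}
```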
Core Concepts
Step 3. Compute the intersection across the various IDs (join attributes). The data of interest can now be obtained using the IDs in this intersection. A subsequent query allows us to stream, in parallel, data from the individual sources, given the relevant IDs (join attributes).
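Continuing the same toy setup, a sketch of this step: intersect the per-source ID sets, then pull the matching records from each source in parallel (standing in for the parallel streaming of data into HDFS):

```python
# Intersect per-source ID sets, then fetch matching records from each
# source in parallel.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

id_sets = {"A": {1, 2}, "B": {1, 3}, "C": {1}}

# IDs that appear in every source's result set.
common_ids = reduce(set.intersection, id_sets.values())

def fetch(source_name, ids):
    # Stand-in for streaming the rows for `ids` from a real source;
    # here it just returns placeholder tuples.
    return [(source_name, i) for i in sorted(ids)]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda name: fetch(name, common_ids), id_sets))

print(common_ids, results)  # {1} [[('A', 1)], [('B', 1)], [('C', 1)]]
```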
Apache Drill
• Variety – Query a range of non-relational data sources.
• Flexibility.
• Agility – Faster Insights.
• Scalability.
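For context, Drill exposes a REST endpoint for submitting SQL queries. A minimal sketch, assuming a Drill instance at localhost:8047 and a hypothetical HDFS path (not the deployment described in these slides):

```python
# Submit a SQL query to Apache Drill over its REST API. The storage
# plugin name (`hdfs`) and path are hypothetical.
import json
import urllib.request

payload = json.dumps({
    "queryType": "SQL",
    "query": "SELECT patient_id FROM hdfs.`/datacafe/lake/` LIMIT 10",
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8047/query.json",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result.get("rows"))  # the result rows, if the query succeeded
```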
Evaluation Environment
• Data Café was deployed along with the data sources and Drill in Amazon EC2.
  • MongoDB instantiated in EC2 instances.
  • Hive on Amazon EMR (Elastic MapReduce).
  • EMR HDFS was configured with 3 nodes.
• Various datasets were used for evaluation:
  • Two synthetic datasets.
  • Clinical data from the TCGA BRCA collection.
Results
• Quick creation of data lakes, without prior knowledge of the data schema.
• Very fast execution of large queries with Apache Drill.
• Data Café can be an efficient platform for exploring an integrated data source.
• Constructing the integrated data source may be time consuming, but:
  • it lies on a less critical path, and
  • it is done less frequently than the data queries from HDFS/Hive using Drill.
Conclusion
• A novel platform for integrating multiple data sources, without a priori knowledge of the data models of the sources being integrated.
• Indices perform the actual integration, enabling a parallelized push of the actual data into HDFS.
• Apache Drill serves as a fast query execution engine that supports SQL.
• Currently ingesting data from TCGA.
Current State and Future Plans
• Ongoing efforts to evaluate the platform with diverse and heterogeneous data sources.
• Expanding to a larger multi-node distributed cluster.
• Integration with DataScope.
• Multiple data stores and larger data sets.
• Integration with imaging clients such as caMicroscope, as well as archives such as The Cancer Imaging Archive (TCIA).
Acknowledgements
Google Summer of Code 2015.

NCIP/Leidos 14X138, caMicroscope — A Digital Pathology Integrative Query System; Ashish Sharma, PI.

Emory/WUSTL/Stony Brook NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy; Fred Prior, Ashish Sharma (UAMS, Emory).

The results published here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/