Vamshi Punugoti & Bryan Lari
MD Anderson Cancer Center
June 2016
HDP @ MD ANDERSON
Starting the Hadoop Journey at a Global Leader
in Cancer Research
Agenda
• About MD Anderson
• Big Data Program
• Our Hadoop Implementation
• Lessons Learned
• Next Steps
• Who we are
– One of the world's largest centers devoted exclusively to cancer care
– Created by the Texas legislature in 1941
– Named one of the nation's top two hospitals for cancer care every
year since the survey began in 1990
• Mission
– MD Anderson’s mission is to eliminate cancer in Texas, the nation and
the world through exceptional programs that integrate patient care,
research and prevention.
About MD Anderson
About MD Anderson cont.
Patient Care
Education
Research
Moon Shots Program
• Launched in 2012 – to make a giant leap for patients
• Accelerating the pace of converting scientific discoveries into
clinical advances that reduce cancer deaths
• Transdisciplinary team-science approach
• Transformative professional platforms
List of Moon Shots
12 Total Moon Shots
• B-cell Lymphoma
• Breast Cancer
• Colorectal Cancer
• Glioblastoma
• HPV-Related Cancers
• Leukemia (CLL, MDS, AML)
• Lung Cancer
• Melanoma
• Multiple Myeloma
• Ovarian Cancer
• Pancreatic Cancer
• Prostate Cancer
http://www.cancermoonshots.org
Volume
Variety
Velocity
Veracity
Gulf of Mexico Analogy
Goals of Big Data Program
• Data driven organization
• All “types” of data
• “Access” for all customers
• Clinicians
• Researchers
• Administrative / Operational
• Enable discovery of “insights”
• Improve patient care
• Increase research discoveries
• Improve operations
• Govern data like an asset
• Provide a platform / environment to enable all these things
To provide the right information to the right people at the right time with the right tools
Goal
data
insight
Insights
Make big data additive and build upon foundation
What are we doing today?
• FIRE Enterprise Data Warehouse
• Natural Language Processing (NLP)
• Data Governance
• Hadoop NoSQL
• Cognitive Computing
• Data Visualization
• Evolving our Platform / Architecture
• Identifying big data use cases
• Training & Skills
• Federated Institutional Reporting Environment
• Centralized data repository supporting analytics,
decision making, and business intelligence
• Central repository for historical and operational data
• Break-down data silos
Enterprise Repository
Source Systems
Dashboards
KPIs
Analytic Reports
Analytics & Reporting
Discoveries
Improve Patient Care
Quality / Perf Improvements
Genomic
FIRE Program
Radiology
Labs
Epic / Clarity
Legacy Systems
• Vast amounts of unstructured data are
stored on MDACC servers.
• Conventional ETL tools are not designed
to mine unstructured data.
• A suite of tools makes up the NLP Pipeline
• Dictionaries were created to support the Epic
go-live (Provider Friendly Terminology)
• Other examples:
• Diagnosis from the pathology reports
• Comorbidities
• Family Cancer History
• Cytogenetics
• Obituary text
• ICD10 Coding
• Structured results feeding Moonshot TRA and OEA
• Etc.
IBM ECM
NLP Engine
Unstructured Data Sources
Post-NLP Database
HDWF (FIRE)
NLP Pipeline - Overview
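As a rough illustration of the dictionary-driven part of such a pipeline, the sketch below maps free-text phrases to normalized concepts. It is only a minimal sketch: the slides do not describe the actual NLP engine, and the dictionary entries, the function name `extract_concepts` and the sample note are all hypothetical.

```python
import re

# Hypothetical dictionary mapping surface phrases to a normalized concept.
# These entries are illustrative only; they are not taken from the slides.
TERM_DICTIONARY = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
    "sugar diabetes": "diabetes mellitus",
}


def extract_concepts(text, dictionary=TERM_DICTIONARY):
    """Return the sorted normalized concepts whose phrases occur in text."""
    found = set()
    lowered = text.lower()
    for phrase, concept in dictionary.items():
        # Word boundaries avoid matching a phrase inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(concept)
    return sorted(found)


# Made-up sample note, not real patient data.
note = "Patient reports high blood pressure; father had a heart attack at 60."
print(extract_concepts(note))
```

A production pipeline would also need negation handling, context detection and section awareness, none of which this sketch attempts.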
Enterprise
Business
Clinical Big Data
PeopleSoft
Systems of Record
Systems of Reporting
Systems of Insights
Kronos
Point of Sale
Volunteer Services
Rotary House
MyHR
UTPD
Facilities
Clinic Station
Epic
Lab
GE IDX
Cerner
CARE
EPM
Hyperion
Oracle Business Intelligence
Smart View
Web Analytics
FIRE
EIW
Business Objects
Crystal
Hyperion Interactive Reporting
Facebook
Twitter
UPS
Centers for Disease Control and Prevention
The Weather Channel
LinkedIn
YouTube
oracle.com
Yelp!
Reuters
Google
U.S. Census
Medical Devices
Medical Equipment
Building Controls
Campus Video
Real-time Location Service
Wayfinding
Data Visualization
Ad Hoc
Cognitive Computing
Big Data for Analytics & Cognitive Computing
Presentation
Cohort Explorer
Parking Garages
Pharmacy
Research
LCDR
Melcore
Gemini
IPCT
Data Governance
Data Stewardship
Data Portal
Data Profiling and Quality
Data Standardization
Compliance
Metadata and Business Glossary
Master Data Management
Data Repository
Dashboards
KPIs
Analytic Reports
Analytics & Informatics
Discoveries
Improve Patient Care
Quality / Perf Improvements
Data Mgt & Operations
Data Lake
Data Discovery
Profiling
Standards / Quality
Big Data (Structured and NoSQL)
Insight Apps
Genomic
Radiology
Labs
Epic / Clarity
Legacy Systems
Big Data – High Level
Big Data Technical Architecture
Our Hadoop Implementation
Our Hadoop Implementation cont.
Our Hadoop Implementation cont.
Average number of messages per day: 1,556,688
Estimated amount of storage increase per day: 5.7 GB
Number of channels currently being used: 24
Estimated daily message processing capacity: 4,320,000
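The four figures above imply a few derived rates that are worth working out. The sketch below does the arithmetic in Python; it assumes decimal gigabytes (1 GB = 10^9 bytes) and, for the per-channel figure, an even split of capacity across channels. Neither assumption is stated on the slide, and the storage figure may include overhead beyond the raw message payload.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
avg_messages_per_day = 1_556_688
storage_growth_bytes_per_day = 5.7e9  # assumption: decimal GB
channels_in_use = 24
daily_capacity = 4_320_000


def derived_metrics():
    """Compute rates implied by the slide's figures."""
    return {
        "avg_msgs_per_sec": avg_messages_per_day / SECONDS_PER_DAY,
        "capacity_msgs_per_sec": daily_capacity / SECONDS_PER_DAY,
        "utilization": avg_messages_per_day / daily_capacity,
        "avg_bytes_per_msg": storage_growth_bytes_per_day / avg_messages_per_day,
        # Assumption: capacity is spread evenly over the channels.
        "capacity_per_channel_per_day": daily_capacity / channels_in_use,
    }


for name, value in derived_metrics().items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions, the stated capacity corresponds to 50 messages per second, the average load is about 18 messages per second (roughly 36% of capacity), and each message adds on the order of 3.7 KB of storage.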
Our Hadoop Implementation cont.
Medical Device Data Flow
Data Source
Data Capture
MDA Big Data
Data Lake
Access Portals (Analytics/Visualization)
Integration HUB
Data Ingestion
Processing
Channels
HBase
Data Loader
Capsule
Capsule DB
Medical Device
End-Users
FIRE/Big Data
Cloverleaf Engine
Epic
TCP-based Data Listener - Flume
HIVE
PIG
HUNK
Sqoop
Validated HL7 with Patient ID (from Epic)
HL7
Raw HL7 (from Capsule)
Cleanse & Transform
Raw HL7
Validated HL7
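The distinction between raw HL7 and validated HL7 with a patient ID can be sketched in a few lines of Python. This is not the Cloverleaf or Flume logic used at MD Anderson, which the slides do not show. It is a minimal sketch that assumes pipe-delimited HL7 v2 messages with the patient identifier in PID-3, and the sample messages and the identifier `MRN12345` are made up.

```python
def parse_hl7(message):
    """Split a pipe-delimited HL7 v2 message into {segment_id: [field lists]}."""
    segments = {}
    # HL7 v2 separates segments with carriage returns; tolerate newlines too.
    for line in message.replace("\n", "\r").split("\r"):
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def patient_id(message):
    """Return the first component of PID-3, or None when it is absent."""
    pid_segments = parse_hl7(message).get("PID", [])
    if not pid_segments:
        return None
    fields = pid_segments[0]
    # For non-MSH segments, field N sits at list index N.
    if len(fields) <= 3 or not fields[3]:
        return None
    return fields[3].split("^")[0] or None


def route(message):
    """Label a message 'validated' if it carries a patient ID, else 'raw'."""
    return "validated" if patient_id(message) else "raw"


# Hypothetical sample messages; not real device or patient data.
HEADER = "MSH|^~\\&|CAPSULE|ICU|HADOOP|MDA|20160601120000||ORU^R01|0001|P|2.3"
OBX = "OBX|1|NM|HR^Heart Rate||72|bpm"
RAW = "\r".join([HEADER, "PID|1|||", OBX])
VALID = "\r".join([HEADER, "PID|1||MRN12345^^^EPIC||DOE^JANE", OBX])

print(route(RAW), route(VALID), patient_id(VALID))
```

Real HL7 handling also has to respect the delimiters declared in the MSH segment, escape sequences and repeating fields, so a dedicated HL7 library is usually a better choice than hand-rolled splitting.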
Our Hadoop Implementation cont.
Developer Workstation/Sandbox
SVN (source control server)
Bamboo (build server)
HDP Dev Cluster
HDP QA Cluster
HDP Prod Cluster
Daily Checkin/Checkout
Development Cycle
On Dev Lead Approval: Build, Unit Test, Deploy & Tag
On Successful UAT & Release Approval: Deploy Per Last Successful Build Tag
Smoke Test Before Updating Task Status
Periodic Integration & Validation: Build, Unit Test & Notify On Error
Development Cycle
Deployment Cycle
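The approval gates in this build and deployment flow can be modelled as a small decision function. The sketch below is a hypothetical illustration in Python, not a Bamboo plan: the `Build` fields and the gate names are invented for the example, and the slide does not specify which cluster each gate targets.

```python
from dataclasses import dataclass


@dataclass
class Build:
    """Hypothetical record of a build's status; field names are assumptions."""
    tag: str
    unit_tests_passed: bool
    dev_lead_approved: bool = False
    uat_passed: bool = False
    release_approved: bool = False


def open_gates(build):
    """Return the pipeline steps this build is currently allowed to take."""
    # Periodic integration: a failing build only triggers a notification.
    if not build.unit_tests_passed:
        return ["notify_on_error"]
    gates = []
    # Dev lead approval unlocks the build, unit test, deploy and tag step.
    if build.dev_lead_approved:
        gates.append("deploy_and_tag")
        # Release needs both a successful UAT and release approval.
        if build.uat_passed and build.release_approved:
            gates.append("release_deploy")
    return gates


print(open_gates(Build("r1", unit_tests_passed=True, dev_lead_approved=True)))
```

Keeping the release gate nested under the tagging gate reflects the slide's rule that production deployments use the last successfully built tag.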
process
1. It’s complex
2. It’s a journey
3. Leverage existing strengths
4. Collaborate openly
5. Learn from experts
6. One cluster – multiple use cases
7. Follow best practices
Lessons Learned – what went well
people
1. Continue to expand/evolve our platform
2. Ingest more data and data types
3. Identify high value use cases
4. Develop/Train people with new skills
Next Steps
Train People with new Skills
Accessing data
Computing data
Visualizing data
Insights & Cognitive Computing