Introduction to Spark
Introductions
Garrett Young (sgyoung@us.ibm.com)
1) Introduction to Spark (10 mins)
2) IBM's Commitment to Spark (5 mins)
3) How Predictive Analytic Lifecycles Typically Work (10 mins)
4) Using Spark to Predict Hospital Readmissions (15 mins)
5) How you can get a free-trial Spark environment from IBM (5 mins)
6) Q&A (15 mins)
What is Spark?
• In-memory data processing engine
• Open Source Apache Project
• Cluster Computing Framework
• Supports Scala, Java, Python, and R
• Horizontally/Vertically Scalable
• Not a data store
IBM | SPARK – The Analytics Operating System
“Enabling New Classes of Intelligent Applications Embedded with Analytics”
• Spark unifies data, enabling real-time insights
• Spark processes and analyzes data from any data source
• Spark is complementary to Hadoop, but faster with in-memory performance
• Build models quickly. Iterate faster. Apply intelligence.
How Spark Works
• Traditional Approach: MapReduce jobs for complex jobs, interactive queries, and online event-hub processing involve lots of (slow) disk I/O
[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU, Memory) → HDFS Write → Result]
How Spark Works
• Solution: Keep more data in-memory with a new distributed execution engine
• Memory is faster than network & disk
• Zero read/write disk bottleneck
• Chain job output into new job input
[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → Iteration 2 (CPU, Memory)]
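The difference between the two approaches can be sketched in plain Python. This is only a toy illustration, not Spark code: the function names, the JSON files, and the two arithmetic steps are all assumptions made up for the example. It counts how often each style touches the disk when two iterations are chained.

```python
import json
import os
import tempfile


def run_disk_chained(input_path, steps, workdir):
    """MapReduce-style chaining: each iteration reads its input from disk
    and writes its output back to disk before the next one starts."""
    disk_ops = 0
    path = input_path
    values = []
    for i, step in enumerate(steps, start=1):
        with open(path) as f:           # disk read for this iteration
            values = json.load(f)
        disk_ops += 1
        values = [step(v) for v in values]
        path = os.path.join(workdir, f"iteration_{i}.json")
        with open(path, "w") as f:      # disk write for this iteration
            json.dump(values, f)
        disk_ops += 1
    return values, disk_ops


def run_in_memory(input_path, steps):
    """In-memory chaining: read the input once, then pass each iteration's
    output straight to the next iteration without touching the disk."""
    with open(input_path) as f:
        values = json.load(f)
    disk_ops = 1
    for step in steps:
        values = [step(v) for v in values]
    return values, disk_ops


def demo():
    # Two hypothetical iterations: double each value, then add one.
    steps = [lambda v: v * 2, lambda v: v + 1]
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "input.json")
        with open(input_path, "w") as f:
            json.dump([1, 2, 3], f)
        disk_result, disk_ops = run_disk_chained(input_path, steps, workdir)
        mem_result, mem_ops = run_in_memory(input_path, steps)
    return disk_result, disk_ops, mem_result, mem_ops


if __name__ == "__main__":
    print(demo())
```

Both versions produce the same values; only the number of disk operations differs, and that gap grows with every extra iteration.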
General Spark Architecture Overview
• Driver uses Spark Context to talk to the Cluster Manager
• Executors run their own JVM processes
• Cluster Manager distributes the workload based on information from the Workers
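The division of labour above can be mimicked with Python's standard library. This is an analogy under stated assumptions, not how Spark is implemented: threads stand in for executors (which in Spark are separate JVM processes), the thread pool stands in for the cluster manager, and `driver`, `executor_task`, and `split_into_partitions` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records out into roughly equal partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def executor_task(partition):
    """Work done independently on one partition: sum of squares."""
    return sum(x * x for x in partition)


def driver(records, num_executors=3):
    """The driver splits the job, hands partitions to the pool
    (standing in for the cluster manager), and combines partial results."""
    partitions = split_into_partitions(records, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_sums = list(pool.map(executor_task, partitions))
    return sum(partial_sums)


if __name__ == "__main__":
    print(driver(list(range(1, 11))))
```

The driver never processes a record itself; it only plans the work and merges what comes back, which is the role the slide assigns to it.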
Key Reasons for the Interest in Spark
Performant
• In-memory architecture greatly reduces disk I/O
• Anywhere from 20-100x faster for common tasks
Productive
• Concise and expressive syntax, especially compared to prior approaches
• Single programming model across a range of use cases and steps in data lifecycle
• Integrated with common programming languages – Java, Python, Scala, R
• New tools continually reduce skill barrier for access (e.g. SQL for analysts)
Leverages existing investments
• Works well within existing Hadoop ecosystem
Improves with age
• Large and growing community of contributors continuously improve full analytics stack and extend capabilities
What is SparkML?
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy.
At a high level, it provides tools such as:
1. ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
2. Featurization: feature extraction, transformation, dimensionality reduction, and selection
3. Persistence: saving and loading algorithms, models, and Pipelines
4. Utilities: linear algebra, statistics, data handling, etc.
What is scikit-learn?
• Used for Data Mining and Data Analysis
• Open Source
• Various classification, regression and clustering algorithms
Watson Machine Learning
• Uses both Spark ML and Scikit-Learn plus others
• Built on SPSS platform
• Can pull from many different data sources
• Integrates with DSX (Beta)
IBM DSX Machine Learning
Data Access:
• Easily connect to Behind-the-Firewall and Public Cloud Data
• Catalogued and Governed Controls through Watson Data Platform
Creating Models:
• Single UI and API for creating ML Models on various Runtimes
• Auto-Modelling and Hyperparameter Optimization
Web Service:
• Real-time, Streaming, and Batch Deployment
• Continuous Monitoring and Feedback Loop
Intelligent Apps:
• Integrate ML models with apps, websites, etc.
• Continuously Improve and Adapt with Self-Learning
[Diagram labels: Web Service, IMS]
IBM Machine Learning in Data Science Experience
[Screenshots: API for Jupyter Notebooks; Wizard GUI]
IBM Machine Learning is provisioned by default in Data Science Experience
• Enables Data Scientists to deploy machine learning models as web services
• Single UI for creating, collaborating, deploying, monitoring, and feedback
• Accessible via API, Wizard GUI, and Canvas
IBM's Commitment to Spark
Spark Tech Center (STC): IBM’s Commitment to Spark
[Bar chart: Top 7 Contributing Companies to Spark 2.0.0 (Databricks, IBM, Hortonworks, Cloudera, Intel, IVU Traffic Technologies, Tencent); axis from 0 to 1000]
• 25,600 Spark LOC
• 606 Spark JIRAs
• 253 SystemML JIRAs
• 64 Speakers at events
… and all that with 1 Team in 1.5 Years
[Pie chart legend: Databricks, Hortonworks, Cloudera, Intel, Tencent, NTT, Other]
IBM Spark Technology Center – San Francisco, CA
As of March 10, 2016
See what we’re up to …
IBM Spark Technology Center
http://www.spark.tc/blog/
Fixing lots of issues reported by others
Using Spark to Predict Hospital Readmissions &
How Predictive Analytic Lifecycles Typically Work
Reducing Hospital Readmissions with Predictive Analytics
An Example ‘Proof of Concept’ Using Open Data
Outline
Problem
Solution
Details
Results
Summary
Problem
Problem: 30-Day Hospital Readmissions Cost $41B Annually
Source: http://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.pdf
Medicare HRRP – Penalties to Hospitals
Source: Kaiser Family Foundation
http://kff.org/medicare/issue-brief/aiming-for-fewer-hospital-u-turns-the-medicare-hospital-readmission-reduction-program/
Solution
Get Data: Diabetes Readmissions Dataset
• University of California Irvine – Machine Learning Repository
• Open Data
• 130 Hospitals, 1999-2008
• 101,766 rows, 50 columns of data
• Diabetes Readmissions
• Top ten for Medicaid, Private Insurance and Uninsured
• Not in top ten for Medicare
https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
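A first step with a dataset like this is to turn the outcome column into a yes/no label for 30-day readmission. The sketch below assumes a `readmitted` column with the values `<30`, `>30`, and `NO`, modeled on the public dataset description; the four rows are made up, and the deck does not show how its own label was derived.

```python
import csv
import io

# Made-up sample rows; only the column layout is modeled on the public dataset.
SAMPLE_CSV = """encounter_id,time_in_hospital,num_medications,readmitted
1,3,12,NO
2,7,20,<30
3,2,8,>30
4,5,15,<30
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def add_30_day_label(rows):
    """Label a row 1 if the patient was readmitted within 30 days, else 0.
    Assumption: '<30' marks a 30-day readmission; '>30' and 'NO' do not."""
    for row in rows:
        row["label"] = 1 if row["readmitted"] == "<30" else 0
    return rows


rows = add_30_day_label(load_rows(SAMPLE_CSV))
readmission_rate = sum(row["label"] for row in rows) / len(rows)

if __name__ == "__main__":
    print(readmission_rate)
```

With the real file, the same two functions would apply once the CSV text is read from disk instead of from the inline sample.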
Build a Predictive Model: Conceptual View
• Step 1: Model Development: Historical Data → Machine Learning (Mathematical Algorithm) → Model
• Step 2: Perform Predictions: New Case → Model → Prediction
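The two steps can be sketched with a deliberately simple algorithm. A nearest-centroid classifier is used here only because it fits in a few lines; it is not the algorithm used in the deck, and the feature values and labels are invented for illustration.

```python
def train(rows, labels):
    """Step 1 (model development): learn one centroid, the mean feature
    vector, per class from historical data."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, row):
    """Step 2 (prediction): assign a new case to the class whose
    centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical historical data: two numeric features per case, label 1 = readmitted.
historical = [[1, 2], [2, 3], [8, 9], [9, 10]]
outcomes = [0, 0, 1, 1]
model = train(historical, outcomes)

if __name__ == "__main__":
    print(predict(model, [2, 2]), predict(model, [9, 9]))
```

The model is just the stored centroids: once trained, it can score new cases without revisiting the historical data, which is the separation the conceptual view draws.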
IBM Bluemix
• Bluemix
• Infrastructure, Watson, software and services on Bluemix Cloud Platform
• Services such as BigInsights (Hadoop), Data Connect (ETL), and Spark can be provisioned almost instantly
Data Science Experience (DSX)
• Data Science Experience (DSX)
• Easily execute Scala, Python, and R notebooks
• Share notebooks with your data science team
Bluemix Services Architecture in the Cloud
[Diagram components: BigInsights HDFS (Hadoop), Data Connect, DashDB, Data Science Experience, Cloudant, Node.js Web Form]
[Diagram flows: Training Data, Convert to CSV, New Records, Predictions]
Details
A Look at The Raw Data
Data Science Experience – Python Code
Results
First Pass Results – Are they any good?
• AUC Score: 0.6514
• AUC = Area Under the Curve
• 0.50 = Random Guessing
• 1.00 = Perfect Prediction
2nd Pass Results – Are they any good?
• AUC Score: 0.6750
• AUC = Area Under the Curve
• 0.50 = Random Guessing
• 1.00 = Perfect Prediction
How Do Other Readmission Models Perform?
“A comparison of models for predicting early hospital readmissions”
Journal of Biomedical Informatics, Volume 56, August 2015, Pages 229–238
Source: http://www.sciencedirect.com/science/article/pii/S1532046415000969
Which Factors Affect Diabetes Readmission?
Data: Feature Importance from Random Forest Algorithm
The Algorithm can tell us which features (columns) it found important during the training process.
22 columns from original 50
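The idea of ranking columns by importance can be illustrated without a random forest. The sketch below swaps in a different technique, permutation importance: shuffle one column at a time and measure how much accuracy drops. This is not the importance measure a random forest reports from training, and the toy model and data are assumptions made up for the example.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Average accuracy drop per column when that column's values are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical model that only looks at the first column."""
    return 1 if row[0] >= 5 else 0


rows = [[1, 7], [2, 3], [8, 5], [9, 1]]
labels = [0, 0, 1, 1]
importances = permutation_importance(toy_model, rows, labels)

if __name__ == "__main__":
    print(importances)
```

Because the toy model ignores the second column, shuffling it never changes a prediction and its importance is exactly zero; columns like that are candidates to drop.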
Summary
• Readmissions Prediction is an important area of research for using Predictive Analytics in Healthcare
• Patient: Improved Outcome
• Hospital Providers: Avoid Penalties
• Payers: Reduce Costs
• In a short amount of time we were able to develop results comparable to leading research studies
How you can get a free-trial Spark Cluster from IBM
Sign Up for Free Account
Data Science Experience with IBM ML
https://ibm.box.com/s/y2zvpzk8pje56lto0oja0372tnbydbom
http://datascience.ibm.com/
Notebook Samples

IBM Strategy for Spark
