Big Data for the rest of us
Lawrence Spracklen
SupportLogic
lawrence@supportlogic.io
www.linkedin.com/in/spracklen
SupportLogic
• Extract Signals from enterprise CRM systems
• Applied machine learning
• Complete vertical solution
• Go-live in days!
• We are hiring!
@Scale 2018
• Sound like your Big Data problems?
• This is Extreme data!
• Do these solutions help or hinder Big Data for the rest of us?
“Exabytes of data…..”
“1500 manual labelers…..”
“Sub second global propagation of likes…..”
End-2-End Planning
• Numerous steps/obstacles to successfully leveraging ML
• Data Acquisition
• Data Cleansing
• Feature Engineering
• Model Selection and Training
• Model Optimization
• Model Deployment
• Model Feedback and Retraining
• Important to consider all steps before deciding on an approach
• Upstream decisions can severely limit downstream options
ML Landscape
• How do I build a successful production-grade solution from all these
disparate components that don’t play well together?
Data Set Availability
• Is the necessary data available?
• Are there HIPAA, PII, GDPR concerns?
• Is it spread across multiple systems?
• Can the systems communicate?
• Data fusion
• Move the compute to the data…
• Legacy infrastructure decisions can dictate optimal approach
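A minimal sketch of the data-fusion step, assuming two hypothetical systems (a CRM and a ticketing system) that share a customer email address. The field names, the secret key, and the `pseudonymize` and `fuse` helpers are illustrative assumptions, not part of the talk. The idea is that the join key is replaced by a keyed hash before records are combined, so the raw identifier does not travel with the fused data set.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a secrets manager.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str) -> str:
    """Return a keyed SHA-256 digest of a normalized identifier."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


def fuse(crm_rows, ticket_rows):
    """Inner-join two record lists on a pseudonymized email key."""
    tickets_by_key = {}
    for row in ticket_rows:
        tickets_by_key.setdefault(pseudonymize(row["email"]), []).append(row)

    fused = []
    for row in crm_rows:
        key = pseudonymize(row["email"])
        for ticket in tickets_by_key.get(key, []):
            fused.append({
                "customer_key": key,  # pseudonym, not the raw email
                "plan": row["plan"],
                "ticket_priority": ticket["priority"],
            })
    return fused


crm = [{"email": "Ann@example.com", "plan": "gold"},
       {"email": "bob@example.com", "plan": "free"}]
tickets = [{"email": "ann@example.com ", "priority": "high"},
           {"email": "carol@example.com", "priority": "low"}]

fused_rows = fuse(crm, tickets)
```

Only the customer present in both systems survives the join, and the fused record carries the pseudonym instead of the email.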
Feature Engineering
• Essential for model performance, efficacy, robustness and simplicity
• Feature extraction
• Feature selection
• Feature construction
• Feature elimination
• Dimensionality reduction
• Traditionally a laborious manual process
• Automation techniques becoming available
• e.g. TransmogrifAI, Featuretools
• Leverage feature stores!
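Two of the steps listed above, feature construction and feature elimination, can be sketched in a few lines of plain Python. The column names and the zero-variance rule are assumptions chosen for illustration; automated tools such as those named above do far more than this.

```python
from statistics import pvariance

# Toy rows; column names are hypothetical.
rows = [
    {"opens": 10, "replies": 2, "region_code": 1},
    {"opens": 20, "replies": 5, "region_code": 1},
    {"opens": 30, "replies": 3, "region_code": 1},
]


def construct(row):
    """Feature construction: add a ratio derived from two raw columns."""
    out = dict(row)
    out["reply_rate"] = row["replies"] / row["opens"] if row["opens"] else 0.0
    return out


def eliminate_constant(rows, threshold=0.0):
    """Feature elimination: drop columns whose variance is <= threshold."""
    keep = [name for name in rows[0]
            if pvariance([r[name] for r in rows]) > threshold]
    return [{name: r[name] for name in keep} for r in rows]


features = eliminate_constant([construct(r) for r in rows])
```

Here `region_code` is constant across all rows, so it carries no signal and is dropped, while the constructed `reply_rate` is kept.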
Model Training
• Big differences in the range of algorithms offered by different
frameworks
• Don’t just jump to the most complex!
• Easy to automate selection process
• Just click ‘go’
• Automate hyperparameter optimization
• Beyond the nested for-loop!
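One simple step beyond the nested for-loop is random search: sample the hyperparameter space instead of walking a fixed grid. The sketch below assumes a hypothetical `validation_score` function standing in for "train a model and return its validation score"; the parameter names and ranges are illustrative only.

```python
import random


def validation_score(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it.

    Peaks at learning_rate=0.1, depth=6; a real objective would train a model.
    """
    return -((learning_rate - 0.1) ** 2) * 100 - ((depth - 6) ** 2) * 0.01


def random_search(n_trials, seed=0):
    """Sample the search space at random and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            # Log-uniform sampling suits scale-type parameters.
            "learning_rate": 10 ** rng.uniform(-3, 0),
            "depth": rng.randint(2, 12),
        }
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=50)
```

The trial budget is set directly by `n_trials` rather than by the product of the grid sizes, which is what makes the approach easy to automate.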
Model Ops
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
• Operationalizing PowerPoint?
• Hand rolled scoring flows?
Barriers to Model Ops
• Scoring often performed on a different data platform to training
• Framework specific persistence formats
• Complex data preprocessing requirements
• Data cleansing and feature engineering
• Batch training versus RT/stream scoring
• How frequently are models updated?
• How is performance monitored?
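The last two questions can be made concrete with a small monitor that tracks accuracy over the most recent labeled predictions and flags when retraining may be due. This is a minimal sketch; the class name, window size, and alert threshold are assumptions, and a production system would also watch input drift and latency.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=100, alert_below=0.8):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.alert_below = alert_below

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    @property
    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag the model only once the window is full and accuracy is low."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.alert_below


monitor = RollingAccuracyMonitor(window=4, alert_below=0.75)
for predicted, actual in [("a", "a"), ("b", "a"), ("a", "a"), ("b", "b")]:
    monitor.record(predicted, actual)
healthy = monitor.needs_retraining()   # window accuracy is 3/4

monitor.record("a", "b")               # oldest outcome drops out of the window
degraded = monitor.needs_retraining()  # window accuracy is now 2/4
```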
Typical Deployments
PMML & PFA
• PMML has long been available as a framework-agnostic model
representation
• Frequently requires helper scripts
• PFA is the potential successor….
• Addresses lots of PMML’s shortcomings
• Scoring engines accepting R or Python scripts
• Easy to use AWS Lambda!
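The idea behind a framework-agnostic model representation can be sketched as follows: the preprocessing parameters and the model coefficients are stored together in one document, and a small scoring function reads that document without needing the training framework. The JSON schema below is a hypothetical, hand-written format for illustration; it is not PMML or PFA syntax, and the feature names and numbers are invented.

```python
import json
import math

# Hypothetical, hand-written schema: NOT PMML or PFA syntax.
model_document = json.dumps({
    "preprocessing": {
        "tickets_open": {"mean": 4.0, "std": 2.0},
        "days_since_contact": {"mean": 10.0, "std": 5.0},
    },
    "model": {
        "type": "logistic_regression",
        "intercept": -0.5,
        "coefficients": {"tickets_open": 1.2, "days_since_contact": 0.8},
    },
})


def score(document: str, record: dict) -> float:
    """Standardize the inputs, then apply the logistic model from the document."""
    spec = json.loads(document)
    if spec["model"]["type"] != "logistic_regression":
        raise ValueError("unsupported model type")
    z = spec["model"]["intercept"]
    for name, weight in spec["model"]["coefficients"].items():
        stats = spec["preprocessing"][name]
        z += weight * (record[name] - stats["mean"]) / stats["std"]
    return 1.0 / (1.0 + math.exp(-z))


probability = score(model_document, {"tickets_open": 6, "days_since_contact": 10})
```

Because the standardization constants travel with the coefficients, the scoring side cannot silently apply different preprocessing from the training side, which is one of the barriers listed earlier.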
Interpreting Models
• A prediction without an explanation limits its value
• Why is this outcome being predicted?
• What action should be taken as a result?
• Avoid ML models that are “black boxes”
• Tools for providing prediction explanations are emerging
• E.g. LIME
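To illustrate what a prediction explanation looks like, here is a one-at-a-time perturbation sketch: each feature of a single instance is reset to a baseline value and the change in the model's score is recorded. This is much simpler than LIME, which fits a local surrogate model over many perturbed samples; the `predict` function, feature names, and baseline are hypothetical.

```python
def predict(features):
    """Hypothetical black-box model: returns an escalation risk in [0, 1]."""
    risk = 0.1 + 0.15 * features["reopen_count"] + 0.02 * features["days_open"]
    if features["negative_sentiment"]:
        risk += 0.25
    return min(risk, 1.0)


def explain(instance, baseline):
    """Score change when each feature alone is reset to its baseline value."""
    original = predict(instance)
    contributions = {}
    for name in instance:
        perturbed = dict(instance)
        perturbed[name] = baseline[name]
        contributions[name] = original - predict(perturbed)
    # Largest absolute effect first.
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


instance = {"reopen_count": 2, "days_open": 5, "negative_sentiment": True}
baseline = {"reopen_count": 0, "days_open": 0, "negative_sentiment": False}
ranked = explain(instance, baseline)
```

The ranked list answers "why is this outcome being predicted?" for one case: here the reopen count contributes most, followed by sentiment, then ticket age.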
Example LIME output
Prototype in Python
• Explore the space!
• Work through the end-2-end solution
• Don’t prematurely optimize
• Great Python tooling
• e.g. Jupyter Notebooks, Cloudera Data Science Workbench
• Don’t let the data leak to laptops!
Python is slow
• Python is simple, flexible, and has a massive amount of available
functionality
• Pure Python typically hundreds of times slower than C
• Many Python implementations leverage C under-the-hood
• Even naive Scala or Java implementations are slow
1000X faster….
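The effect of pushing work into C can be observed with the standard library alone. The sketch below times an explicit Python loop against the built-in `sum`, whose loop runs in C inside CPython. The measured ratio depends on the machine and Python version, and it is not meant to reproduce the figures quoted above.

```python
import timeit

values = list(range(100_000))


def pure_python_sum(xs):
    """Sum with an explicit interpreter-level loop."""
    total = 0
    for x in xs:
        total += x
    return total


# Built-in sum() runs its loop in C inside CPython.
loop_result = pure_python_sum(values)
builtin_result = sum(values)

loop_seconds = timeit.timeit(lambda: pure_python_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)
print(f"loop: {loop_seconds:.4f}s  built-in: {builtin_seconds:.4f}s")
```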
Everything Python
• Python wrappers are available for most packages
• Even momentum in Spark is moving to Python
• Wrappers for C++ libraries like Shogun
Spark
• Optimizing for speed, data size or both?
• Increasingly rich set of ML algorithms
• Still missing common algorithms
• E.g. Multiclass GBTs
• Not all OSS implementations are good
• Hard to correctly resource Spark jobs
• Autotuning systems available
System Sizing
• Why go multi-node?
• CPU or Memory constraints
• Aggregate data size is very different from the size of the individual data sets
• A data lake can contain petabytes, but each dataset may be only tens of GB….
• Is the raw data bigger or smaller than final data being consumed by the model?
• Spark for ETL
• Is the algorithm itself parallel?
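A back-of-the-envelope check along these lines can be written down directly. The sketch assumes a dense numeric table of 8-byte values and a working-set multiplier of 3x for copies made during training; both numbers, and the example data set, are assumptions to be replaced with measurements.

```python
def estimated_bytes(rows, columns, bytes_per_value=8):
    """Rough in-memory size of a dense numeric table."""
    return rows * columns * bytes_per_value


def fits_on_single_node(dataset_bytes, node_memory_bytes, overhead_factor=3.0):
    """True when the data plus an assumed working-set multiplier fits in RAM."""
    return dataset_bytes * overhead_factor <= node_memory_bytes


GB = 1024 ** 3
TB = 1024 ** 4

# Hypothetical data set: 200 million rows of 50 float64 features.
dataset = estimated_bytes(rows=200_000_000, columns=50)
single_node_ok = fits_on_single_node(dataset, node_memory_bytes=1 * TB)
print(f"{dataset / GB:.1f} GiB, fits on a 1 TiB node: {single_node_ok}")
```

The point of the arithmetic is that the individual data set, not the size of the whole data lake, decides whether going multi-node is necessary.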
Single Node ML
• Single node memory on even x86 systems can now measure in
tens of terabytes
• Likely to expand further with NVDIMMs
• 40vCPU, ~1TB x86 only $4/hour on Google Cloud
• Many high performance single-node ML libraries exist!
Hive & Postgres
• On Hadoop, many data scientists are constrained to Hive or
Impala for security reasons
• Can be very limiting for ‘real’ data science
• Hivemall for analytics
• Is a traditional DB a better choice?
• Better performance in many instances
• Apache MADlib for analytics
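In-database analytics can be illustrated with plain SQL: per-customer features are aggregated inside the database and only the small result set is pulled into Python. SQLite stands in for Postgres here so the sketch runs without a server; the table and column names are hypothetical, and MADlib itself is not used.

```python
import sqlite3

# SQLite stands in for Postgres here so the sketch runs with no server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (customer_id INTEGER, hours_open REAL, escalated INTEGER)"
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [(1, 4.0, 0), (1, 12.0, 1), (2, 2.0, 0), (2, 3.0, 0), (2, 1.0, 0)],
)

# Per-customer features are computed inside the database; only the
# aggregated rows cross into Python.
features = conn.execute(
    """
    SELECT customer_id,
           COUNT(*)        AS ticket_count,
           AVG(hours_open) AS mean_hours_open,
           SUM(escalated)  AS escalations
    FROM tickets
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()
conn.close()
```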
Conclusions
• No one size fits all!
• Much more to a successful ML project than a cool model
• Not all frameworks play together
• Decisions can limit downstream options
• Need to think about the problem end-2-end
• From data acquisition to model deployment

Ideas spracklen-final

  • 1. Big Data for the rest of us Lawrence Spracklen SupportLogic lawrence@supportlogic.io www.linkedin.com/in/spracklen
  • 2. SupportLogic • Extract Signals from enterprise CRM systems • Applied machine learning • Complete vertical solution • Go-live in days! • We are hiring!
  • 3. @Scale 2018 • Sound like your Big Data problems? • This is Extreme data! • Do these solutions help or hinder Big Data for the rest of us? “Exabytes of data…..” “1500 manual labelers…..” “Sub second global propagation of likes…..”
  • 4. End-2-End Planning • Numerous steps/obstacles to successfully leveraging ML • Data Acquisition • Data Cleansing • Feature Engineering • Model Selection and Training • Model Optimization • Model Deployment • Model Feedback and Retraining • Important to consider all steps before deciding on an approach • Upstream decisions can severely limit downstream options
  • 5. ML Landscape • How do I build a successful production-grade solution from all these disparate components that don’t play well together?
  • 6. Data Set Availability • Is the necessary data available? • Are there HIPAA, PII, GDPR concerns? • Is it spread across multiple systems? • Can the systems communicate? • Data fusion • Move the compute to the data… • Legacy infrastructure decisions can dictate optimal approach
  • 7. Feature Engineering • Essential for model performance, efficacy, robustness and simplicity • Feature extraction • Feature selection • Feature construction • Feature elimination • Dimensionality reduction • Traditionally a laborious manual process • Automation techniques becoming available • e.g. TransmogrifAI, Featuretools • Leverage feature stores!
  • 8. Model Training • Big differences in the range of algorithms offered by different frameworks • Don’t just jump to the most complex! • Easy to automate selection process • Just click ‘go’ • Automate hyperparameter optimization • Beyond the nested for-loop!
  • 9. Model Ops • What happens after the models are created? • How does the business benefit from the insights? • Operationalization is frequently the weak link • Operationalizing PowerPoint? • Hand rolled scoring flows?
  • 10. Barriers to Model Ops • Scoring often performed on a different data platform to training • Framework-specific persistence formats • Complex data preprocessing requirements • Data cleansing and feature engineering • Batch training versus real-time (RT)/stream scoring • How frequently are models updated? • How is performance monitored?
  • 12. PMML & PFA • PMML has long been available as a framework-agnostic model representation • Frequently requires helper scripts • PFA is the potential successor…. • Addresses lots of PMML’s shortcomings • Scoring engines accepting R or Python scripts • Easy to use AWS Lambda!
  • 13. Interpreting Models • A prediction without an explanation limits its value • Why is this outcome being predicted? • What action should be taken as a result? • Avoid ML models that are “black boxes” • Tools for providing prediction explanations are emerging • e.g. LIME
  • 15. Prototype in Python • Explore the space! • Work through the end-2-end solution • Don’t prematurely optimize • Great Python tooling • e.g. Jupyter Notebooks, Cloudera Data Science Workbench • Don’t let the data leak to laptops!
  • 16. Python is slow • Python is simple, flexible and has massive available functionality • Pure Python is typically hundreds of times slower than C • Many Python implementations leverage C under the hood • Even naive Scala or Java implementations are slow
  • 18. Everything Python • Python wrappers are available for most packages • Even momentum in Spark is moving to Python • Wrappers for C++ libraries like Shogun
  • 19. Spark • Optimizing for speed, data size or both? • Increasingly rich set of ML algorithms • Still missing common algorithms • E.g. Multiclass GBTs • Not all OSS implementations are good • Hard to correctly resource Spark jobs • Autotuning systems available
  • 20. System Sizing • Why go multi-node? • CPU or Memory constraints • Aggregate data size is very different from the size of the individual data sets • A data lake can contain petabytes, but each dataset may be only tens of GB…. • Is the raw data bigger or smaller than the final data being consumed by the model? • Spark for ETL • Is the algorithm itself parallel?
  • 21. Single Node ML • Single node memory on even x86 systems can now measure in tens of terabytes • Likely to expand further with NVDIMMs • 40vCPU, ~1TB x86 only $4/hour on Google Cloud • Many high performance single-node ML libraries exist!
  • 22. Hive & Postgres • On Hadoop, many data scientists are constrained to Hive or Impala for security reasons • Can be very limiting for ‘real’ data science • Hivemall for analytics • Is a traditional DB a better choice? • Better performance in many instances • Apache MADlib for analytics
  • 23. Conclusions • No one size fits all! • Much more to a successful ML project than a cool model • Not all frameworks play together • Decisions can limit downstream options • Need to think about the problem end-2-end • From data acquisition to model deployment
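
Slide 8 suggests automating hyperparameter optimization and going "beyond the nested for-loop". The deck does not say which method it has in mind, so the following is only a minimal sketch of one common alternative, random search, shown next to the grid it replaces. The function `validation_loss`, the hyperparameter names `learning_rate` and `l2_penalty`, and the search ranges are all hypothetical stand-ins for a real train-and-evaluate step.

```python
import math
import random


def validation_loss(learning_rate, l2_penalty):
    """Hypothetical stand-in for 'train a model, return its validation loss'.

    A smooth bowl in log10 space with its minimum at learning_rate=0.03,
    l2_penalty=0.001. Replace with a real train/evaluate call.
    """
    lr_term = (math.log10(learning_rate) - math.log10(0.03)) ** 2
    l2_term = (math.log10(l2_penalty) - math.log10(0.001)) ** 2
    return lr_term + 0.1 * l2_term


def grid_search(learning_rates, l2_penalties):
    """The nested for-loop: evaluate every combination on a fixed grid."""
    best = None
    for lr in learning_rates:
        for l2 in l2_penalties:
            loss = validation_loss(lr, l2)
            if best is None or loss < best[0]:
                best = (loss, lr, l2)
    return best  # (loss, learning_rate, l2_penalty)


def random_search(n_trials, seed=0):
    """Sample each hyperparameter log-uniformly instead of walking a grid."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)   # assumed range: 1e-4 .. 1
        l2 = 10 ** rng.uniform(-6, -1)  # assumed range: 1e-6 .. 1e-1
        loss = validation_loss(lr, l2)
        if best is None or loss < best[0]:
            best = (loss, lr, l2)
    return best  # (loss, learning_rate, l2_penalty)


if __name__ == "__main__":
    grid = grid_search([1e-4, 1e-3, 1e-2, 1e-1, 1.0], [1e-6, 1e-4, 1e-2])
    rand = random_search(n_trials=15)  # same budget as the grid: 15 evaluations
    print("grid   best (loss, lr, l2):", grid)
    print("random best (loss, lr, l2):", rand)
```

With the same evaluation budget, random search tries a different value of every hyperparameter on each trial, whereas the grid only ever tests five learning rates and three penalties. Whether that helps depends on the problem, and libraries that implement smarter strategies (for example Bayesian optimization) may be a better fit for expensive models.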