SlideShare uma empresa Scribd logo
1 de 18
Baixar para ler offline
ML + H2O
AlexTellez & Michal Malohlava
www.h2o.ai
lib .ai
THE RED PILL (SPARK + ML)
Finally, ONE TO RULE THEM ALL!
1. Scrape & Collect Data
2. Cleanse Data + Feature Extraction / Engineering
3. Build Machine Learning Models + Iterate
4. Throw More Data to Improve Model
5. Deploy Model(s) in Real-Time
THE BLUE PILL (H2O.AI)
What is H2O? (water, duh!)
It is ALSO an open-source, distributed and parallel predictive
engine for machine learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use
=
Happy Data Scientists / Analysts
WHY NOT BOTH PILLS?!
Build smarter applications USING BOTH in harmony within
the Spark Ecosystem !!!
Convert Spark RDDs H2O RDDs for Machine Learning
LET’S BUILD AN APP!
Task: Predict the job category from
a Craigslist AdTitle
ML WORKFLOW
1. Perform Feature Extraction on Words + Munging
2. Run Word2Vec algo (MLlib) on JobTitle words
3. Create “title vectors” from
individual word vectors for each job title
4. Pass the Spark RDD H2O RDD for ML in Flow
5. Run H2O GBM algorithm on H2O RDD
6. Create Spark Streaming Application + Score on new data
1.TEXT MUNGING
Example: “Site Supervisor and Pre K Teachers Needed Now!!!”
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
val tokens = jobTitles.map(line => token(line))
Next: Apply Spark’s Word2Vec model to each word
2.WORD2VEC
Simply: A mathematical way to represent a single word as a vector of
numbers. These vector ‘representations’ encode information about the
about a given word (i.e. its meaning)
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
Post Word2Vec Results:
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
BUTTHAT’S ON WORDS!
Post Word2Vec Results:
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
WE NEED TITLE VECTORS BASED ON ALL THE WORDS!
HOW?
Averaging word vectors to make ‘TitleVectors’
v(King) - v(Man) +V(Woman) ~ v(Queen)
3.TITLEVECTORS
In Steps:
1. Sum the word2vec vectors in a given title
2. Divide this sum by # of words in a given title
Result: ~ Average vector for a given title of N words
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]+
+
Divide by Total Words (post tokenization)
~ (site supervisor….needed), [0.998, 0.349, 0.621…….0.915]
4. PASS SPARK RDDTO H2O
OPEN H2O FLOW!
5. BUILD A MODEL!
80% ACCURACY - DEFAULT!
Algo: Gradient Boosting Machine
#Trees: 50
# Bins: 20
Depth: 5
(ALL DEFAULTVALUES)
~ 20% Error Rate
6. SPARK STREAMING +
DEPLOYMENT
Create Spark Streaming App to read in new Job Titles
a) Create a Spark Streaming Producer - Reads data from a file &
generates events in real-time which we will predict category.
APP ARCHITECTURE
Posting
job title
“HIRING
Painting
CONTRACTORS
NOW!!!”
Stream
Categorize
a job title
Prediction = “Labor”
Re-train
the model
Craigslist jobs
Word2Vec
Model
GBM

Model
Word2Vec
Train a model
“ASK CRAIG” LIVE DEMO!
END-TO-END
In JUST 25 minutes…we:
1. Performed sophisticated feature extraction + engineering
2. Passed a Spark RDD H2O RDD for ML
3. Created a Spark Stream to read in new data
5. “Productionalized” H2O + Spark MLlib model to score on new data
So happy I took
both pills!
4. Built a GBM to classify titles w/ 80% accuracy
TRY SPARKLING WATER!!
Download @ h2o.ai
Coming Soon: Release 1.4 for Spark 1.4!
NEW GUI! H2O FLOW
Meetup: SiliconValley Big Data Science

Mais conteúdo relacionado

Mais procurados

Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesRomi Kuntsman
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Databricks
 
Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !
Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !
Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !Bruno Bonnin
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkLegacy Typesafe (now Lightbend)
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1Joe Stein
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot InstancesAmazon Web Services
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingSpark Summit
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 

Mais procurados (20)

Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
 
Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !
Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !
Apache Spark avec NodeJS ? Oui, c'est possible avec EclairJS !
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Vocanic Map Reduce Lite
Vocanic Map Reduce LiteVocanic Map Reduce Lite
Vocanic Map Reduce Lite
 

Semelhante a Sparkling Water, ASK CRAIG

Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...NoSQLmatters
 
Rails in the Cloud
Rails in the CloudRails in the Cloud
Rails in the Cloudiwarshak
 
Web technologies-course 12.pptx
Web technologies-course 12.pptxWeb technologies-course 12.pptx
Web technologies-course 12.pptxStefan Oprea
 
Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...
Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...
Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...Jim Czuprynski
 
BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)
BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)
BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)Ifeanyi I Nwodo(De Jeneral)
 
Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBRussell Jurney
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaSri Ambati
 
Deep learning Malaysia presentation 12/4/2017
Deep learning Malaysia presentation 12/4/2017Deep learning Malaysia presentation 12/4/2017
Deep learning Malaysia presentation 12/4/2017Brian Ho
 
Part 7 packaging and deployment
Part 7 packaging and deploymentPart 7 packaging and deployment
Part 7 packaging and deploymenttechbed
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Yeoman AngularJS and D3 - A solid stack for web apps
Yeoman AngularJS and D3 - A solid stack for web appsYeoman AngularJS and D3 - A solid stack for web apps
Yeoman AngularJS and D3 - A solid stack for web appsclimboid
 
M.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMS
M.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMSM.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMS
M.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMSSupriya Radhakrishna
 
Engineering Highly Maintainable Code: Maintain or Innovate
Engineering Highly Maintainable Code: Maintain or InnovateEngineering Highly Maintainable Code: Maintain or Innovate
Engineering Highly Maintainable Code: Maintain or InnovateSteve Andrews
 
Semplificare l'observability per progetti Serverless
Semplificare l'observability per progetti ServerlessSemplificare l'observability per progetti Serverless
Semplificare l'observability per progetti ServerlessLuciano Mammino
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020Thodoris Bais
 
Entity frameworks101
Entity frameworks101Entity frameworks101
Entity frameworks101Rich Helton
 

Semelhante a Sparkling Water, ASK CRAIG (20)

Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
 
Rails in the Cloud
Rails in the CloudRails in the Cloud
Rails in the Cloud
 
Web technologies-course 12.pptx
Web technologies-course 12.pptxWeb technologies-course 12.pptx
Web technologies-course 12.pptx
 
Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...
Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...
Politics Ain’t Beanbag: Using APEX, ML, and GeoCoding In a Modern Election Ca...
 
Ruby On Rails
Ruby On RailsRuby On Rails
Ruby On Rails
 
BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)
BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)
BI Tutorial (Copying Data from Oracle to Microsoft SQLServer)
 
Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDB
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
 
Deep learning Malaysia presentation 12/4/2017
Deep learning Malaysia presentation 12/4/2017Deep learning Malaysia presentation 12/4/2017
Deep learning Malaysia presentation 12/4/2017
 
Part 7 packaging and deployment
Part 7 packaging and deploymentPart 7 packaging and deployment
Part 7 packaging and deployment
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Yeoman AngularJS and D3 - A solid stack for web apps
Yeoman AngularJS and D3 - A solid stack for web appsYeoman AngularJS and D3 - A solid stack for web apps
Yeoman AngularJS and D3 - A solid stack for web apps
 
M.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMS
M.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMSM.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMS
M.TECH 1ST SEM COMPUTER SCIENCE ADBMS LAB PROGRAMS
 
Engineering Highly Maintainable Code: Maintain or Innovate
Engineering Highly Maintainable Code: Maintain or InnovateEngineering Highly Maintainable Code: Maintain or Innovate
Engineering Highly Maintainable Code: Maintain or Innovate
 
02 objective-c session 2
02  objective-c session 202  objective-c session 2
02 objective-c session 2
 
Semplificare l'observability per progetti Serverless
Semplificare l'observability per progetti ServerlessSemplificare l'observability per progetti Serverless
Semplificare l'observability per progetti Serverless
 
Linq to sql
Linq to sqlLinq to sql
Linq to sql
 
GraphDatabase.pptx
GraphDatabase.pptxGraphDatabase.pptx
GraphDatabase.pptx
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
 
Entity frameworks101
Entity frameworks101Entity frameworks101
Entity frameworks101
 

Mais de Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

Mais de Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Último

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 

Último (20)

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 

Sparkling Water, ASK CRAIG

  • 1. ML + H2O AlexTellez & Michal Malohlava www.h2o.ai lib .ai
  • 2. THE RED PILL (SPARK + ML) Finally, ONE TO RULE THEM ALL! 1. Scrape & Collect Data 2. Cleanse Data + Feature Extraction / Engineering 3. Build Machine Learning Models + Iterate 4. Throw More Data to Improve Model 5. Deploy Model(s) in Real-Time
  • 3. THE BLUE PILL (H2O.AI) What is H2O? (water, duh!) It is ALSO an open-source, distributed and parallel predictive engine for machine learning. What makes H2O different? Cutting-edge algorithms + parallel architecture + ease-of-use = Happy Data Scientists / Analysts
  • 4. WHY NOT BOTH PILLS?! Build smarter applications USING BOTH in harmony within the Spark Ecosystem !!! Convert Spark RDDs H2O RDDs for Machine Learning
  • 5. LET’S BUILD AN APP! Task: Predict the job category from a Craigslist AdTitle
  • 6. ML WORKFLOW 1. Perform Feature Extraction on Words + Munging 2. Run Word2Vec algo (MLlib) on JobTitle words 3. Create “title vectors” from individual word vectors for each job title 4. Pass the Spark RDD H2O RDD for ML in Flow 5. Run H2O GBM algorithm on H2O RDD 6. Create Spark Streaming Application + Score on new data
  • 7. 1.TEXT MUNGING Example: “Site Supervisor and Pre K Teachers Needed Now!!!” Post Tokenization: Seq(site, supervisor, pre, teachers, needed) val tokens = jobTitles.map(line => token(line)) Next: Apply Spark’s Word2Vec model to each word
  • 8. 2.WORD2VEC Simply: A mathematical way to represent a single word as a vector of numbers. These vector ‘representations’ encode information about the about a given word (i.e. its meaning) Post Tokenization: Seq(site, supervisor, pre, teachers, needed) Post Word2Vec Results: needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
  • 9. BUTTHAT’S ON WORDS! Post Word2Vec Results: needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] WE NEED TITLE VECTORS BASED ON ALL THE WORDS! HOW? Averaging word vectors to make ‘TitleVectors’ v(King) - v(Man) +V(Woman) ~ v(Queen)
  • 10. 3.TITLEVECTORS In Steps: 1. Sum the word2vec vectors in a given title 2. Divide this sum by # of words in a given title Result: ~ Average vector for a given title of N words needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987] supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]+ + Divide by Total Words (post tokenization) ~ (site supervisor….needed), [0.998, 0.349, 0.621…….0.915]
  • 11. 4. PASS SPARK RDDTO H2O OPEN H2O FLOW!
  • 12. 5. BUILD A MODEL!
  • 13. 80% ACCURACY - DEFAULT! Algo: Gradient Boosting Machine #Trees: 50 # Bins: 20 Depth: 5 (ALL DEFAULTVALUES) ~ 20% Error Rate
  • 14. 6. SPARK STREAMING + DEPLOYMENT Create Spark Streaming App to read in new Job Titles a) Create a Spark Streaming Producer - Reads data from a file & generates events in real-time which we will predict category.
  • 15. APP ARCHITECTURE Posting job title “HIRING Painting CONTRACTORS NOW!!!” Stream Categorize a job title Prediction = “Labor” Re-train the model Craigslist jobs Word2Vec Model GBM
 Model Word2Vec Train a model
  • 17. END-TO-END In JUST 25 minutes…we: 1. Performed sophisticated feature extraction + engineering 2. Passed a Spark RDD H2O RDD for ML 3. Created a Spark Stream to read in new data 5. “Productionalized” H2O + Spark MLlib model to score on new data So happy I took both pills! 4. Built a GBM to classify titles w/ 80% accuracy
  • 18. TRY SPARKLING WATER!! Download @ h2o.ai Coming Soon: Release 1.4 for Spark 1.4! NEW GUI! H2O FLOW Meetup: SiliconValley Big Data Science