Sparkling Water, ASK CRAIG

Machine Learning, Deep Learning - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

ML + H2O
Alex Tellez & Michal Malohlava
www.h2o.ai

THE RED PILL (SPARK + ML)
Finally, ONE TO RULE THEM ALL!
1. Scrape & Collect Data
2. Cleanse Data + Feature Extraction / Engineering
3. Build Machine Learning Models + Iterate
4. Throw More Data to Improve Model
5. Deploy Model(s) in Real-Time

THE BLUE PILL (H2O.AI)
What is H2O? (water, duh!)
It is ALSO an open-source, distributed and parallel predictive
engine for machine learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use
=
Happy Data Scientists / Analysts

WHY NOT BOTH PILLS?!
Build smarter applications USING BOTH in harmony within
the Spark Ecosystem!!!
Convert Spark RDDs → H2O RDDs for Machine Learning

LET’S BUILD AN APP!
Task: Predict the job category from
a Craigslist ad title

ML WORKFLOW
1. Perform Feature Extraction on Words + Munging
2. Run Word2Vec algo (MLlib) on job title words
3. Create “title vectors” from individual word vectors for each job title
4. Pass the Spark RDD → H2O RDD for ML in Flow
5. Run H2O GBM algorithm on H2O RDD
6. Create Spark Streaming Application + Score on new data

1. TEXT MUNGING
Example: “Site Supervisor and Pre K Teachers Needed Now!!!”
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
val tokens = jobTitles.map(line => token(line))
Next: Apply Spark’s Word2Vec model to each word
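
A minimal sketch of what a `token` helper like the one above could look like, in plain Scala without Spark. The stop-word list and the two-character minimum are assumptions chosen to reproduce the example on this slide; the talk does not show its actual rules.

```scala
object TextMunging {
  // Illustrative stop words (an assumption, not the list used in the talk).
  val stopWords: Set[String] = Set("a", "an", "and", "the", "of", "for", "now")

  // Lowercase, split on anything that is not a letter, then drop
  // very short tokens and stop words.
  def token(line: String): Seq[String] =
    line.toLowerCase
      .split("[^a-z]+")
      .toSeq
      .filter(w => w.length >= 2 && !stopWords.contains(w))
}
```

With these assumptions, `TextMunging.token("Site Supervisor and Pre K Teachers Needed Now!!!")` yields `Seq(site, supervisor, pre, teachers, needed)`. On an RDD the same function would be applied as `jobTitles.map(line => TextMunging.token(line))`.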

2. WORD2VEC
Simply: A mathematical way to represent a single word as a vector of
numbers. These vector ‘representations’ encode information about
a given word (i.e. its meaning)
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
Post Word2Vec Results:
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

BUT THAT’S ON WORDS!
Post Word2Vec Results:
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
WE NEED TITLE VECTORS BASED ON ALL THE WORDS!
HOW?
Averaging word vectors to make ‘TitleVectors’
v(King) - v(Man) + v(Woman) ~ v(Queen)
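
To make the vector arithmetic concrete, here is a toy sketch in plain Scala. The three-dimensional vectors are hand-made for illustration (dimensions: royalty, male, female) and are NOT real Word2Vec output; `analogy` is a hypothetical helper that picks the nearest remaining word by cosine similarity.

```scala
object WordAnalogy {
  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  // Cosine similarity; returns 0.0 if either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Hand-made toy vectors (assumption): (royalty, male, female).
  val toy: Map[String, Vec] = Map(
    "king"     -> Array(1.0, 1.0, 0.0),
    "man"      -> Array(0.0, 1.0, 0.0),
    "woman"    -> Array(0.0, 0.0, 1.0),
    "queen"    -> Array(1.0, 0.0, 1.0),
    "prince"   -> Array(0.5, 1.0, 0.0),
    "princess" -> Array(0.5, 0.0, 1.0)
  )

  // Computes v(a) - v(b) + v(c) and returns the closest word
  // among those not used in the query.
  def analogy(a: String, b: String, c: String, vectors: Map[String, Vec]): String = {
    val target = add(sub(vectors(a), vectors(b)), vectors(c))
    (vectors -- Set(a, b, c)).maxBy { case (_, v) => cosine(target, v) }._1
  }
}
```

On this toy data, `WordAnalogy.analogy("king", "man", "woman", WordAnalogy.toy)` returns `"queen"`. Real embeddings only satisfy the relation approximately.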

3. TITLE VECTORS
In Steps:
1. Sum the word2vec vectors in a given title
2. Divide this sum by # of words in a given title
Result: ~ Average vector for a given title of N words
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
+ site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
+ supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
Divide by Total Words (post tokenization)
~ (site supervisor….needed), [0.998, 0.349, 0.621…….0.915]
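
The two steps above can be sketched in plain Scala as follows. `wordVectors` is a hypothetical in-memory lookup standing in for the trained Word2Vec model, and the handling of words missing from it is an assumption: they add nothing to the sum but still count in the divisor, following the "total words" wording on the slide.

```scala
object TitleVectors {
  // Sum the vectors of all words in a title, then divide by the word count.
  // Assumes every vector in `wordVectors` has at least `dim` entries.
  def titleVector(words: Seq[String],
                  wordVectors: Map[String, Array[Double]],
                  dim: Int): Array[Double] = {
    val sum = Array.fill(dim)(0.0)
    for (w <- words; v <- wordVectors.get(w); i <- 0 until dim) sum(i) += v(i)
    // An empty title yields the zero vector instead of dividing by zero.
    if (words.isEmpty) sum else sum.map(_ / words.size)
  }
}
```

For example, averaging `[1.0, 2.0]` and `[3.0, 4.0]` gives the title vector `[2.0, 3.0]`.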

4. PASS SPARK RDD TO H2O
OPEN H2O FLOW!

5. BUILD A MODEL!

80% ACCURACY - DEFAULT!
Algo: Gradient Boosting Machine
# Trees: 50
# Bins: 20
Depth: 5
(ALL DEFAULT VALUES)
~ 20% Error Rate

6. SPARK STREAMING + DEPLOYMENT
Create Spark Streaming App to read in new Job Titles
a) Create a Spark Streaming Producer - Reads data from a file &
generates events in real time, for which we will predict the category.
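
A rough sketch of the producer idea in plain Scala: replay lines as events with a pause between them. `produce`, `emit` and `delayMs` are hypothetical names; the slides do not say how events reach the Spark Streaming job, so the sink is left as a callback here.

```scala
object JobTitleProducer {
  // Emits each non-blank line as one event, sleeping `delayMs` between
  // events to mimic real-time arrival. Returns the number of events sent.
  def produce(lines: Iterator[String], delayMs: Long)(emit: String => Unit): Int = {
    var count = 0
    lines.map(_.trim).filter(_.nonEmpty).foreach { title =>
      emit(title)
      count += 1
      if (delayMs > 0) Thread.sleep(delayMs)
    }
    count
  }
}
```

The lines could come from `scala.io.Source.fromFile(path).getLines()`, and `emit` could write to whatever source the streaming app reads from, for instance a socket or a message queue.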

APP ARCHITECTURE
[Architecture diagram]
Training: Craigslist jobs → Word2Vec → Train a model → Word2Vec Model + GBM Model (Re-train the model)
Scoring: Posting job title (“HIRING Painting CONTRACTORS NOW!!!”) → Stream → Categorize a job title → Prediction = “Labor”

“ASK CRAIG” LIVE DEMO!

END-TO-END
In JUST 25 minutes…we:
1. Performed sophisticated feature extraction + engineering
2. Passed a Spark RDD → H2O RDD for ML
3. Created a Spark Stream to read in new data
4. Built a GBM to classify titles w/ 80% accuracy
5. “Productionalized” H2O + Spark MLlib model to score on new data
So happy I took both pills!

TRY SPARKLING WATER!!
Download @ h2o.ai
Coming Soon: Release 1.4 for Spark 1.4!
NEW GUI! H2O FLOW
Meetup: Silicon Valley Big Data Science
