Sparkling Water
Meetup
@h2oai & @mmalohlava
presents
Memory efficient
Performance of computation
Machine learning algorithms
Parser, GUI, R-interface
User-friendly API for data transformation
Large and active community
Platform components - SQL
Multitenancy
Sparkling Water
Spark (RDD: the immutable world) + H2O (DataFrame: the mutable world), both on top of HDFS
[Diagram: Sparkling Water combines Spark and H2O over HDFS, with RDDs and H2O DataFrames side by side]
Sparkling Water Provides
Transparent integration into the Spark ecosystem
Pure H2ORDD encapsulating an H2O DataFrame
Transparent use of H2O data structures and algorithms with the Spark API
Excels in Spark workflows requiring advanced Machine Learning algorithms
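For orientation, here is a minimal sketch (not from the slides) of what this transparent integration looks like in code. It uses only calls that appear later in the hands-on and assumes sc is the SparkContext provided by the shell; the names mirror the flight-delay example below:

import org.apache.spark.h2o._              // H2OContext and H2O's DataFrame
import org.apache.spark.examples.h2o._     // demo classes such as Airlines

val h2oContext = new H2OContext(sc).start()
import h2oContext._                        // implicit conversions between Spark and H2O data

// H2O -> Spark: wrap an H2O DataFrame as a typed RDD and use the regular Spark API
val airlinesFrame = new DataFrame(new java.io.File("examples/smalldata/year2005.csv.gz"))
val airlinesRDD   = asRDD[Airlines](airlinesFrame)
val ordFlights    = airlinesRDD.filter(f => f.Dest == Some("ORD"))

// Spark -> H2O: Spark transformations and SQL results are handed to H2O algorithms,
// being converted to H2O frames on the way (shown step by step in the hands-on below)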
Sparkling Water Design
[Diagram: spark-submit sends a Sparkling App to the Spark Master JVM, which schedules it onto the Spark Worker JVMs. The Sparkling Water cluster is formed by the Spark Executor JVMs, each embedding an H2O instance. The submitted JAR contains the application and Sparkling Water classes.]
Data Distribution
[Diagram: a data source (e.g. HDFS) is loaded into the H2O instances running inside the Spark Executor JVMs of the Sparkling Water cluster. The H2O RDD and the Spark RDD live in the same executors, so RDDs and DataFrames share the same memory space.]
Devel Internals
Sparkling Water Assembly = application code + Sparkling Water Core + H2O Core, H2O Algos, H2O Scala API and H2O Flow, running on the Spark platform (Spark Core, Spark SQL)
The assembly is deployed to the Spark cluster as a regular Spark application
Hands-On #1: Sparkling Shell
Sparkling Water Requirements
Linux or Mac OS X
Oracle Java 1.7+
Spark 1.1.0
Download
http://h2o.ai/download/
Where is the code?
https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/
Flight delays prediction
"Build a model using weather and flight data to predict delays of flights arriving at Chicago O'Hare International Airport"
Example Outline
Load & Parse CSV data from 2 data sources
Use Spark API to filter data, do SQL query for join
Create regression models
Use models to predict delays
Graph residual plot from R
Install and Launch
1. Unpack the zip file
2. Point SPARK_HOME to your Spark 1.1.0 installation
3. Launch bin/sparkling-shell
What is Sparkling Shell?
A standard spark-shell with additional Sparkling Water classes on the classpath:

export MASTER="local-cluster[3,2,1024]"   # Spark Master address
spark-shell --jars sparkling-water.jar    # JAR containing Sparkling Water
Let's play with the Sparkling Shell…
Create H2O Client
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._    // demo-specific classes

// sc is the regular Spark context provided by the Spark shell;
// 3 is the size of the demanded H2O cloud
val h2oContext = new H2OContext(sc).start(3)

import h2oContext._                       // contains implicit utility functions
Is Spark Running?
Go to http://localhost:4040
Is H2O running?
http://localhost:54321/flow/index.html
Load Data #1
Load weather data into an RDD:

// Regular Spark API
val weatherDataFile = "examples/smalldata/weather.csv"
val wrawdata = sc.textFile(weatherDataFile, 3).cache()

// Ad-hoc parser: turn CSV rows into Weather objects, dropping malformed rows
val weatherTable = wrawdata
  .map(_.split(","))
  .map(row => WeatherParse(row))
  .filter(!_.isWrongRow())
Weather Data
case class Weather( val Year   : Option[Int],
                    val Month  : Option[Int],
                    val Day    : Option[Int],
                    val TmaxF  : Option[Int],   // Max temperature in F
                    val TminF  : Option[Int],   // Min temperature in F
                    val TmeanF : Option[Float], // Mean temperature in F
                    val PrcpIn : Option[Float], // Precipitation (inches)
                    val SnowIn : Option[Float], // Snow (inches)
                    val CDD    : Option[Float], // Cooling Degree Day
                    val HDD    : Option[Float], // Heating Degree Day
                    val GDD    : Option[Float]) // Growing Degree Day

Simple POJO to hold one row of weather data
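The ad-hoc parser WeatherParse and the isWrongRow check used in Load Data #1 ship with the org.apache.spark.examples.h2o package. A hypothetical sketch of such a parser, assuming (for illustration only) that the CSV columns arrive in the same order as the fields above:

// Hypothetical sketch in the spirit of WeatherParse (the real one comes from the
// examples package; the column order here is an assumption for illustration).
object WeatherParseSketch {
  private def toInt(s: String): Option[Int]     = scala.util.Try(s.trim.toInt).toOption
  private def toFloat(s: String): Option[Float] = scala.util.Try(s.trim.toFloat).toOption

  def apply(row: Array[String]): Weather = Weather(
    toInt(row(0)),   toInt(row(1)),   toInt(row(2)),       // Year, Month, Day
    toInt(row(3)),   toInt(row(4)),   toFloat(row(5)),     // TmaxF, TminF, TmeanF
    toFloat(row(6)), toFloat(row(7)),                      // PrcpIn, SnowIn
    toFloat(row(8)), toFloat(row(9)), toFloat(row(10)))    // CDD, HDD, GDD
}
// A row would count as "wrong" (isWrongRow) when key fields such as the date fail to parse.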
Load Data #2
Load flights data into an H2O frame:

import java.io.File

val dataFile = "examples/smalldata/year2005.csv.gz"

// Shortcut for data load and parse
val airlinesData = new DataFrame(new File(dataFile))
Where is the data?
Go to http://localhost:54321/flow/index.html
Use Spark API for Data Filtering
// Create a cheap RDD wrapper around the H2O DataFrame
val airlinesTable : RDD[Airlines] = asRDD[Airlines](airlinesData)

// And use the Spark RDD API directly (a regular Spark RDD call)
val flightsToORD = airlinesTable.filter( f => f.Dest == Some("ORD") )
Use Spark SQL to Join Data
import org.apache.spark.sql.SQLContext

// We need to create a SQL context; making it implicit shares it with h2oContext
implicit val sqlContext = new SQLContext(sc)
import sqlContext._

flightsToORD.registerTempTable("FlightsToORD")
weatherTable.registerTempTable("WeatherORD")
Join Data based on Flight Date
val joinedTable = sql(
  """SELECT
    | f.Year,f.Month,f.DayofMonth,
    | f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
    | f.UniqueCarrier,f.FlightNum,f.TailNum,
    | f.Origin,f.Distance,
    | w.TmaxF,w.TminF,w.TmeanF,
    | w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
    | f.ArrDelay
    | FROM FlightsToORD f
    | JOIN WeatherORD w
    | ON f.Year=w.Year AND f.Month=w.Month
    | AND f.DayofMonth=w.Day""".stripMargin)
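To sanity-check the join before modeling, the result can be inspected with the regular Spark API. A minimal sketch, not part of the original script:

// joinedTable is an ordinary SchemaRDD, so the usual Spark actions apply
println(s"Joined rows: ${joinedTable.count()}")
joinedTable.take(3).foreach(println)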
Split data
import hex.splitframe.SplitFrame
import hex.splitframe.SplitFrameModel.SplitFrameParameters

val sfParams = new SplitFrameParameters()
// The result of the SQL query is implicitly converted into an H2O DataFrame
sfParams._train = joinedTable
sfParams._ratios = Array(0.7, 0.2)

val sf = new SplitFrame(sfParams)
val splits = sf.trainModel().get._output._splits

val trainTable = splits(0)
val validTable = splits(1)
val testTable  = splits(2)
Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._train = trainTable
dlParams._response_column = 'ArrDelay
dlParams._valid = validTable
dlParams._epochs = 100
dlParams._reproducible = true
dlParams._force_load_balance = false

// Create a new model builder and train the model (blocking call)
val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get
Make a prediction
// Use the model to score data
val dlPredictTable = dlModel.score(testTable)('predict)

// Collect predicted values via the RDD API
val predictionValues = toSchemaRDD(dlPredictTable)
  .collect
  .map(row =>
    if (row.isNullAt(0))
      Double.NaN
    else
      row(0))
Hands-On #2: Can I access results from R?
YES!
Requirements
R 3.1.2+
RStudio
H2O R package
Install R package
You can find the R package on the USB stick.
1. Open RStudio
2. Click on "Install Packages"
3. Select the h2o_0.1.20.99999.tar.gz file from the USB stick
Generate R code
In the Sparkling Shell:

import org.apache.spark.examples.h2o.DemoUtils.residualPlotRCode

// Utility generating R code to show a residuals plot
// for predicted and actual values
residualPlotRCode(
  predictionH2OFrame, 'predict,
  testFrame, 'ArrDelay)
Residuals Plot in R
# Import H2O library and initialize H2O client
library(h2o)
h = h2o.init()
# Fetch prediction and actual data via the remembered frame keys
pred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")
act = h2o.getFrame(h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select the right columns
predDelay = pred$predict
actDelay = act$ArrDelay
# Make sure that the number of rows is the same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals
compare = cbind(
  as.data.frame(actDelay$ArrDelay),
  as.data.frame(residuals$predict))
plot( compare[,1:2] )
Warning!
If you are running R v3.1.0 you will see a different plot. Why? Floating-point number handling was changed in that version. Our recommendation is to upgrade your R to the newest version.
Try GBM Algo
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

val gbmParams = new GBMParameters()
gbmParams._train = trainTable
gbmParams._response_column = 'ArrDelay
gbmParams._valid = validTable
gbmParams._ntrees = 100

val gbm = new GBM(gbmParams)
val gbmModel = gbm.trainModel.get

// Print R code for the residual plot
val gbmPredictTable = gbmModel.score(testTable)('predict)
printf( residualPlotRCode(gbmPredictTable, 'predict, testTable, 'ArrDelay) )
Residuals plot for GBM
[Plot: residuals vs. prediction]
Hands-On #3: How Can I Develop and Run a Standalone App?
Requirements
IntelliJ IDEA or Eclipse
Git
Use Sparkling Water Droplet
Clone the H2O Droplets repository:

git clone https://github.com/h2oai/h2o-droplets.git
cd h2o-droplets/sparkling-water-droplet/
Generate IDE project
For IDEA:    ./gradlew idea
For Eclipse: ./gradlew eclipse
… and import the project into your IDE
Create An Application
object AirlinesWeatherAnalysis {

  /** Entry point */
  def main(args: Array[String]) {
    // Configure this application
    val conf: SparkConf = new SparkConf().setAppName("Flights Water")
    conf.setIfMissing("spark.master", sys.env.getOrElse("spark.master", "local"))

    // Create SparkContext to execute the application on the Spark cluster
    val sc = new SparkContext(conf)

    // Create H2O context and start H2O on top of Spark (H2O cluster only)
    new H2OContext(sc).start()

    // User code
    // . . .
  }
}
Build the Application
./gradlew build shadowJar
build runs the build and tests; shadowJar creates an assembly which can be submitted to a Spark cluster.
Run code on Spark
#!/usr/bin/env bash
APP_CLASS=water.droplets.AirlinesWeatherAnalysis
FAT_JAR_FILE="build/libs/sparkling-water-droplet-app.jar"
MASTER=${MASTER:-"local-cluster[3,2,1024]"}
DRIVER_MEMORY=2g

$SPARK_HOME/bin/spark-submit "$@" \
  --driver-memory $DRIVER_MEMORY \
  --master $MASTER \
  --class "$APP_CLASS" $FAT_JAR_FILE
It is Open Source!
You can participate in:
H2O Scala API
Sparkling Water testing
Mesos, YARN, workflows (PUBDEV-23,26,27,31-33)
Spark integration
MLlib
Pipelines
Check out our JIRA at http://jira.h2o.ai
Come to Meetup
http://www.meetup.com/Silicon-Valley-Big-Data-Science/
More info
Check out the H2O.ai Training Books
http://learn.h2o.ai/

Check out the H2O.ai Blog for Sparkling Water tutorials
http://h2o.ai/blog/

Check out the H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata

Check out GitHub
https://github.com/h2oai/sparkling-water
Learn more about H2O at h2o.ai, or:

> for r in sparkling-water; do
    git clone "git@github.com:h2oai/$r.git"
  done

Thank you!
Follow us at @h2oai
And the winner is
…
