Scalable Machine Learning
me: 
Sam Bessalah 
Software Engineer, Freelance 
Big Data, Distributed Computing, Machine Learning 
Paris Data Geek Co-organizer 
@samklr @DataParis
Machine Learning Land 
VOWPAL WABBIT
Some Observations in Big Data Land
● New use cases push towards faster execution platforms and real-time prediction engines.
● Traditional MapReduce on Hadoop is fading away, especially for Machine Learning.
● Apache Spark has become the darling of the Big Data world, thanks to its high-level API and performance.
● Rise of public Machine Learning APIs that make it easy to integrate models into applications and other data processing workflows.
Mahout
● Used to be the only Machine Learning framework on Hadoop MapReduce
● Moved from MapReduce towards modern and faster
backends, namely
● Now provides a fluent DSL that integrates with Scala and Spark
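Mahout Example
Simple co-occurrence analysis in Mahout (sketch: computeLLR is a placeholder that the slide does not define)
// Load the interaction matrix A as a distributed row matrix
val A = drmFromHDFS("hdfs://nivdul/babygirl.txt")
// Co-occurrence matrix: C = A' * A
val C = A.t %*% A
// Broadcast the per-column interaction counts to the workers
val numInteractions = drmBroadcast(A.colSums)
// Turn each block of co-occurrence counts into a block of LLR indicators
val I = C.mapBlock() {
  case (keys, block) =>
    val indicatorBlock = computeLLR(block, numInteractions)
    keys -> indicatorBlock
}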
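To make the two steps of the example concrete, here is a minimal plain-Scala sketch on a tiny in-memory matrix, with no Mahout involved. It assumes binary interactions (rows are users, columns are items) and uses the entropy-based form of the log-likelihood ratio; the names CooccurrenceSketch, cooccurrence, llr and score are made up for this illustration and are not part of the Mahout DSL.

```scala
object CooccurrenceSketch {
  // Rows are users, columns are items; 1 marks an interaction (assumed binary).
  type Matrix = Vector[Vector[Int]]

  // C = A' * A: entry (i, j) counts the users who interacted with both item i and item j.
  def cooccurrence(a: Matrix): Matrix = {
    val items = a.head.indices
    items.map(i => items.map(j => a.map(row => row(i) * row(j)).sum).toVector).toVector
  }

  private def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // Unnormalized entropy of a list of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // Log-likelihood ratio of a 2x2 contingency table.
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    // Clamp tiny negative values caused by floating-point rounding.
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }

  // LLR score for the item pair (i, j), built from the co-occurrence counts.
  def score(a: Matrix, i: Int, j: Int): Double = {
    val c = cooccurrence(a)
    val k11 = c(i)(j)          // users with both items
    val k12 = c(i)(i) - k11    // users with item i only
    val k21 = c(j)(j) - k11    // users with item j only
    val k22 = a.size - k11 - k12 - k21
    llr(k11, k12, k21, k22)
  }

  val a: Matrix = Vector(
    Vector(1, 1, 0),
    Vector(1, 1, 0),
    Vector(0, 0, 1),
    Vector(1, 0, 1))
  val c: Matrix = cooccurrence(a)
}
```

A score near zero means the two items co-occur about as often as chance would predict. The distributed version applies the same arithmetic block by block instead of on one in-memory matrix.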
Spark is a dataflow system built around immutable, lazily evaluated, in-memory distributed
collections, suited to the iterative and complex transformations found in most Machine
Learning algorithms.
These in-memory collections are called Resilient Distributed Datasets (RDDs).
They provide:
● Partitioned data
● High-level operations (map, filter, collect, reduce, zip, join, sample, etc.)
● No side effects
● Fault recovery via lineage
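The properties listed above can be illustrated with a toy model in plain Scala. This is only a sketch of the idea, not how Spark is implemented: MiniRDD, parallelize and recompute are hypothetical names, everything runs in one JVM, and "lineage" is reduced to a chain of functions that can rebuild any partition on demand.

```scala
// Toy model of an RDD: a partition count plus a function that (re)computes one partition.
// MiniRDD is a hypothetical class for illustration, not a Spark type.
final class MiniRDD[A](val numPartitions: Int, compute: Int => Iterator[A]) {
  // Transformations are lazy: they only wrap the parent's compute function (the lineage).
  def map[B](f: A => B): MiniRDD[B] =
    new MiniRDD[B](numPartitions, p => compute(p).map(f))

  def filter(pred: A => Boolean): MiniRDD[A] =
    new MiniRDD[A](numPartitions, p => compute(p).filter(pred))

  // Actions trigger evaluation, partition by partition.
  def collect(): Vector[A] =
    (0 until numPartitions).flatMap(p => compute(p)).toVector

  def reduce(op: (A, A) => A): A = collect().reduce(op)

  // A lost partition is rebuilt by replaying the lineage for that partition only.
  def recompute(partition: Int): Vector[A] = compute(partition).toVector
}

object MiniRDD {
  // Split a local collection into at most `numPartitions` slices.
  def parallelize[A](data: Seq[A], numPartitions: Int): MiniRDD[A] = {
    val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    val slices = data.grouped(size).toVector
    new MiniRDD[A](numPartitions, p => slices.lift(p).getOrElse(Seq.empty[A]).iterator)
  }
}

object MiniRDDDemo {
  val numbers = MiniRDD.parallelize(1 to 10, numPartitions = 3)
  // Nothing is computed here: filter and map only extend the lineage.
  val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)
  // The action runs the whole chain.
  val total = evenSquares.reduce(_ + _)
}
```

Because every transformation is a pure function of its parent, recomputing a partition gives the same result as the first evaluation. That is what makes lineage-based recovery possible without replicating the data.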
Some operations on RDDs
Spark Ecosystem
MLlib
Machine Learning library within Spark:
● Provides an integrated predictive and data analysis workflow
● Broad collection of algorithms and applications
● Integrates with the whole Spark ecosystem
Three APIs: Scala, Java and Python
Algorithms in MLlib
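Example: Clustering via K-means
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse data: one point per line, coordinates separated by spaces
val data = sc.textFile("hdfs://bbgrl/dataset.txt")
val parsedData = data.map { x =>
  Vectors.dense(x.split(" ").map(_.toDouble))
}.cache()
// Cluster data into 5 classes using K-means, with at most 20 iterations
val clusters = KMeans.train(parsedData, 5, 20)
// Evaluate model error (sum of squared distances of points to their nearest center)
val cost = clusters.computeCost(parsedData)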
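For readers who want to see what the training call does conceptually, here is a minimal plain-Scala sketch of Lloyd's algorithm on in-memory points. It is not MLlib's implementation: the initial centers are passed in by hand as an assumption (MLlib chooses them itself), and the names KMeansSketch, kMeans and cost are made up for this illustration.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center nearest to p.
  def closest(centers: Vector[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Coordinate-wise mean of a non-empty set of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(_.sum / points.size).toVector

  // Lloyd's algorithm: assign each point to its nearest center, then move each center
  // to the mean of its assigned points, for a fixed number of iterations.
  def kMeans(points: Vector[Point], initial: Vector[Point], maxIterations: Int): Vector[Point] =
    (1 to maxIterations).foldLeft(initial) { (centers, _) =>
      val assigned = points.groupBy(p => closest(centers, p))
      // A center with no assigned point stays where it is.
      centers.indices.map(i => assigned.get(i).map(mean).getOrElse(centers(i))).toVector
    }

  // Sum of squared distances of points to their nearest center.
  def cost(points: Vector[Point], centers: Vector[Point]): Double =
    points.map(p => squaredDistance(centers(closest(centers, p)), p)).sum

  val points: Vector[Point] =
    Vector(Vector(1.0, 1.0), Vector(1.0, 2.0), Vector(9.0, 9.0), Vector(10.0, 9.0))
  val centers: Vector[Point] =
    kMeans(points, initial = Vector(Vector(0.0, 0.0), Vector(10.0, 10.0)), maxIterations = 20)
  val error: Double = cost(points, centers)
}
```

On a cluster, the assignment step is a map over the partitioned points and the update step is a reduce per center, which is why caching the parsed data pays off across iterations.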
Coming to Spark 1.2
● Ensembles of decision trees: Random Forests
● Boosting
● Topic modeling
● Streaming K-means
● A pipeline interface for machine learning workflows
A lot of contributions from the community
Machine Learning Pipeline
Typical machine learning workflows are complex!
Coming in next iterations of MLlib
● H2O is a fast (really fast) statistics, Machine Learning and maths engine on the JVM.
● Developed by 0xdata (a commercial entity), which focuses on bringing robust and highly performant machine learning algorithms to popular Big Data workloads.
● Has APIs in R, Java, Scala and Python, and integrates with third-party tools like Tableau and Excel.
Example in R 
library(h2o) 
localH2O = h2o.init(ip = 'localhost', port = 54321) 
irisPath = system.file("extdata", "iris.csv", package="h2o") 
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex") 
iris.data.frame <- as.data.frame(iris.hex) 
> colnames(iris.hex) 
[1] "C1" "C2" "C3" "C4" "C5" 
>
Simple Logistic Regression to predict prostate cancer outcomes:
> prostate.hex = h2o.importFile(localH2O, 
path="https://raw.github.com/0xdata/h2o/../prostate.csv", 
key = "prostate.hex") 
> prostate.glm = h2o.glm(y = "CAPSULE", x =c("AGE","RACE","PSA","DCAPS"), 
data = prostate.hex,family = "binomial", nfolds = 10, alpha = 0.5) 
> prostate.fit = h2o.predict(object=prostate.glm, newdata = prostate.hex)
> (prostate.fit) 
IP Address: 127.0.0.1 
Port : 54321 
Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9 
predict X0 X1 
1 0 0.7452267 0.2547732 
2 1 0.3969807 0.6030193 
3 1 0.4120950 0.5879050 
4 1 0.3726134 0.6273866 
5 1 0.6465137 0.3534863 
6 1 0.4331880 0.5668120
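The three columns of this output can be read as a predicted label plus the two class probabilities, which sum to one on every row. The plain-Scala sketch below shows how a logistic model produces such a row. The intercept, weights and threshold are hypothetical inputs, not values from the fitted model, and the threshold is left as a parameter because row 5 above is labelled 1 with X1 of about 0.35, which suggests the cut-off used there is not 0.5.

```scala
object LogisticSketch {
  // Logistic (sigmoid) function: maps a linear score to a probability in (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Returns (predicted label, P(class 0), P(class 1)) for one feature vector.
  // intercept, weights and threshold are illustrative inputs, not fitted values.
  def predict(intercept: Double,
              weights: Vector[Double],
              features: Vector[Double],
              threshold: Double): (Int, Double, Double) = {
    val z = intercept + weights.zip(features).map { case (w, x) => w * x }.sum
    val p1 = sigmoid(z)
    val label = if (p1 >= threshold) 1 else 0
    (label, 1.0 - p1, p1)
  }
}
```

For example, LogisticSketch.predict(0.0, Vector(1.0), Vector(0.0), threshold = 0.5) gives a linear score of zero and therefore equal probabilities for both classes.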
Sparkling Water 
Transparent use of H2O data and algorithms with the Spark API. 
Provides a custom RDD : H2ORDD
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable") // H2O methods
val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
val result = sql(query)
result.count
Same but with Spark API 
// H2O Context provides useful implicits for conversions
val h2oContext = new H2OContext(sc) 
import h2oContext._ 
// Create RDD wrapper around DataFrame 
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData) 
airlinesTable.count 
// And use Spark RDD API directly 
val flightsOnlyToSF = airlinesTable.filter(f => 
f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") 
) 
flightsOnlyToSF.count
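The filter above compares Option-typed fields, so rows with a missing destination simply do not match. Here is a minimal plain-Scala sketch of the same predicate on a local collection. The Flight case class is a hypothetical, trimmed-down stand-in for the Airlines record, and the Set-based test is one possible way to write the three comparisons.

```scala
object AirlinesFilterSketch {
  // Hypothetical stand-in for the Airlines record; the real class has many more fields.
  final case class Flight(origin: Option[String], dest: Option[String])

  val bayAreaAirports: Set[String] = Set("SFO", "SJC", "OAK")

  // Keep flights whose destination is present and is one of the three airports.
  def toBayArea(flights: Seq[Flight]): Seq[Flight] =
    flights.filter(f => f.dest.exists(bayAreaAirports.contains))

  val flights: Seq[Flight] = Seq(
    Flight(Some("ORD"), Some("SFO")),
    Flight(Some("JFK"), None),          // missing destination: never matches
    Flight(Some("SEA"), Some("OAK")),
    Flight(Some("SFO"), Some("LAX")))

  val matched: Seq[Flight] = toBayArea(flights)
}
```

The same function literal could be passed to the filter of an RDD, since the RDD API mirrors the Scala collections API for this operation.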
Build a model 
import hex.deeplearning._ 
import hex.deeplearning.DeepLearningModel.DeepLearningParameters 
val dlParams = new DeepLearningParameters() 
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
'FlightNum, 'TailNum, 'CRSElapsedTime,
'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name 
// Create a new model builder 
val dl = new DeepLearning(dlParams) 
val dlModel = dl.train.get
Predict 
// Use model to score data 
val prediction = dlModel.score(result)('predict)
// Collect predicted values via the RDD API 
val predictionValues = toRDD[DoubleHolder](prediction) 
.collect 
.map ( _.result.getOrElse("NaN") )
Slides: http://speakerdeck.com/samklr/
