19.4.2016 MachineLearning - Databricks
file:///Users/lhaferkamp/Downloads/MachineLearning.html 1/6
Machine Learning with Spark
Scikit Learn Cheat Sheet
Load basic dependencies
%run "/meetup/kickoff/connect_s3"
inputCsvDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/csv
ouputParquetDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/
import java.util.Base64
import java.nio.charset.StandardCharsets
encB64: (str: String)String
decB64: (str: String)String
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.hadoop.fs.FileStatus
listS3: (s3Path: String)Array[org.apache.hadoop.fs.FileStatus]
ls3: (s3FolderPath: String)Unit
rm3: (s3Path: String)Boolean
s3a://bigpicture-guild/nyctaxi/sample_1_month/csv/trip_data_and_fare.csv.gz [967.71 MiB]

Read the taxi data as a DataFrame from Parquet
import org.apache.spark.sql.functions.udf
// read Parquet files
val parquetTable = sqlContext.read.parquet(ouputParquetDir)
// tip_amount is stored as Float; add a Double copy for the ML stages
val toDouble = udf[Double, Float](_.toDouble)
val taxiData = parquetTable.withColumn("tip_amount_d", toDouble(parquetTable.col("tip_amount")))
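The dependency notebook's output lists two Base64 helpers, encB64 and decB64, but only their signatures are visible. A minimal sketch of what such helpers might look like, assuming UTF-8 text and the standard java.util.Base64 codec. The bodies and the wrapping object name are assumptions, not the notebook's actual code:

```scala
import java.util.Base64
import java.nio.charset.StandardCharsets

object B64Helpers {
  // Hypothetical body: encode the UTF-8 bytes of the string as Base64 text.
  def encB64(str: String): String =
    Base64.getEncoder.encodeToString(str.getBytes(StandardCharsets.UTF_8))

  // Hypothetical body: decode Base64 text back into a UTF-8 string.
  def decB64(str: String): String =
    new String(Base64.getDecoder.decode(str), StandardCharsets.UTF_8)
}
```

Decoding an encoded value returns the original string, for example B64Helpers.decB64(B64Helpers.encB64("nyctaxi")).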
taxiData.registerTempTable("ml_nyc_taxi")
%sql SELECT * FROM ml_nyc_taxi

medallion                         hack_license                      vendor_id  rate_code  store_and_fwd_flag  pickup_datetime
2D4B95E2FA7B2E85118EC5CA4570FA58  CD2F522EEE1FF5F5A8D8B679E23576B3  CMT        1          N                   2013-01-07T15:33:28.000+0000
0C5296F3C8B16E702F8F2E06F5106552  D2363240A9295EF570FC6069BC4F4C92  CMT        1          N                   2013-01-07T22:25:46.000+0000
312E0CB058D7FC1A6494EDB66D360CD2  7B5156F38990963332B33298C8BAE25E  CMT        1          N                   2013-01-05T11:54:49.000+0000
DD98E2C3AF5C47B4449F720ECC5778D4  79807332B275653A2473554C7328500A  CMT        1          N                   2013-01-02T06:58:08.000+0000
0B57B9633A2FECD3D3B1944AFC7471CF  CCD4367B417ED6634D986F573A552A62  CMT        1          N                   2013-01-07T14:46:55.000+0000
Showing the first 1000 rows.

Scatter plot for tip amount and fare amount
%sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50
[Scatter plot: tip_amount against fare_amount]
Showing sample based on the first 1000 rows.

Transformation of data with standard dataframe operations
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val toVec = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) }
val trainingData =
  taxiData
    .filter(toDouble(taxiData.col("tip_amount")) > 0.0)
    .withColumn("label", toDouble(taxiData.col("tip_amount")))
    .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))

The pipeline concept of Spark ML
A Pipeline chains Transformers and Estimators.
A fitted Estimator (a trained model) is itself a Transformer, so it can be used as a pipeline stage.
Pipelines make it easy to:
- train with different model parameters, e.g. for cross-validation
- train with different test and training data (train-validation split)
- repeat the transformation steps before estimation
Keep an eye on KeystoneML (http://keystone-ml.org), an ML pipeline framework on Spark with a richer set of operators.
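The relationship between Pipeline, Transformer and Estimator described above can be sketched in plain Scala. The traits, objects and function names below are toy stand-ins invented for illustration; they are not the Spark ML classes, and the "model" (a single tip-to-fare ratio) is deliberately trivial:

```scala
object PipelineSketch {
  // Toy stand-ins for the Spark ML abstractions (hypothetical, not the Spark API).
  type Row  = Map[String, Double]
  type Data = Seq[Row]

  sealed trait Stage
  trait Transformer extends Stage { def transform(data: Data): Data }
  trait Estimator   extends Stage { def fit(data: Data): Transformer }

  // A Transformer: keep only rows with a positive tip.
  object PositiveTips extends Transformer {
    def transform(data: Data): Data = data.filter(row => row("tip_amount") > 0.0)
  }

  // An Estimator: learns the overall tip-to-fare ratio. The fitted model is a Transformer.
  object TipRatioEstimator extends Estimator {
    def fit(data: Data): Transformer = {
      val ratio = data.map(row => row("tip_amount")).sum / data.map(row => row("fare_amount")).sum
      new Transformer {
        def transform(d: Data): Data =
          d.map(row => row + ("prediction" -> row("fare_amount") * ratio))
      }
    }
  }

  // Fitting walks the stages in order: Transformers are applied as they are,
  // Estimators are fitted on the data produced so far and replaced by their model.
  def fitPipeline(stages: Seq[Stage], training: Data): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], training)) {
      case ((done, data), t: Transformer) => (done :+ t, t.transform(data))
      case ((done, data), e: Estimator) =>
        val model = e.fit(data)
        (done :+ model, model.transform(data))
    }
    fitted
  }

  // The fitted pipeline replays every stage, in order, on new data.
  def runPipeline(fitted: Seq[Transformer], data: Data): Data =
    fitted.foldLeft(data)((d, stage) => stage.transform(d))
}
```

Because the fitted pipeline contains only Transformers, the same filtering step is replayed automatically on test data, which is the "repeat the transformation steps" benefit listed above.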
SQL transformer:
Select and filter the relevant data
import org.apache.spark.ml.feature.SQLTransformer
// __THIS__ is the SQLTransformer placeholder for the input DataFrame. Naming a registered
// table here instead would make the stage ignore the data passed to transform().
val taxiDataSelector = new SQLTransformer().setStatement(
  "SELECT tip_amount_d AS label, passenger_count, fare_amount FROM __THIS__ WHERE tip_amount_d > 0")
val selectedTaxiData = taxiDataSelector.transform(taxiData)

VectorAssembler:
Transform the data into labeled data as needed for ML estimators
import org.apache.spark.ml.feature.VectorAssembler
val trainingDataAssembler = new VectorAssembler()
  .setInputCols(Array("passenger_count", "fare_amount"))
  .setOutputCol("features")
val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
assembledTaxiData.select("label", "features").show()
+------------------+----------+
|             label|  features|
+------------------+----------+
|1.2000000476837158| [1.0,5.5]|
| 4.199999809265137|[1.0,20.5]|
| 5.900000095367432|[1.0,29.0]|
| 5.380000114440918|[1.0,21.0]|
| 1.399999976158142| [6.0,6.5]|
|               1.0| [1.0,5.0]|
|              1.25| [1.0,4.5]|
|               3.0|[6.0,26.0]|
|               1.0|[1.0,14.5]|
|1.2999999523162842| [1.0,6.5]|
| 1.899999976158142| [5.0,9.5]|
|1.6200000047683716| [1.0,6.5]|
| 1.899999976158142| [1.0,9.0]|
|               2.0|[1.0,22.0]|
|               6.0|[1.0,25.0]|
|3.5999999046325684|[1.0,17.5]|
|1.2000000476837158| [1.0,6.0]|
|               7.5|[1.0,24.5]|

Initialize the estimator
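The LinearRegression estimator in this notebook is regularised with regParam = 0.3 and elasticNetParam = 0.8. As a rough illustration of how the two parameters interact, the elastic-net penalty described in the Spark documentation has the shape regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), where alpha is elasticNetParam. The sketch below is plain Scala rather than Spark code, and the object and function names are made up for illustration. Spark also standardises features by default, so this only shows the form of the penalty, not the exact value Spark optimises:

```scala
object ElasticNetSketch {
  // Penalty added to the least-squares loss (illustrative helper, not a Spark API):
  //   regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  // alpha = 0 gives a pure L2 (ridge) penalty, alpha = 1 a pure L1 (lasso) penalty.
  def penalty(weights: Seq[Double], regParam: Double, alpha: Double): Double = {
    require(regParam >= 0.0, "regParam must be >= 0")
    require(alpha >= 0.0 && alpha <= 1.0, "alpha must be in [0, 1]")
    val l1 = weights.map(w => math.abs(w)).sum   // sum of absolute values
    val l2 = weights.map(w => w * w).sum         // sum of squares
    regParam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)
  }
}
```

With the notebook's setting of alpha = 0.8, most of the penalty comes from the L1 term, which tends to push small coefficients to exactly zero.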
import org.apache.spark.ml.regression.LinearRegression
// Create a LinearRegression instance. This instance is an Estimator.
val linearRegressionEstimator = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Print out the parameters, documentation, and any default values.
println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")

LinearRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )
import org.apache.spark.ml.regression.LinearRegression
linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd

Split the data into training and test set
val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)

Set up the transformation and estimation pipeline
import org.apache.spark.ml.{Pipeline, PipelineModel}
val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))

Use the pipeline to train the model
// Learn a LinearRegression model.
// val lrModel = linearRegressionEstimator.fit(trainingData)
val lrModel = pipeline.fit(trainingTaxiData)

Predict with the trained model on the test data
display(lrModel.transform(testTaxiData)
  .select("label", "prediction"))
[Scatter plot: prediction against label]
Showing sample based on the first 1000 rows.

How to get started with Spark ML
Set up your laptop (16+ GB RAM recommended)
mac$ brew install apache-spark
or get Databricks Community Edition Notebook (Wait List)
Get data
- Join an ML competition and get BIG data from Kaggle
- Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016
Visualize the data (Databricks or Zeppelin Notebook: https://zeppelin.incubator.apache.org/)
Throw some algorithms at it!
- have a coffee
- and maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
- read the Kaggle competition forums and blog
Graphs from the Panama Papers

Machinelearning Spark Hadoop User Group Munich Meetup 2016

  • 1. 19.4.2016 MachineLearning - Databricks file:///Users/lhaferkamp/Downloads/MachineLearning.html 1/6 Machine Learning with Spark Scikit Learn Cheat Sheet Load basic dependencies >  inputCsvDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/csv ouputParquetDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/ import java.util.Base64 import java.nio.charset.StandardCharsets encB64: (str: String)String decB64: (str: String)String import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.hadoop.conf.Configuration import java.net.URI import org.apache.hadoop.fs.FileStatus listS3: (s3Path: String)Array[org.apache.hadoop.fs.FileStatus] ls3: (s3FolderPath: String)Unit rm3: (s3Path: String)Boolean ? s3a://bigpicture-guild/nyctaxi/sample_1_month/csv/trip_data_and_fare.csv.gz [967.71 MiB] Read taxi data as dataframe from parquet >  %run "/meetup/kickoff/connect_s3" // read Parquet files val parquetTable= sqlContext.read.parquet(ouputParquetDir) val toDouble = udf[Double, Float]( _.toDouble) val taxiData = parquetTable.withColumn("tip_amount_d", toDouble(parquetTable.col("tip_amount"))) (http://databricks.com)  Import Notebook MachineLearning
  • 2. 19.4.2016 MachineLearning - Databricks file:///Users/lhaferkamp/Downloads/MachineLearning.html 2/6 >  >   Showing the first 1000 rows. 2D4B95E2FA7B2E85118EC5CA4570FA58 CD2F522EEE1FF5F5A8D8B679E23576B3 CMT 1 N 2013-01- 07T15:33:28.000+0000 0C5296F3C8B16E702F8F2E06F5106552 D2363240A9295EF570FC6069BC4F4C92 CMT 1 N 2013-01- 07T22:25:46.000+0000 312E0CB058D7FC1A6494EDB66D360CD2 7B5156F38990963332B33298C8BAE25E CMT 1 N 2013-01- 05T11:54:49.000+0000 DD98E2C3AF5C47B4449F720ECC5778D4 79807332B275653A2473554C7328500A CMT 1 N 2013-01- 02T06:58:08.000+0000 0B57B9633A2FECD3D3B1944AFC7471CF CCD4367B417ED6634D986F573A552A62 CMT 1 N 2013-01- 07T14:46:55.000+0000 Scatter plot for tip amount and fare amount >  500m 1.00 1.50 2.00 2.50 3.00 3.50 4.00 4.50 5.00 5.50 6.00 6.50 7.00 7.50 8.00 8.50 9.00 9.50 5.00 10.0 15.0 20.0 25.0 30.0 35.0 40.0 fare_amount tip_amount Showing sample based on the first 1000 rows. Transformation of data with standard dataframe operations >  The pipeline concept of Spark ML medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime taxiData.registerTempTable("ml_nyc_taxi") %sql SELECT * FROM ml_nyc_taxi %sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50 import org.apache.spark.mllib.linalg.{Vector, Vectors} val toVec   = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) } val trainingData = taxiData .filter(toDouble(taxiData.col("tip_amount")) > 0.0) .withColumn("label", toDouble(taxiData.col("tip_amount"))) .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))
  • 3.
    - A Pipeline chains Transformers and Estimators
    - A Transformer can also be the model produced by a previously trained Estimator
    - Important for easily training with different model parameters
    - e.g. for cross-validation with different test and training data (train-validation split), the transformation steps are repeated before estimation
    - Watch out for KeyStoneML (http://keystone-ml.org), an ML pipeline framework with a richer set of operators on Spark
    SQL transformer: Select and filter the relevant data
    > import org.apache.spark.ml.feature.SQLTransformer
      // __THIS__ is SQLTransformer's placeholder for the input DataFrame. Naming the
      // temp table ml_nyc_taxi here would make the stage ignore its input and always
      // read the full table, including when the pipeline is fit on a training split.
      val taxiDataSelector = new SQLTransformer().setStatement(
        "SELECT tip_amount_d AS label, passenger_count, fare_amount FROM __THIS__ WHERE tip_amount_d > 0")
      val selectedTaxiData = taxiDataSelector.transform(taxiData)
    VectorAssembler: Transform the data into labeled data as needed for ML estimators
    > import org.apache.spark.ml.feature.VectorAssembler
      import org.apache.spark.mllib.linalg.Vectors
      // combine the two input columns into a single "features" vector column
      val trainingDataAssembler = new VectorAssembler()
        .setInputCols(Array("passenger_count", "fare_amount"))
        .setOutputCol("features")
      val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
      assembledTaxiData.select("label", "features").show()
    +------------------+----------+
    |             label|  features|
    +------------------+----------+
    |1.2000000476837158| [1.0,5.5]|
    | 4.199999809265137|[1.0,20.5]|
    | 5.900000095367432|[1.0,29.0]|
    | 5.380000114440918|[1.0,21.0]|
    | 1.399999976158142| [6.0,6.5]|
    |               1.0| [1.0,5.0]|
    |              1.25| [1.0,4.5]|
    |               3.0|[6.0,26.0]|
    |               1.0|[1.0,14.5]|
    |1.2999999523162842| [1.0,6.5]|
    | 1.899999976158142| [5.0,9.5]|
    |1.6200000047683716| [1.0,6.5]|
    | 1.899999976158142| [1.0,9.0]|
    |               2.0|[1.0,22.0]|
    |               6.0|[1.0,25.0]|
    |3.5999999046325684|[1.0,17.5]|
    |1.2000000476837158| [1.0,6.0]|
    |               7.5|[1.0,24.5]|
    Initialize the estimator
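The LinearRegression estimator in this notebook is configured with regParam = 0.3 and elasticNetParam = 0.8. As a rough worked example of how the two settings combine, the sketch below assumes the elastic-net form given in the Spark ML documentation for linear models, regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2), where alpha is elasticNetParam. The function name elasticNetPenalty and the coefficient values are hypothetical, and Spark's own fit also depends on feature standardization, so this is only the penalty term in isolation.

```scala
// Elastic-net penalty: regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2).
// alpha = 0 gives a pure L2 (ridge) penalty, alpha = 1 a pure L1 (lasso) penalty.
def elasticNetPenalty(weights: Seq[Double], regParam: Double, alpha: Double): Double = {
  val l1 = weights.map(w => math.abs(w)).sum
  val l2Squared = weights.map(w => w * w).sum
  regParam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
}

// Hypothetical coefficients for passenger_count and fare_amount.
val weights = Seq(0.5, -2.0)

val pureL2 = elasticNetPenalty(weights, regParam = 0.3, alpha = 0.0)
val pureL1 = elasticNetPenalty(weights, regParam = 0.3, alpha = 1.0)
val mixed  = elasticNetPenalty(weights, regParam = 0.3, alpha = 0.8)
```

With alpha = 0.8 the penalty is dominated by the L1 term, which tends to push small coefficients to exactly zero.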
  • 4.
    > import org.apache.spark.ml.regression.LinearRegression
      // Create a LinearRegression instance. This instance is an Estimator.
      val linearRegressionEstimator = new LinearRegression()
        .setMaxIter(10)
        .setRegParam(0.3)
        .setElasticNetParam(0.8)
      // Print out the parameters, documentation, and any default values.
      println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")
    LinearRegression parameters:
    elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
    featuresCol: features column name (default: features)
    fitIntercept: whether to fit an intercept term (default: true)
    labelCol: label column name (default: label)
    maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
    predictionCol: prediction column name (default: prediction)
    regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
    solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
    standardization: whether to standardize the training features before fitting the model (default: true)
    tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
    weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )
    import org.apache.spark.ml.regression.LinearRegression
    linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd
    Split the data into training and test set
    > // 90% training, 10% test; the fixed seed makes the split reproducible
      val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)
    Set up the transformation and estimation PIPELINE
    > import org.apache.spark.ml.{Pipeline, PipelineModel}
      val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))
    Use the pipeline to train the model
    > // Learn a LinearRegression model.
      // val lrModel = linearRegressionEstimator.fit(trainingData)
      val lrModel = pipeline.fit(trainingTaxiData)
    Predict with the trained model on the test data
    > display(lrModel.transform(testTaxiData)
        .select("label", "prediction"))
    [Scatter plot of prediction and label. Showing sample based on the first 1000 rows.]
    How to get started with Spark ML
    Set up your laptop (16+ GB RAM recommended)
  • 5.
    mac$ brew install apache-spark
    or get Databricks Community Edition Notebook (Wait List)
    Get data
    - Join an ML competition and get BIG data from Kaggle
    - Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016
    Visualize the data (Databricks or Zeppelin Notebook: https://zeppelin.incubator.apache.org/)
    Throw some algorithms at it!
    - have a coffee and maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
    - read the Kaggle competition forums and blog
    Graphs from the Panama Papers