Yung-Chuan Lee
2016.12.18
1
2
Laws [Data applications] are like sausages. It is better not to see them being made.
—Otto von Bismarck
! Spark
◦
● Spark
● Scala
● RDD
! LAB
◦ ~
● Spark Scala IDE
! Spark MLlib
◦ …
● Scala + lambda + Spark MLlib
● Clustering Classification Regression
3
! github page: https://github.com/yclee0418/sparkTeach
◦ installation: Spark
◦ codeSample: Spark
● exercise -
● https://github.com/yclee0418/sparkTeach/tree/master/codeSample/exercise
● final -
● https://github.com/yclee0418/sparkTeach/tree/master/codeSample/final
4
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
5
Outline
!
◦ 2020 44ZB(IDC 2013~2014)
◦
!
◦ MapReduce by Google(2004)
◦ Hadoop HDFS MapReduce by
Yahoo!(2005)
◦ Spark
Hadoop 10~1000 by AMPLab (2013)
! [ ]Spark
Hadoop
6
–
!
AMPLab
!
! API
◦ Java Scala Python R
! One Stack to rule them all
◦ SQL Streaming
◦ RDD
7
Spark
! Cluster Manager
◦ Standalone – Spark Manager
◦ Apache Mesos
◦ Hadoop YARN
8
Spark
! [exercise]Spark
◦ JDK 1.8
◦ spark-2.0.1.tgz(http://spark.apache.org/downloads.html)
◦ Terminal (for Mac)
● cd /Users/xxxxx/Downloads ( )
● tar -xvzf spark-2.0.1.tgz ( )
● sudo mv spark-2.0.1 /usr/local/spark (spark /usr/local)
● cd /usr/local/spark
● ./build/sbt package( spark 1 )
● ./bin/spark-shell ( start the Spark shell; pwd should be /usr/local/spark)
9
Spark (2.0.1)
[Tips]
https://goo.gl/oxNbIX
./bin/run-example org.apache.spark.examples.SparkPi
! Spark Shell Spark command line
◦ Spark
! spark-shell
◦ [ ] run bin/spark-shell from the Spark install directory
!
◦ var res1: Int = 3 + 5
◦ import org.apache.spark.rdd._
◦ val intRdd: RDD[Int]=sc.parallelize(List(1,2,3,4,5))
◦ intRdd.collect
◦ val txtRdd=sc.textFile("file:///usr/local/spark/README.md")
◦ txtRdd.count
! spark-shell
◦ [ ] type :quit or press Ctrl+D
10
Spark Shell
Spark
Scala
[Tips]:
➢ var val ?
➢ intRdd txtRdd ?
➢ org. [Tab] ?
➢ http://localhost:4040
! Spark
! RDD(Resilient Distributed Dataset)
! Scala
! Spark MLlib
11
Outline
! Google
! Map Reduce
! MapReduce
◦ Map: (K1, V1) → list(K2, V2)
◦ Reduce: (K2, list(V2)) → list(K3, V3)
! ( Word Count )
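A minimal sketch of how Word Count fits this pattern (illustrative Scala, not tied to any particular MapReduce framework): K1/V1 are the line offset and line text, K2/V2 are a word and the count 1, and K3/V3 are a word and its total count.
def wcMap(offset: Long, line: String): List[(String, Int)] =
  line.split(" ").map(word => (word, 1)).toList // (K1,V1) -> list(K2,V2)
def wcReduce(word: String, counts: List[Int]): (String, Int) =
  (word, counts.sum) // (K2,list(V2)) -> (K3,V3)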
12
RDD MapReduce
! MapReduce on Hadoop
Word Count …
◦ iteration iteration ( )
…
13
Hadoop …
HDFS
! Spark – RDD(Resilient Distribute Datasets)
◦ In-Memory Data Processing and Sharing
◦ (tolerant) (efficient)
!
◦ (lineage) – RDD
◦ lineage
!
◦ Transformations: In memory lazy lineage RDD
◦ Action: return Storage
◦ Persistence: RDD
14
Spark …
: 1+2+3+4+5 = 15
Transformation Action
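A minimal spark-shell sketch of this 1+2+3+4+5 example: map is a transformation (lazy, only recorded in the lineage), while reduce is an action that actually triggers the computation.
val nums = sc.parallelize(List(1, 2, 3, 4, 5)) // RDD[Int]
val plusOne = nums.map(_ + 1)                  // transformation: nothing executes yet
val total = nums.reduce(_ + _)                 // action: runs the job, total = 15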
15
RDD
RDD Ref: http://spark.apache.org/docs/latest/programming-guide.html#transformations
! SparkContext.textFile – RDD
! map: RDD RDD
! filter: RDD RDD
! reduceByKey: RDD Key
RDD Key
! groupByKey: RDD Key RDD
! join cogroup: RDD Key
RDD
! sortBy reverse: RDD
! take(N): RDD N RDD
! saveAsTextFile: RDD
16
RDD
! count: RDD
! collect: RDD Collection(Seq
! head(N): RDD N
! mkString: Collection
17
[Tips]
•
• Transformation
! [Exercise] spark-shell
◦val intRDD = sc.parallelize(List(1,2,3,4,5,6,7,8,9,0))
◦intRDD.map(x => x + 1).collect()
◦intRDD.filter(x => x > 5).collect()
◦intRDD.stats
◦val mapRDD=intRDD.map{x=>("g"+(x%3), x)}
◦mapRDD.groupByKey.foreach{x=>println("key: %s,
vals=%s".format(x._1, x._2.mkString(",")))}
◦mapRDD.reduceByKey(_+_).foreach(println)
◦mapRDD.reduceByKey{case(a,b) => a+b}.foreach(println)
18
RDD
! [Exercise] (The Gettysburg Address)
◦ (The Gettysburg Address)(https://
docs.google.com/file/d/0B5ioqs2Bs0AnZ1Z1TWJET2NuQlU/
view) gettysburg.txt
◦ gettysburg.txt ( )
●
◦
◦
◦
19
RDD (Word Count )
sc.textFile flatMap split
toLowerCase, filter
sortBy foreach
https://github.com/yclee0418/sparkTeach/blob/master/codeSample/exercise/
WordCount_Rdd.txt
take(5) foreach
reduceByKey
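A minimal spark-shell sketch of the chain described above, assuming gettysburg.txt sits in the current working directory (the full version is in WordCount_Rdd.txt at the link above):
sc.textFile("gettysburg.txt")
  .flatMap(_.split(" "))      // split each line into words
  .map(_.toLowerCase.trim)    // normalize case
  .filter(_.nonEmpty)         // drop empty tokens
  .map(word => (word, 1))     // pair every word with 1
  .reduceByKey(_ + _)         // sum the counts per word
  .sortBy(_._2, false)        // sort by count, descending
  .take(5).foreach(println)   // print the five most frequent words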
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
20
Outline
! Scala Scalable Language ( )
! Scala
◦ lambda expression
Scala
Scala: List(1,2,3,4,5).foreach(x=>println("item %d".format(x)))
Java:
int[] intArr = new int[] {1,2,3,4,5};
for (int x: intArr) System.out.println(String.format("item %d", x));
! scala Java .NET
! ( actor model akka)
! Spark
! import
◦ import org.apache.spark.SparkContext
◦ import org.apache.spark.rdd._ ( rdd class)
◦ import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } (
clustering class)
!
◦ val int1: Int = 5 ( error)
◦ var int2: Int = 5 ( )
◦ val int = 5 ( )
! ( )
◦ def voidFunc(param1: Type, param2: Type2) = { … }
22
Scala
def setLogger = {
Logger.getLogger("com").setLevel(Level.OFF)
Logger.getLogger("io").setLevel(Level.OFF)
}
! ( )
◦ def rtnFunc1(param1: Type, param2: Type2): Type3 = {
val v1:Type3 = …
v1 //
}
! ( )
◦ def rtnFunc2(param1: Type, param2: Type2): (Type3, Type4) = {
val v1: Type3 = …
val v2: Type4= …
(v1, v2)
//
}
23
Scala
def getMinMax(intArr: Array[Int]):(Int,Int) = {
val min=intArr.min
val max=intArr.max
(min, max)
}
!
◦ val res = rtnFunc1(param1, param2) ( res
)
◦ val (res1, res2) = rtnFunc2(param1, param2) (
res1,res2 )
◦ val (_, res2) = rtnFunc2(param1, param2) (
)
! For Loop
◦ for (i <- collection) { … }
! For Loop ( yield )
◦ val rtnArr = for (i <- collection) yield { … }
24
Scala
val intArr = Array(1,2,3,4,5,6,7,8,9)
val multiArr=
for (i <- intArr; j <- intArr)
yield { i*j }
//multiArr 81 99
val (min,max)=getMinMax(intArr)
val (_, max)=getMinMax(intArr)
! Tuple
◦ Tuple
◦ val v=(v1,v2,v3...) v._1, v._2, v._3…
◦ lambda
◦ lambda (_)
25
Scala val intArr = Array(1,2,3,4,5,7,8,9)
val res=getMinMax(intArr) //res=(1,9)=>tuple
val min=res._1 // res
val max=res._2 // res
val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple
val intArr2=intArr.map(x=> (x._1 * x._2 * x._3))
//intArr2: Array[Int] = Array(6, 120, 504)
val intArr3=intArr.filter(x=> (x._1 + x._2 > x._3))
//intArr3: Array[(Int, Int, Int)] = Array((4,5,6), (7,8,9))
val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple
def getThird(x:(Int,Int,Int)): Int = { (x._3) }
val intArr2=intArr.map(getThird(_))
val intArr2=intArr.map(x=>getThird(x)) //
//intArr2: Array[Int] = Array(3, 6, 9)
! Class
◦ Scala Class JAVA Class
● private /
protected public
● Class
26
Scala
Scala:
class Person(userID: Int, name: String) // private
class Person(val userID: Int, var name: String)
// public userID
val person = new Person(102, "John Smith") //
person.userID // 102
Person class Java :
public class Person {
private final int userID;
private final String name;
public Person(int userID, String name) {
this.userID = userID;
this.name = name;
}}
! Object
◦ Scala static
instance
◦ Scala Object static
● Scala Object singleton class instance
! Scala Object vs Class
◦ object utility Spark Driver Program
◦ class Entity
27
Scala
Scala Object:
object Utility {
def isNumeric(input: String): Boolean = input.trim()
.matches("""[+-]?((\d+(e\d+)?[lL]?)|(((\d+(\.\d*)?)|(\.\d+))(e\d+)?[fF]?))""")
def toDouble(input: String): Double = {
val rtn = if (input.isEmpty() || !isNumeric(input)) Double.NaN else input.toDouble
rtn
}}
val d = Utility.toDouble("20") // called directly on the object, no new needed
!
◦ val intArr = Array(1,2,3,4,5,7,8,9)
!
◦ val intArrExtra = intArr ++ Array(0,11,12)
! map:
! filter:
! join: Map Key Map
! sortBy reverse:
! take(N): N
28
scala
val intArr = Array(1,2,3,4,5,6,7,8,9)
val intArr2=intArr.map(_ * 2)
//intArr2: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
val intArr3=intArr.filter(_ > 5)
//intArr3: Array[Int] = Array(6, 7, 8, 9)
val intArr4=intArr.reverse
//intArr4: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
! sum:
◦ val sum = Array(1,2,3,4,5,6,7,8,9).sum
! max:
◦ val max = Array(1,2,3,4,5,6,7,8,9).max
! min:
◦ val min = Array(1,2,3,4,5,6,7,8,9).min
! distinct:
29
scala
val intArr = Array(1,2,3,4,5,6,7,8,9)
val sum = intArr.sum
//sum = 45
val max = intArr.max
//max = 9
val min = intArr.min
//min = 1
val disc = Array(1,1,1,2,2,2,3,3).distinct
//disc = Array(1, 2, 3)
! spark-shell
! ScalaIDE for eclipse 4.4.1
◦ http://scala-ide.org/download/sdk.html
◦
◦ ( )
◦
◦ ScalaIDE
30
(IDE)
! Driver Program(word complete breakpoint
)
! spark-shell jar
!
◦Eclipse 4.4.2 (Luna)
◦ Scala IDE 4.4.1
◦ Scala 2.11.8 and Scala 2.10.6
◦ Sbt 0.13.8
◦ Scala Worksheet 0.4.0
◦ Play Framework support 0.6.0
◦ ScalaTest support 2.10.0
◦ Scala Refactoring 0.10.0
◦ Scala Search 0.3.0
◦ Access to the full Scala IDE ecosystem
31
Scala IDE for eclipse
! Scala IDE Driver Program
◦ Scala Project
◦ Build Path
● Spark
● Scala
◦ package
● package ( )
◦ scala object
◦
◦ debug
◦ Jar
◦ spark-submit Spark
32
Scala IDE Driver Program
! Scala IDE
◦ FILE -> NEW ->
Scala Project
◦ project
FirstScalaProj
◦ JRE 1.8 (1.7 )
◦
◦ Finish
33
Scala Project
! Package Explorer Project Explorer
FirstScalaProj Build Path ->
Configure Build Path
34
Build Path
[Tips]:
Q: Package Project Explorer
A:
! Scala perspective
! Scala perspective
-> Window -> Show View
! Spark Driver Program Build Path
◦ jar
◦ Scala Library Container 2.11.8(IDE 2.11.8 )
! Configure Build Path Java Build Path Libraries -
>Add External JARs…
◦Spark Jar Spark /assembly/target/scala-2.11/jars/
◦ jar
! Java Build Path Scala Library Container 2.11.8
35
Build Path
! Package Explorer FirstScalaProj src
package
◦ src ->New->Package( Ctrl N)
◦ bikeSharing Package
! FirstScalaProj data (Folder) input
36
Package
! (gettysburg.txt)copy data
! bikeSharing Package Scala Object
BikeCountSort
!
37
Scala Object
package practice1
//spark lib
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd._
//log
import org.apache.log4j.Logger
import org.apache.log4j.Level
object WordCount {
def main(args: Array[String]): Unit = {
// keep the console Log quiet
Logger.getLogger("org").setLevel(Level.ERROR) //mark for MLlib INFO msg
val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))
val rawRdd = sc.textFile("data/gettysburg.txt").flatMap { x => x.split(" ") }
// normalize to lower case and filter out empty tokens
val txtRdd = rawRdd.map { x => x.toLowerCase.trim }.filter { x => !x.equals("") }
val countRdd = txtRdd.map { x => (x, 1) } // map each word to (word, 1)
val resultRdd = countRdd.reduceByKey { case (a, b) => a + b } // sum counts with reduceByKey
val sortResRdd = resultRdd.sortBy((x => x._2), false) // sort by count, descending
sortResRdd.take(5).foreach { println } // print the top 5 words
sortResRdd.saveAsTextFile("data/wc_output")
}
}
38
WordCount
import
Library
object main
saveAsTextFile
! word complete: press Alt+/ to trigger word completion
! ( tuple
)
39
IDE
! debug configuration
◦ icon Debug
Configurations
◦ Scala Application
Debug
● Name WordCount
● Project FirstScalaProj
● Main Class
practice1.WordCount
◦ Launcher
40
Debug Configuration
! icon Debug Configuration
Debug console
41
[Tips]
• data/output sortResRdd ( part-xxxx )
• Log Level console
• output
! Spark-Submit JAR
◦ Package Explorer FirstScalaProj -
>Export...->Java/JAR file-> FirstScalaProj src
JAR File
42
JAR
! input output JAR File
◦ data JAR
File
43
Spark-submit
! spark-submit
44
Spark-submit
1.submit
2. launch
works
3. return status
! Command Line JAR File
! Spark-submit
./bin/spark-submit
--class <main-class> (package scala object )
--master <master-url> ( master URL local[Worker thread num])
--deploy-mode <deploy-mode> ( Worker Cluster Client Client)
--conf <key>=<value> ( Spark )
... # other options
<application-jar> (JAR )
[application-arguments] ( Driver main )
45
Spark-submit submit JOB
Spark /bin/spark-submit --class practice1.WordCount --master local[*] WordCount.jar
[Tips]:
! spark-submit JAR data
! merge output
◦ linux: cat data/output/part-* > res.txt
◦ windows: type data\output\part-* > res.txt
! Exercise wordCount Package WordCount2
Object
◦ gettysburg.txt ( )
●
◦
● Hint1: (index)
● val posRdd=txtRdd.zipWithIndex()
● Hint2: reduceByKey groupByKey
index
46
Word Count
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
47
Outline
!
◦
◦
!
◦
◦
48
Tasks Experience Performance
!
!
!
!
!
!
!
!
! DNA
!
!
49
! (Supervised learning)
◦ (Training Set)
◦ (Features)
(Label)
◦ Regression Classification (
)
50
http://en.proft.me/media/science/ml_svlw.jpg
! (Unsupervised learning)
◦ (
Label)
◦
◦ Clustering ( KMeans)
51http://www.cnblogs.com/shishanyuan/p/4747761.html
! MLlib Machine Learning library Spark
!
◦ RDD
◦
52
Spark MLlib
http://www.cnblogs.com/shishanyuan/p/4747761.html
53
Spark MLlib
https://www.safaribooksonline.com/library/view/spark-for-python/9781784399696/graphics/B03986_04_02.jpg
! Bike Sharing Dataset (
)
! https://archive.ics.uci.edu/ml/datasets/
Bike+Sharing+Dataset
◦
● hour.csv: 2011.01.01~2012.12.30
17,379
● day.csv: hour.csv
54
Spark MLlib Let’s biking
55
Bike Sharing Dataset
Features
Label
(for hour.csv only)
(0 to 6)
(1 to 4)
!
◦ (Summary Statistics):
MultivariateStatisticalSummary Statistics
◦ Feature ( ) Label ( )
(correlation) Statistics
!
◦ Clustering KMeans
!
◦ Classification Decision Tree LogisticRegressionWithSGD
!
◦ Regression Decision Tree LinearRegressionWithSGD
56
Spark MLlib
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
57
Outline
58
! (Summary Statistics)
◦
◦
◦Spark
● 1: RDD[Double/
Float/Int] RDD stats
● 2: RDD[Vector]
Statistics.colStats
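A minimal spark-shell sketch of the two approaches (the numbers are made up for illustration):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// 1: a numeric RDD exposes stats (count, mean, stdev, max, min)
val cnts = sc.parallelize(Seq(16.0, 40.0, 32.0, 13.0))
println(cnts.stats)
// 2: an RDD[Vector] is summarized column by column with Statistics.colStats
val rows = sc.parallelize(Seq(Vectors.dense(0.3, 81.0), Vectors.dense(0.5, 40.0)))
val summary = Statistics.colStats(rows)
println(summary.mean)     // per-column mean
println(summary.variance) // per-column variance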
59
! (correlation)
◦ (Correlation )
◦ Spark Pearson Spearman
◦ r Statistics.corr
● 0 < | r | < 0.3 ( )
● 0.3 <= | r | < 0.7 ( )
● 0.7 <= | r | < 1 ( )
● r = 1 ( )
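For reference, the Pearson coefficient computed by Statistics.corr by default is:
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}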
60
A. Scala
B. Package Scala Object
C. data Folder
D. Library
! ScalaIDE Scala folder package Object
◦ SummaryStat ( )
● src
● bike (package )
● BikeSummary (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
61
62
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import Statistics library
import org.apache.spark.mllib.stat.{ MultivariateStatisticalSummary,
Statistics }
object BikeSummary {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]"))
}
}
! spark-shell sparkContext sc
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
63
64
! prepare
◦ input file Features
Label
RDD
! lines.map features( 3~14 ) label( 17 ) RDD
! RDD
:
◦ RDD[Array]
◦ RDD[Tuple]
◦ RDD[BikeShareEntity]
prepare
def prepare(sc: SparkContext): RDD[???] = {
val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder
val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) =>
{ if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name)
val lines:RDD[Array[String]] = rawDataNoHead.map { x =>
x.split(",").map { x => x.trim() } } //split columns with comma
val bikeData:RDD[???]=lines.map{ … } //??? depends on your impl
}
65
! RDD[Array]:
◦ val bikeData:RDD[Array[Double]] =lines.
map{x=>(x.slice(3,13).map(_.toDouble) ++ Array(x(16).toDouble))}
◦ Pros/cons: prepare is easy to implement, but painful to use later (you must
remember each field's index in the Array) and error-prone
! RDD[Tuple]:
◦ val bikeData:RDD[(Double, Double, Double, …, Double)]
=lines.map{case(season,yr,mnth,…,cnt)=>(season.toDouble, yr.toDouble,
mnth.toDouble,…cnt.toDouble)}
◦ Pros/cons: prepare is harder to implement and still awkward to use later, but less
error-prone (well-named variables can receive the returned values)
◦ Example: val features = bikeData.map{case(season,yr,mnth,…,cnt)=> (season, yr,
mnth, …, windspeed)}
66
! RDD[Case Class]:
◦ val bikeData:RDD[BikeShareEntity] = lines.map{ x=> BikeShareEntity(⋯)}
◦ Pros/cons: prepare is painful to implement but pleasant to use later (you work with
entity objects, no field positions to track, better abstraction) and hard to get wrong
◦ Example: val labelRdd = bikeData.map{ ent => { ent.label }}
Case Class Class
case class BikeShareEntity(instant: String,dteday:String,season:Double,
yr:Double,mnth:Double,hr:Double,holiday:Double,weekday:Double,
workingday:Double,weathersit:Double,temp:Double,
atemp:Double,hum:Double,windspeed:Double,casual:Double,
registered:Double,cnt:Double)
67
map RDD[BikeShareEntity]
val bikeData = rawData.map { x =>
BikeShareEntity(x(0), x(1), x(2).toDouble, x(3).toDouble,x(4).toDouble,
x(5).toDouble, x(6).toDouble,x(7).toDouble,x(8).toDouble,
x(9).toDouble,x(10).toDouble,x(11).toDouble,x(12).toDouble,
x(13).toDouble,x(14).toDouble,x(15).toDouble,x(16).toDouble) }
68
! (Class)
! prepare
◦ input file Features
Label
RDD
Entity Class
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import Statistics library
import org.apache.spark.mllib.stat.
{ MultivariateStatisticalSummary, Statistics }
object BikeSummary {
case class BikeShareEntity(⋯⋯)
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new
SparkConf().setAppName("BikeSummary").setMaster("local[*]"))
}
}
69
70
! getFeatures
◦
! printSummary
◦ console
! printCorrelation
◦
console
printSummary
def printSummary(entRdd: RDD[BikeShareEntity]) = {
val dvRdd = entRdd.map { x => Vectors.dense(getFeatures(x)) } // convert to
RDD[Vector]
// compute the summary statistics with Statistics.colStats
val summaryAll = Statistics.colStats(dvRdd)
println("mean: " + summaryAll.mean.toArray.mkString(",")) // per-column mean
println("variance: " + summaryAll.variance.toArray.mkString(",")) // per-column variance
}
71
getFeatures
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
//
val featureArr = Array(bikeData.casual, bikeData.registered,bikeData.cnt)
featureArr
}
72
printCorrelation
def printCorrelation(entRdd: RDD[BikeShareEntity]) = {
// extract each column as an RDD[Double]
val cntRdd = entRdd.map { x => x.cnt }
val yrRdd = entRdd.map { x => x.yr }
val yrCorr = Statistics.corr(yrRdd, cntRdd) // correlation between yr and cnt
println("correlation: %s vs %s: %f".format("yr", "cnt", yrCorr))
val seaRdd = entRdd.map { x => x.season } // season column
val seaCorr = Statistics.corr(seaRdd, cntRdd)
println("correlation: %s vs %s: %f".format("season", "cnt", seaCorr))
}
A.
◦ BikeSummary.scala SummaryStat
◦ hour.csv data
◦ BikeSummary ( TODO
B.
◦ getFeatures printSummary
● console (temp) (hum) (windspeed)
● yr mnth (temp) (hum)
(windspeed) (cnt) console
73
for (yr <- 0 to 1)
for (mnth <- 1 to 12) {
val yrMnRdd = entRdd.filter { ??? }.map { x => Vectors.dense(getFeatures(x)) }
val summaryYrMn = Statistics.colStats( ??? )
println("====== summary yr=%d, mnth=%d ==========".format(yr,mnth))
println("mean: " + ???)
println("variance: " + ???)
}
A.
◦ BikeSummary printCorrelation
◦ hour.csv [yr~windspeed] cnt
console
B. feature
◦ printCorrelation
● yr mnth feature( yrmo yrmo=yr*12+mnth)
yrmo cnt
74
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
75
Outline
! Traing Set
( Label)
! (cluster)
!
!
◦
76
Clustering
!
! Given observations (x1,x2,...,xn), K-Means partitions the n observations into K clusters
(k≤n) so as to minimize the within-cluster sum of squares (WCSS)
!
A. K
B. K
C. ( )
D. B C
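The objective these iterations minimize can be written as follows, where S_j is cluster j and \mu_j its center:
\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x \in S_j} \lVert x - \mu_j \rVert^{2}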
77
K-Means
iteration
RUN
78
K-Means
ref: http://mropengate.blogspot.tw/2015/06/ai-ch16-5-k-introduction-to-clustering.html
! KMeans.train Model(KMeansModel
◦ val model=KMeans.train(data, numClusters, maxIterations,
runs)
● data (RDD[Vector])
● numClusters (K)
● maxIterations run Iteration
iteration maxIterations model
● runs KMeans run
model
! model.clusterCenters Feature
! model.computeCost WCSS model
79
K-Means in Spark MLlib
80
K-Means BikeSharing
! hour.csv KMeans
console
◦ Features yr, season, mnth, hr, holiday, weekday,
workingday, weathersit, temp, atemp, hum,
windspeed,cnt( cnt Label Feature )
◦ numClusters 5 ( 5 )
◦ maxIterations 20 ( run 20 iteration)
◦ runs 3 3 Run model)
81
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
K
! ScalaIDE Scala folder package Object
◦ Clustering ( )
● src
● bike (package )
● BikeShareClustering (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
82
83
A. import
B. main Driver Program
C. Log
D. SparkContext
Model
Model
K
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import KMeans library
import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel }
object BikeShareClustering {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]"))
}
}
! KMeans Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
84
85
! (Class)
! prepare
◦ input file Features
Label
RDD
! BikeSummary
Model
Model
K
86
! getFeatures
◦
! KMeans
◦ KMeans.train KMeansModel
! getDisplayString
◦
Model
Model
K
getFeatures getDisplayString
getFeatures
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
val featureArr = Array(bikeData.cnt, bikeData.yr, bikeData.season,
bikeData.mnth, bikeData.hr, bikeData.holiday, bikeData.weekday,
bikeData.workingday, bikeData.weathersit, bikeData.temp,
bikeData.atemp, bikeData.hum, bikeData.windspeed, bikeData.casual,
bikeData.registered)
featureArr
}
87
getDisplayString
def getDisplayString(centers:Array[Double]): String = {
val dispStr = """cnt: %.5f, yr: %.5f, season: %.5f, mnth: %.5f, hr: %.5f,
holiday: %.5f, weekday: %.5f, workingday: %.5f, weathersit: %.5f, temp:
%.5f, atemp: %.5f, hum: %.5f,windspeed: %.5f, casual: %.5f, registered:
%.5f"""
.format(centers(0), centers(1),centers(2), centers(3),centers(4),
centers(5),centers(6), centers(7),centers(8), centers(9),centers(10),
centers(11),centers(12), centers(13),centers(14))
dispStr
}
KMeans
// convert the features into an RDD[Vector]
val featureRdd = bikeData.map { x =>
Vectors.dense(getFeatures(x)) }
val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations, 3 runs
88
var clusterIdx = 0
model.clusterCenters.sortBy { x => x(0) }.foreach { x => {
println("center of cluster %d\n%s".format(clusterIdx,
getDisplayString(x.toArray)))
clusterIdx += 1
} } // print the centers sorted by cnt
89
//K-Means
import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel }
object BikeShareClustering {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
// initialize the SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]"))
println("============== preparing data ==================")
val bikeData = prepare(sc) // read hour.csv into an RDD[BikeShareEntity]
bikeData.persist()
println("============== clustering by KMeans ==================")
// convert the features into an RDD[Vector]
val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) }
val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations, 3 runs
var clusterIdx = 0
model.clusterCenters.sortBy { x => x(0) }.foreach { x => {
println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray)))
clusterIdx += 1
} } // print the centers sorted by cnt
bikeData.unpersist()
}
}
90
! yr season mnth hr cnt
! weathersit cnt ( )
! temp atemp cnt ( )
! hum cnt ( )
! correlation
91
! K Model
WCSS
! WCSS K
Model
Model
K
! model.computeCost WCSS model
(WCSS )
! numClusters WCSS (K)
! WCSS
92
K-Means
println("============== tuning parameters ==================")
for (k <- Array(5, 10, 15, 20, 25)) {
// try different numClusters values and print the WCSS for each
val iterations = 20
val tm = KMeans.train(featureRdd, k, iterations, 3)
println("k=%d, WCSS=%f".format(k, tm.computeCost(featureRdd)))
}
============== tuning parameters ==================
k=5, WCSS=89540755.504054
k=10, WCSS=36566061.126232
k=15, WCSS=23705349.962375
k=20, WCSS=18134353.720998
k=25, WCSS=14282108.404025
A.
◦ BikeShareClustering.scala Scala
◦ hour.csv data
◦ BikeShareClustering ( TODO
B. feature
◦ BikeClustering
● yrmo getFeatures KMeans
console yrmo
● numClusters (ex:50,75,100)
93
K-Means
! K-Means
! KMeans
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
94
Outline
!
(Binary Classification)
(Multi-Class Classification)
!
!
◦ (logistic regression) (decision
trees) (naive Bayes)
◦
95
!
!
(Features)
(Label)
! (Random Forest)
!
96
! import org.apache.spark.mllib.tree.DecisionTree
! import org.apache.spark.mllib.tree.model.DecisionTreeModel
! DecisionTree.trainClassifier Model(DecisionTreeModel
◦ val model=DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo,
impurity, maxDepth, maxBins)
● trainData RDD[LabeledPoint]
● numClasses 2
● categoricalFeaturesInfo declares which features of trainData are categorical, as a Map[feature index, number of categories];
features not listed are treated as continuous
● e.g. Map(0->2,4->10) marks the 1st and 5th features as categorical with 2 and 10 categories
● impurity (Gini Entropy)
● maxDepth
overfit
● maxBins
● maxBins must be at least as large as the largest category count declared in categoricalFeaturesInfo
97
Decision Tree in Spark MLlib
!
( )
!
threshold( )
◦ Features yr, season, mnth, hr, holiday, weekday,
workingday, weathersit, temp, atemp, hum, windspeed
◦ Label cnt 200 1 0
◦ numClasses 2
◦ impurity gini
◦ maxDepth 5
◦ maxBins 30
98
99
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Classification ( )
● src
● bike (package )
● BikeShareClassificationDT (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
100
101
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import decision tree library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareClassificationDT {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationDT").setMaster("local[*]"))
}
}
! Decision Tree Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
102
103
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryInfo
◦ categroyInfoMap
104
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder
val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) =>
{ if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name)
val lines:RDD[Array[String]] = rawDataNoHead.map { x =>
x.split(",").map { x => x.trim() } } //split columns with comma
val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity]
val lpData=bikeData.map { x => {
val label = if (x.cnt > 200) 1 else 0 //label is 1 when cnt > 200, otherwise 0
val features = Vectors.dense(getFeatures(x))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
105
getFeatures and getCategoryInfo
getFeatures method
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1,
bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday,
bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
} // categorical features such as season are shifted by 1 so their values start at 0
getCategoryInfo method
def getCategoryInfo(): Map[Int, Int]= {
val categoryInfoMap = Map[Int, Int](
(/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12),
(/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7),
(/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4))
categoryInfoMap
} //(index of the feature in featureArr, number of distinct values)
106
Model
Model
! trainModel
◦ DecisionTree.trainClassifier Model
! evaluateModel
◦ AUC trainModel Model
107
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
impurity: String, maxDepth: Int, maxBins: Int,cateInfo: Map[Int, Int]):
(DecisionTreeModel, Double) = {
val startTime = new DateTime() //
val model = DecisionTree.trainClassifier(trainData, 2, cateInfo, impurity,
maxDepth, maxBins) // Model
val endTime = new DateTime() //
val duration = new Duration(startTime, endTime) //
//MyLogger.debug(model.toDebugString) // Decision Tree
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double =
{
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] AUC
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auc = metrics.areaUnderROC()// areaUnderROC auc
auc
}
108
Model
Model! tuneParameter
◦ impurity Max Depth Max Bin
trainModel
evaluateModel AUC
109
AUC (Area under the Curve of ROC)
                            Predicted Positive (Label 1) | Predicted Negative (Label 0)
Actual Positive (Label 1)   true positive (TP)           | false negative (FN)
Actual Negative (Label 0)   false positive (FP)          | true negative (TN)
! TPR (True Positive Rate): of the samples whose actual label is 1, the fraction predicted as 1
◦ TPR = TP / (TP + FN)
! FPR (False Positive Rate): of the samples whose actual label is 0, the fraction predicted as 1
◦ FPR = FP / (FP + TN)
! Plotting FPR on the X axis against TPR on the Y axis gives the ROC curve
! AUC is the area under the ROC curve
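A small worked example with made-up counts: if TP=40, FN=10, FP=20, TN=30, then TPR = 40/(40+10) = 0.8 and FPR = 20/(20+30) = 0.4, giving the single ROC point (0.4, 0.8); sweeping the decision threshold traces out the full curve whose area is the AUC.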
110
AUC
AUC = 1: a perfect classifier, every prediction is correct (100%)
0.5 < AUC < 1: better than random guessing
AUC = 0.5: no better than random guessing
AUC < 0.5: worse than random guessing
AUC(Area under the Curve of ROC)
111
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint],
cateInfo: Map[Int, Int]) = {
val impurityArr = Array("gini", "entropy")
val depthArr = Array(3, 5, 10, 15, 20, 25)
val binsArr = Array(50, 100, 200)
val evalArr =
for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr)
yield { // train a model for every parameter combination and record its AUC
val (model, duration) = trainModel(trainData, impurity, maxDepth,
maxBins, cateInfo)
val auc = evaluateModel(validateData, model)
println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f"
.format(impurity, maxDepth, maxBins, auc))
(impurity, maxDepth, maxBins, auc)
}
val bestEval = (evalArr.sortBy(_._4).reverse)(0) // the combination with the highest AUC
println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
112
Decision Tree
//MLlib lib
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.linalg.Vectors
//decision tree
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareClassificationDT {
case class BikeShareEntity(…) // case class
def main(args: Array[String]): Unit = {
MyLogger.setLogger
val doTrain = (args != null && args.length > 0 && "Y".equals(args(0)))
val sc = new SparkContext(new SparkConf().setAppName("ClassificationDT").setMaster("local[*]"))
println("============== preparing data ==================")
val (trainData, validateData) = prepare(sc)
val cateInfo = getCategoryInfo()
if (!doTrain) {
println("============== train Model (CateInfo)==================")
val (modelC, durationC) = trainModel(trainData, "gini", 5, 30, cateInfo)
val aucC = evaluateModel(validateData, modelC)
println("validate auc(CateInfo)=%f".format(aucC))
} else {
println("============== tuning parameters(cateInfo) ==================")
tuneParameter(trainData, validateData, cateInfo)
}
}
}
A.
◦ BikeShareClassificationDT.scala Scala
◦ hour.csv data
◦ BikeShareClassificationDT ( TODO
B. feature
◦ BikeShareClassificationDT
● category AUC
● feature ( |correlation| > 0.1 ) Model AUC
113
Decision Tree
============== tuning parameters(cateInfo) ==================
parameter: impurity=gini, maxDepth=3, maxBins=50, auc=0.835524
parameter: impurity=gini, maxDepth=3, maxBins=100, auc=0.835524
parameter: impurity=gini, maxDepth=3, maxBins=200, auc=0.835524
parameter: impurity=gini, maxDepth=5, maxBins=50, auc=0.851846
parameter: impurity=gini, maxDepth=5, maxBins=100, auc=0.851846
parameter: impurity=gini, maxDepth=5, maxBins=200, auc=0.851846
! Simple linear regression ( y = ax + b ) predicts a continuous value (y)
◦ from an input variable (x) to an output value (y)
! Logistic regression, despite its name, is used for classification:
◦ it predicts which class a sample belongs to
! An S-shaped (sigmoid) function maps the output to a probability p;
when p exceeds 0.5 the sample is predicted as the positive class, otherwise the negative class
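The sigmoid used here, with z the linear combination of the features, is:
p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^{\top}x + b
and the sample is classified as 1 when p \geq 0.5, otherwise 0.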
114
! import org.apache.spark.mllib.classification.{
LogisticRegressionWithSGD, LogisticRegressionModel }
! LogisticRegressionWithSGD.train(trainData, numIterations,
stepSize, miniBatchFraction) Model(LogisticRegressionModel
◦val model=LogisticRegressionWithSGD.train(trainData,numIterations,
stepSize, miniBatchFraction)
● trainData RDD[LabeledPoint]
● numIterations (SGD) 100
● stepSize SGD 1
● miniBatchFraction 0~1
1
115
Logistic Regression in Spark
http://www.csie.ntnu.edu.tw/~u91029/Optimization.html
! Before training LogisticRegression, each categorical feature should be converted
with one-of-K (one-hot) encoding
! One-of-K encoding:
◦ expand the field into N positions (N = number of distinct values)
◦ set the position matching the value's index to 1 and every other position to 0
116
Categorical Features
weather: Clear=1, Mist=2, Light Snow=3, Heavy Rain=4
Index Map: weathersit value 1 -> index 0, 2 -> 1, 3 -> 2, 4 -> 3
One-hot Encoding: index 0 -> 1000, 1 -> 0100, 2 -> 0010, 3 -> 0001
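A minimal Scala sketch of this weathersit encoding (plain collections, mirroring the getCategoryFeature method shown later):
val weatherMap = Seq(1.0, 2.0, 3.0, 4.0).zipWithIndex.toMap // value -> index, e.g. 2.0 -> 1
def oneHot(v: Double): Array[Double] = {
  val arr = Array.ofDim[Double](weatherMap.size)
  arr(weatherMap(v)) = 1.0 // set the position for this value, leave the rest 0
  arr
}
// oneHot(2.0) == Array(0.0, 1.0, 0.0, 0.0)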
117
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Classification ( )
● src
● bike (package )
● BikeShareClassificationLG (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
118
119
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import Logistic library
//Logistic
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel }
object BikeShareClassificationLG {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationLG").setMaster("local[*]"))
}
}
! Logistic Regression Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
120
121
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryFeature
◦ 1-of-k encode Array[Double]
One-Of-K
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x => x.getField("weathersit") }
.distinct().collect().zipWithIndex.toMap //build the value -> index Map
val lpData=bikeData.map { x => {
val label = x.getLabel()
val features = Vectors.dense(x.getFeatures(weatherMap))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}} … }
def getFeatures (weatherMap: Map[Double, Int])= {
var rtnArr: Array[Double] = Array()
var weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size)
//weatherArray=Array(0,0,0,0)
val index = weatherMap(getField("weathersit")) //weathersit=2; index=1
weatherArray(index) = 1 //weatherArray=Array(0,1,0,0)
rtnArr = rtnArr ++ weatherArray
…. }
! Standardization rescales each feature so that fields with large values or variance
do not dominate the others
! Spark MLlib provides StandardScaler for this
StandardScaler
def prepare(sc): RDD[LabeledPoint] = { …
val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) }
//fit the StandardScaler on the full feature RDD
val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd)
val lpData2= bikeData.map { x =>
{
val label = x.getLabel()
//standardize the features before building the LabeledPoint
val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap)))
new LabeledPoint(label, features)
} }
…
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x =>
x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value -> index Map
//Standardize
val featureRddWithMap = bikeData.map { x =>
Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap,
weekdayMap, workdayMap, weatherMap))}
val stdScalerWithMap = new StandardScaler(withMean = true, withStd =
true).fit(featureRddWithMap)
//build the LabeledPoint with one-of-K encoded category features
val lpData = bikeData.map { x => {
val label = if (x.cnt > 200) 1 else 0 //label is 1 when cnt > 200, otherwise 0
val features =
stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap,
mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)))
new LabeledPoint(label, features)
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
125
getFeatures
getFeatures method
def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int],
seasonMap: Map[Double, Int], mnthMap: Map[Double, Int],
hrMap: Map[Double, Int], holidayMap: Map[Double, Int],
weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int],
weatherMap: Map[Double, Int]): Array[Double] = {
var featureArr: Array[Double] = Array()
//
featureArr ++= getCategoryFeature(bikeData.yr, yrMap)
featureArr ++= getCategoryFeature(bikeData.season, seasonMap)
featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap)
featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap)
featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap)
featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap)
featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap)
featureArr ++= getCategoryFeature(bikeData.hr, hrMap)
//
featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
}
126
getCategoryFeature
getCategoryFeature method
def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]):
Array[Double] = {
var featureArray = Array.ofDim[Double](categoryMap.size)
val index = categoryMap(fieldVal)
featureArray(index) = 1
featureArray
}
127
Model
Model
! trainModel
◦ train the Model with LogisticRegressionWithSGD.train
! evaluateModel
◦ AUC trainModel Model
128
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
numIterations: Int, stepSize: Double, miniBatchFraction: Double):
(LogisticRegressionModel, Double) = {
val startTime = new DateTime()
// LogisticRegressionWithSGD.train
val model = LogisticRegressionWithSGD.train(trainData, numIterations, stepSize,
miniBatchFraction)
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
//MyLogger.debug(model.toPMML()) // model debug
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: LogisticRegressionModel):
Double = {
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] AUC
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auc = metrics.areaUnderROC()// areaUnderROC auc
auc
}
129
Model
Model! tuneParameter
◦ iteration stepSize miniBatchFraction
trainModel evaluateModel
AUC
130
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint])
= {
val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100)
val stepSizeArr: Array[Double] = Array(10, 50, 100, 200)
val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1)
val evalArr =
for (iteration <- iterationArr; stepSize <- stepSizeArr;
miniBatchFraction <- miniBatchFractionArr)
yield { // train a model for every parameter combination and record its AUC
val (model, duration) = trainModel(trainData, iteration, stepSize,
miniBatchFraction)
val auc = evaluateModel(validateData, model)
println("parameter: iteration=%d, stepSize=%f, batchFraction=%f, auc=%f"
.format(iteration, stepSize, miniBatchFraction, auc))
(iteration, stepSize, miniBatchFraction, auc)
}
val bestEval = (evalArr.sortBy(_._4).reverse)(0) // the combination with the highest AUC
println("best parameter: iteration=%d, stepSize=%f, batchFraction=%f, auc=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A.
◦ BikeShareClassificationLG.scala Scala
◦ hour.csv data
◦ BikeShareClassificationLG ( TODO
B. feature
◦ BikeShareClassificationLG
● category AUC
● feature ( |correlation| > 0.1 ) Model AUC
131
Logistic Regression
============== tuning parameters(Category) ==================
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.500000, auc=0.857073
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.800000, auc=0.855904
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=1.000000, auc=0.855685
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.500000, auc=0.852388
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.800000, auc=0.852901
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=1.000000, auc=0.853237
parameter: iteraion=5, stepSize=100.000000, miniBatchFraction=0.500000, auc=0.852087
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
132
Outline
!
!
!
◦ (Least Squares) Lasso
(ridge regression)
133
! import org.apache.spark.mllib.tree.DecisionTree
! import org.apache.spark.mllib.tree.model.DecisionTreeModel
! DecisionTree.trainRegressor Model(DecisionTreeModel
◦ val model=DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity,
maxDepth, maxBins)
● trainData RDD[LabeledPoint]
● categoricalFeaturesInfo declares which features of trainData are categorical, as a Map[feature index, number of categories];
features not listed are treated as continuous
● e.g. Map(0->2,4->10) marks the 1st and 5th features as categorical with 2 and 10 categories
● impurity ( variance)
● maxDepth
overfit
● maxBins
● maxBins must be at least as large as the largest category count declared in categoricalFeaturesInfo
134
Decision Tree Regression in Spark
! Model
◦ Features yr, season, mnth, hr, holiday,
weekday, workingday, weathersit, temp, atemp,
hum, windspeed
◦ Label cnt
◦ impurity variance (the only impurity supported for regression trees)
◦ maxDepth 5
◦ maxBins 30
135
136
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Regression ( )
● src
● bike (package )
● BikeShareRegressionDT (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
137
138
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import decision tree library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareRegressionDT {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionDT").setMaster("local[*]"))
}
}
! Decision Tree Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
139
140
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryInfo
◦ categroyInfoMap
141
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder
val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) =>
{ if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name)
val lines:RDD[Array[String]] = rawDataNoHead.map { x =>
x.split(",").map { x => x.trim() } } //split columns with comma
val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity]
val lpData=bikeData.map { x => {
val label = x.cnt //the prediction target is the rental count (cnt) field
val features = Vectors.dense(getFeatures(x))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
142
getFeatures and getCategoryInfo
getFeatures method
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1,
bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday,
bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
} // categorical features such as season are shifted by 1 so their values start at 0
getCategoryInfo method
def getCategoryInfo(): Map[Int, Int]= {
val categoryInfoMap = Map[Int, Int](
(/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12),
(/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7),
(/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4))
categoryInfoMap
} //(index of the feature in featureArr, number of distinct values)
143
Model
Model
! trainModel
◦ DecisionTree.trainRegressor Model
! evaluateModel
◦ RMSE trainModel
Model
! Also written as root-mean-square deviation or root-mean-square error
! Plays a role similar to the sample standard deviation, but for the prediction errors
! The smaller the RMSE, the closer the predictions are to the actual values
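With \hat{y}_i the predicted and y_i the actual rental count over n validation samples:
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}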
144
RMSE(root-mean-square error)
145
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
impurity: String, maxDepth: Int, maxBins: Int, cateInfo: Map[Int,Int]):
(DecisionTreeModel, Double) = {
val startTime = new DateTime() //
val model = DecisionTree.trainRegressor(trainData, cateInfo, impurity, maxDepth,
maxBins) // Model
val endTime = new DateTime() //
val duration = new Duration(startTime, endTime) //
//MyLogger.debug(model.toDebugString) // Decision Tree
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double =
{
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] RMSE
}
val metrics = new RegressionMetrics(scoreAndLabels)
val rmse = metrics.rootMeanSquaredError()// rootMeanSquaredError rmse
rmse
}
146
Model
Model! tuneParameter
◦ Max Depth Max Bin
trainModel evaluateModel
RMSE
147
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint],
cateInfo: Map[Int, Int]) = {
val impurityArr = Array("variance")
val depthArr = Array(3, 5, 10, 15, 20, 25)
val binsArr = Array(50, 100, 200)
val evalArr =
for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr)
yield { // train a model for every parameter combination and record its RMSE
val (model, duration) = trainModel(trainData, impurity, maxDepth,
maxBins, cateInfo)
val rmse = evaluateModel(validateData, model)
println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f"
.format(impurity, maxDepth, maxBins, rmse))
(impurity, maxDepth, maxBins, rmse)
}
val bestEvalAsc = (evalArr.sortBy(_._4))
val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE
println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A.
◦ BikeShareRegressionDT.scala Scala
◦ hour.csv data
◦ BikeShareRegressionDT.scala ( TODO
B. feature
◦ BikeShareRegressionDT
● feature dayType(Double ) dayType
● holiday=0 and workingday=0 -> dayType=0
● holiday=1 -> dayType=1
● holiday=0 and workingday=1 -> dayType=2
● dayType feature Model( getFeatures getCategoryInfo)
◦ Categorical Info
148
Decision Tree
============== tuning parameters(CateInfo) ==================
parameter: impurity=variance, maxDepth=3, maxBins=50, rmse=118.424606
parameter: impurity=variance, maxDepth=3, maxBins=100, rmse=118.424606
parameter: impurity=variance, maxDepth=3, maxBins=200, rmse=118.424606
parameter: impurity=variance, maxDepth=5, maxBins=50, rmse=93.138794
parameter: impurity=variance, maxDepth=5, maxBins=100, rmse=93.138794
parameter: impurity=variance, maxDepth=5, maxBins=200, rmse=93.138794
! Least Squares
!
149
! import org.apache.spark.mllib.regression.{LinearRegressionWithSGD,
LinearRegressionModel}
! LinearRegressionWithSGD.train(trainData, numIterations,
stepSize) Model(LinearRegressionModel
◦ val model=LinearRegressionWithSGD.train(trainData, numIterations,
stepSize)
● trainData RDD[LabeledPoint]
● numIterations (SGD)
● stepSize SGD 1
stepSize
● miniBatchFraction 0~1
1
150
Least Squares Regression in Spark
151
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Regression ( )
● src
● bike (package )
● BikeShareRegressionLR (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
152
153
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import linear regression library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{ LinearRegressionWithSGD, LinearRegressionModel }
object BikeShareRegressionLR {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionLR").setMaster("local[*]"))
}
}
! Linear Regression Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
154
155
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryFeature
◦ 1-of-k encode Array[Double]
One-Of-K
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x => x.getField("weathersit") }
.distinct().collect().zipWithIndex.toMap //build the value -> index Map
val lpData=bikeData.map { x => {
val label = x.getLabel()
val features = Vectors.dense(x.getFeatures(weatherMap))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}} … }
def getFeatures (weatherMap: Map[Double, Int])= {
var rtnArr: Array[Double] = Array()
var weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size)
//weatherArray=Array(0,0,0,0)
val index = weatherMap(getField("weathersit")) //weathersit=2; index=1
weatherArray(index) = 1 //weatherArray=Array(0,1,0,0)
rtnArr = rtnArr ++ weatherArray
…. }
! Standardization rescales each feature so that fields with large values or variance
do not dominate the others
! Spark MLlib provides StandardScaler for this
StandardScaler
def prepare(sc): RDD[LabeledPoint] = { …
val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) }
//fit the StandardScaler on the full feature RDD
val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd)
val lpData2= bikeData.map { x =>
{
val label = x.getLabel()
//standardize the features before building the LabeledPoint
val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap)))
new LabeledPoint(label, features)
} }
…
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x =>
x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value -> index Map
//Standardize
val featureRddWithMap = bikeData.map { x =>
Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap,
weekdayMap, workdayMap, weatherMap))}
val stdScalerWithMap = new StandardScaler(withMean = true, withStd =
true).fit(featureRddWithMap)
//build the LabeledPoint with one-of-K encoded category features
val lpData = bikeData.map { x => {
val label = x.cnt //the prediction target is the rental count (cnt) field
val features =
stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap,
mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)))
new LabeledPoint(label, features)
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
159
getFeatures
getFeatures method
def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int],
seasonMap: Map[Double, Int], mnthMap: Map[Double, Int],
hrMap: Map[Double, Int], holidayMap: Map[Double, Int],
weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int],
weatherMap: Map[Double, Int]): Array[Double] = {
var featureArr: Array[Double] = Array()
//
featureArr ++= getCategoryFeature(bikeData.yr, yrMap)
featureArr ++= getCategoryFeature(bikeData.season, seasonMap)
featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap)
featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap)
featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap)
featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap)
featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap)
featureArr ++= getCategoryFeature(bikeData.hr, hrMap)
//
featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
}
160
getCategoryFeature
getCategoryFeature method
def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]):
Array[Double] = {
var featureArray = Array.ofDim[Double](categoryMap.size)
val index = categoryMap(fieldVal)
featureArray(index) = 1
featureArray
}
161
Model
Model
! trainModel
◦ train the Model with LinearRegressionWithSGD.train
! evaluateModel
◦ RMSE trainModel
Model
162
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
numIterations: Int, stepSize: Double, miniBatchFraction: Double):
(LinearRegressionModel, Double) = {
val startTime = new DateTime()
// LinearRegressionWithSGD.train
val model = LinearRegressionWithSGD.train(trainData, numIterations, stepSize,
miniBatchFraction)
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
//MyLogger.debug(model.toPMML()) // model debug
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: LinearRegressionModel):
Double = {
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] RMSE
}
val metrics = new RegressionMetrics(scoreAndLabels)
val rmse = metrics.rootMeanSquaredError()// rootMeanSquaredError rmse
rmse
}
163
Model
Model! tuneParameter
◦ iteration stepSize miniBatchFraction
trainModel evaluateModel
RMSE
164
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint])
= {
val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100)
val stepSizeArr: Array[Double] = Array(10, 50, 100, 200)
val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1)
val evalArr =
for (iteration <- iterationArr; stepSize <- stepSizeArr;
miniBatchFraction <- miniBatchFractionArr)
yield { // train a model for every parameter combination and record its RMSE
val (model, duration) = trainModel(trainData, iteration, stepSize,
miniBatchFraction)
val rmse = evaluateModel(validateData, model)
println("parameter: iteration=%d, stepSize=%f, batchFraction=%f, rmse=%f"
.format(iteration, stepSize, miniBatchFraction, rmse))
(iteration, stepSize, miniBatchFraction, rmse)
}
val bestEvalAsc = (evalArr.sortBy(_._4))
val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE
println("best parameter: iteration=%d, stepSize=%f, batchFraction=%f, rmse=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A.
◦ BikeShareRegressionLR.scala Scala
◦ hour.csv data
◦ BikeShareRegressionLR.scala ( TODO
B. feature
◦ BikeShareRegressionLR
● feature dayType(Double ) dayType
● holiday=0 and workingday=0 -> dayType=0
● holiday=1 -> dayType=1
● holiday=0 and workingday=1 -> dayType=2
● dayType feature Model( getFeatures getCategoryInfo)
◦
165
Linear Regression
============== tuning parameters(Category) ==================
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.500000, rmse=256.370620
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.800000, rmse=256.376770
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=1.000000, rmse=256.407185
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.500000, rmse=250.037095
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.800000, rmse=250.062817
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=1.000000, rmse=250.126173
! Random Forest (multitude) (Decision
Tree)
◦ (mode)
◦ (mean)
!
◦ overfit
◦ (missing value)
◦
166
(RandomForest)
! import org.apache.spark.mllib.tree.RandomForest
! import org.apache.spark.mllib.tree.model.RandomForestModel
! RandomForest.trainRegressor Model(RandomForestModel
◦ val model=RandomForest.trainRegressor(trainData, categoricalFeaturesInfo,numTrees,
featureSubsetStrategy, impurity, maxDepth, maxBins)
● trainData RDD[LabeledPoint]
● categoricalFeaturesInfo declares which features of trainData are categorical, as a Map[feature index, number of categories];
features not listed are treated as continuous
● e.g. Map(0->2,4->10) marks the 1st and 5th features as categorical with 2 and 10 categories
● numTrees ( Model )
● impurity ( variance)
● featureSubsetStrategy Feature ( auto )
● maxDepth
● overfit
●
● maxBins
● maxBins must be at least as large as the largest category count declared in categoricalFeaturesInfo
167
Random Forest Regression in Spark
168
trainModel and evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
impurity: String, maxDepth: Int, maxBins: Int): (RandomForestModel, Double) = {
val startTime = new DateTime()
val cateInfo = BikeShareEntity.getCategoryInfo(true) // categoricalFeaturesInfo
val model = RandomForest.trainRegressor(trainData, cateInfo, 3, "auto", impurity,
maxDepth, maxBins) // train a Model with 3 trees
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
//MyLogger.debug(model.toDebugString) // dump the trained trees
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: RandomForestModel): Double =
{
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] RMSE
}
val metrics = new RegressionMetrics(scoreAndLabels)
val rmse = metrics.rootMeanSquaredError()// rootMeanSquaredError rmse
rmse
}
! [Exercise]
◦ Regression.zip Package Object
data Build Path Scala
IDE
◦ BikeShareRegressionRF
◦ RandomForest Decision Tree
169
RandomForest Regression
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
!
170
Outline
! Etu & udn Hadoop Competition 2016 

! ETU udn udn
(
Open Data)


! EHC 2015/6 ~ 2015/10
(View) (Order)
(Member) 2015/11
(Storeid) (Cat_id1) (Cat_id2)
172
1) Data Feature LabeledPoint Data
◦ Feature : 6~9 View/Order
◦ Label : 10 Order ( 1 0)
◦ Feature : …
2) LabeledPoint Data Training Set Validating Set( 6:4
Split)
3) Training Set Validating Set Machine Learning Model
4) Testing Set
◦ Feature : 6~10 View/Order
◦ Features : 1)
5) 3) Model Testing Set
6)
7) 1) ~ 6)
173
! View/Order uid-storeid-cat_id1-cat_id2
Features
! ( RFM 6~9 View/Order )
◦ View – viewRecent, viewCnt, viewLast1MCnt,
viewLast2MCnt( ,6~9 , ,
)
◦ Order – orderRecent, orderCnt, orderLast1MCnt,
orderLast2MCnt( ,6~9 , ,
)
◦ avgDaySpan, avgViewCnt, lastViewDate, lastViewCnt (
, , ,
)
174
– Features(I)
! Member features (a sketch follows after this slide)
◦ gender, ageScore, cityScore (gender, ordinal encoding of the age range,
ordinal encoding of the city)
◦ ageScore: encoded 1~11
● EX: if (ages.equals(20 )) ageScore = 1
◦ cityScore: encoded 1~24
● EX: if (livecity.equals( )) cityScore = 24
! Missing Value handling
◦ Fill missing values with a default value
● Gender: 2 (unknown)
● Ages: 35-39
● City:
175
– Features(II)
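A hypothetical sketch of the ordinal encoding and missing-value defaults described above (the lookup tables and default scores are illustrative, not the actual competition values):
// Map age ranges / cities onto ordinal scores; unseen or missing values get the defaults above
val ageTable = Map("20-" -> 1.0, "20-24" -> 2.0 /* ... up to the oldest range -> 11.0 */)
val cityTable = Map("cityA" -> 1.0 /* ... */, "cityX" -> 24.0)
def ageScore(ages: String): Double = ageTable.getOrElse(ages, 4.0)   // missing -> the "35-39" bucket
def cityScore(city: String): Double = cityTable.getOrElse(city, 1.0) // missing -> default city score
def genderScore(g: String): Double = if (g == null || g.isEmpty) 2.0 else g.toDouble // missing -> 2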
! Weather features (Central Weather Bureau Open Data)
◦ http://www.cwb.gov.tw/V7/climate/monthlyData/mD.htm
◦ Monthly statistics for months 6~10
◦
◦
◦ Pre-processed file: https://drive.google.com/file/d/0B-
b4FvCO9SYoN2VBSVNjN3F3a0U/view?usp=sharing
! 35 Features in total (keyed by uid-storeid-cat_id1-cat_id2)
176
– Features(III)
177
– LabeledPoint Data
Sort each feature's values and encode them into N buckets
EX: viewCnt encoded into 5 levels, from viewCnt=5 for the top bucket down to viewCnt=1 for the lowest (a sketch follows)
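One plausible reading of the bucket encoding above, as a rank-based sketch (bucket boundaries are illustrative, not the actual cut-offs used in the competition):
import org.apache.spark.rdd.RDD
// Rank the values of one numeric feature and map them onto `levels` ordinal buckets
def bucketEncode(values: RDD[Double], levels: Int = 5): RDD[(Double, Double)] = {
  val n = values.count().toDouble
  values.sortBy(identity).zipWithIndex().map { case (v, idx) =>
    (v, (idx * levels / n).toInt + 1.0) // smallest values -> 1, largest -> `levels`
  }
}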
! Xgboost (Extreme Gradient Boosting)
◦ Input: LabeledPoint Data (Training Set)
● 35 Features
● Label (1/0; Label=1 means an order was placed, otherwise 0)
◦ Parameter:
● max_depth: maximum Tree depth
● nround: number of boosting rounds
● Objective: binary:logistic (binary classification)
◦ Implement:
178
– Machine Learning(I)
val param = List("objective" -> "binary:logistic", "max_depth" -> 6)
val model = XGBoost.train(trainSet, param, nround, 2, null, null)
! Xgboost evaluation and tuning
◦ Evaluate (with validating Set):
● val predictRes = model.predict(validateSet)
● Score the predictions with the F_measure
◦ Parameter Tuning:
● Sweep max_depth=(5~10) and nround=(10~25) and keep the best
combination (see the sketch after the metrics below)
● Best result: max_depth=6, nround=10
179
– Machine Learning(II)
Precision = 0.16669166166766647 F1 measure = 0.15969926394341
Accuracy = 0.15065655700028824 Micro recall = 0.21370309951060
Micro precision = 0.3715258082813 Micro F1 measure = 0.271333885
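A hedged sketch of that parameter sweep, reusing only the Xgboost calls shown on these slides; evaluateFMeasure is an assumed helper that turns predictRes and the validating Set into an F-measure:
// Hypothetical grid search over max_depth and nround, scored on the validating Set
for (maxDepth <- 5 to 10; nround <- Array(10, 15, 20, 25)) {
  val param = List("objective" -> "binary:logistic", "max_depth" -> maxDepth)
  val model = XGBoost.train(trainSet, param, nround, 2, null, null)
  val predictRes = model.predict(validateSet)
  println("max_depth=%d, nround=%d, F1=%f".format(maxDepth, nround, evaluateFMeasure(predictRes, validateSet)))
}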
! Performance Improvement
◦ Use the trained model to pick the top-N most important Features and re-train with only those Features
180
– Machine Learning(III)
Run time: 90000ms -> 72000ms (local mode)
! Run on the cluster with yarn as the resource manager
◦ Use spark-submit to dispatch the JOB to the Workers
181
spark-submit --class ehc.RecommandV4 --deploy-mode cluster --
master yarn ehcFinalV4.jar
! Do not hard-code the master URL when you new the SparkContext
new SparkContext(new
SparkConf().setAppName("ehcFinal051").setMaster("local[4]"))
➔ Remove setMaster and let spark-submit (--master) decide, as in the sketch below
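A minimal sketch of the cluster-friendly version (only setMaster is removed; the app name comes from the snippet above):
// The master now comes from spark-submit's --master option instead of being hard-coded
val sc = new SparkContext(new SparkConf().setAppName("ehcFinal051"))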
182
Spark-submit Run Script Sample
###### Script to submit the Driver Program to Spark (with yarn as the cluster Manager) via Spark Submit ######
###### for linux-like system #########
# delete output on hdfs first
`hadoop fs -rm -R -f /user/team007/data/output`
# submit spark job
echo -e "processing spark job"
spark-submit --deploy-mode cluster --master yarn --jars lib/jcommon-1.0.23.jar,lib/
joda-time-2.2.jar --class ehc.RecommandV4 ehcFinalV4.jar Y
# write to result_yyyyMMddHHmmss.txt
echo -e "write to outFile"
hadoop fs -cat /user/team007/data/output/part-* > 'result_'`date +%Y%m%d%H%M%S`'.txt'
! Feature
! Feature
◦
183
–
! Input Single Node
◦ Worker merge
◦ uid-storeid-cat_id1-cat_id2 Sort
! F-Measure
◦ Used to evaluate and compare Models
◦ Computed with Spark's MultilabelMetrics (see the snippet below)
!
◦
184
–
val scoreAndLabels: RDD[(Array[Double], Array[Double])] = …
val metrics = new MultilabelMetrics(scoreAndLabels)
println(s"F1 measure = ${metrics.f1Measure}")
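For context, a hypothetical way the (predictions, labels) pairs could be assembled before the snippet above; predictedByUid and actualByUid are assumed RDDs keyed by member, holding arrays of encoded storeid-cat_id1-cat_id2 label codes:
// Join each member's predicted label codes with the labels actually ordered in the target month
val scoreAndLabels: RDD[(Array[Double], Array[Double])] =
  predictedByUid.join(actualByUid).map { case (_, (pred, actual)) => (pred, actual) }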
!
! Spark MLlib
◦ Feature
Engineering
! Spark MLlib
◦
185
186

[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探

  • 2. 2 Law[Data applications] are like sausages. It is better not to see them being made. —Otto von Bismarck
  • 3. ! Spark ◦ ● Spark ● Scala ● RDD ! LAB ◦ ~ ● Spark Scala IDE ! Spark MLlib ◦ … ● Scala + lambda + Spark MLlib ● Clustering Classification Regression 3
  • 4. ! github page: https://github.com/yclee0418/sparkTeach ◦ installation: Spark ◦ codeSample: Spark ● exercise - ● https://github.com/yclee0418/sparkTeach/tree/master/ codeSample/exercise ● final - ● https://github.com/yclee0418/sparkTeach/tree/master/ codeSample/final 4
  • 5. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib 5 Outline
  • 6. ! ◦ 2020 44ZB(IDC 2013~2014) ◦ ! ◦ MapReduce by Google(2004) ◦ Hadoop HDFS MapReduce by Yahoo!(2005) ◦ Spark Hadoop 10~1000 by AMPLab (2013) ! [ ]Spark Hadoop 6 –
  • 7. ! AMPLab ! ! API ◦ Java Scala Python R ! One Stack to rule them all ◦ SQL Streaming ◦ RDD 7 Spark
  • 8. ! Cluster Manager ◦ Standalone – Spark Manager ◦ Apache Mesos ◦ Hadoop YARN 8 Spark
  • 9. ! [exercise]Spark ◦ JDK 1.8 ◦ spark-2.0.1.tgz(http://spark.apache.org/downloads.html) ◦ Terminal (for Mac) ● cd /Users/xxxxx/Downloads ( ) ● tar -xvzf spark-2.0.1.tgz ( ) ● sudo mv spark-2.0.1 /usr/local/spark (spark /usr/local) ● cd /usr/local/spark ● ./build/sbt package( spark 1 ) ● ./bin/spark-shell ( Spark shell pwd / usr/local/spark) 9 Spark (2.0.1) [Tips] https://goo.gl/oxNbIX ./bin/run-example org.apache.spark.examples.SparkPi
  • 10. ! Spark Shell Spark command line ◦ Spark ! spark-shell ◦ [ ] Spark binspark-shell ! ◦ var res1: Int = 3 + 5 ◦ import org.apache.spark.rdd._ ◦ val intRdd: RDD[Int]=sc.parallelize(List(1,2,3,4,5)) ◦ intRdd.collect ◦ val txtRdd=sc.textFile(file:///Spark /README.md) ◦ txtRdd.count ! spark-shell ◦ [ ] :quit Ctrl D 10 Spark Shell Spark Scala [Tips]: ➢ var val ? ➢ intRdd txtRdd ? ➢ org. [Tab] ? ➢ http://localhost:4040
  • 11. ! Spark ! RDD(Resilient Distributed Dataset) ! Scala ! Spark MLlib 11 Outline
  • 12. ! Google ! Map Reduce ! MapReduce ◦ Map (K1, V1) ! list(K2, V2) ◦ Reduce (K2, list(V2))!list(K3, V3) ! ( Word Count ) 12 RDD MapReduce
  • 13. ! MapReduce on Hadoop Word Count … ◦ iteration iteration ( ) … 13 Hadoop … HDFS
  • 14. ! Spark – RDD(Resilient Distribute Datasets) ◦ In-Memory Data Processing and Sharing ◦ (tolerant) (efficient) ! ◦ (lineage) – RDD ◦ lineage ! ◦ Transformations: In memory lazy lineage RDD ◦ Action: return Storage ◦ Persistence: RDD 14 Spark … : 1+2+3+4+5 = 15 Transformation Action
  • 16. ! SparkContext.textFile – RDD ! map: RDD RDD ! filter: RDD RDD ! reduceByKey: RDD Key RDD Key ! groupByKey: RDD Key RDD ! join cogroup: RDD Key RDD ! sortBy reverse: RDD ! take(N): RDD N RDD ! saveAsTextFile: RDD 16 RDD
  • 17. ! count: RDD ! collect: RDD Collection(Seq ! head(N): RDD N ! mkString: Collection 17 [Tips] • • Transformation
  • 18. ! [Exercise] spark-shell ◦val intRDD = sc.parallelize(List(1,2,3,4,5,6,7,8,9,0)) ◦intRDD.map(x => x + 1).collect() ◦intRDD.filter(x => x > 5).collect() ◦intRDD.stats ◦val mapRDD=intRDD.map{x=>(g+(x%3), x)} ◦mapRDD.groupByKey.foreach{x=>println(key: %s, vals=%s.format(x._1, x._2.mkString(,)))} ◦mapRDD.reduceByKey(_+_).foreach(println) ◦mapRDD.reduceByKey{case(a,b) => a+b}.foreach(println) 18 RDD
  • 19. ! [Exercise] (The Gettysburg Address) ◦ (The Gettysburg Address)(https:// docs.google.com/file/d/0B5ioqs2Bs0AnZ1Z1TWJET2NuQlU/ view) gettysburg.txt ◦ gettysburg.txt ( ) ● ◦ ◦ ◦ 19 RDD (Word Count ) sc.textFile flatMap split toLowerCase, filter sortBy foreach https://github.com/yclee0418/sparkTeach/blob/master/codeSample/exercise/ WordCount_Rdd.txt take(5) foreach reduceByKey
  • 20. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib 20 Outline
  • 21. ! Scala Scalable Language ( ) ! Scala ◦ lambda expression Scala Scala: List(1,2,3,4,5).foreach(x=>println(item %d.format(x))) Java: Int[] intArr = new Array[] {1,2,3,4,5}; for (int x: intArr) println(String.format(item %d, x)); ! scala Java .NET ! ( actor model akka) ! Spark
  • 22. ! import ◦ import org.apache.spark.SparkContext ◦ import org.apache.spark.rdd._ ( rdd class) ◦ import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } ( clustering class) ! ◦ val int1: Int = 5 ( error) ◦ var int2: Int = 5 ( ) ◦ val int = 5 ( ) ! ( ) ◦ def voidFunc(param1: Type, param2: Type2) = { … } 22 Scala def setLogger = { Logger.getLogger(com).setLevel(Level.OFF) Logger.getLogger(io).setLevel(Level.OFF) }
  • 23. ! ( ) ◦ def rtnFunc1(param1: Type, param2: Type2): Type3 = { val v1:Type3 = … v1 // } ! ( ) ◦ def rtnFunc2(param1: Type, param2: Type2): (Type3, Type4) = { val v1: Type3 = … val v2: Type4= … (v1, v2) // } 23 Scala def getMinMax(intArr: Array[Int]):(Int,Int) = { val min=intArr.min val max=intArr.max (min, max) }
  • 24. ! ◦ val res = rtnFunc1(param1, param2) ( res ) ◦ val (res1, res2) = rtnFunc2(param1, param2) ( res1,res2 ) ◦ val (_, res2) = rtnFunc2(param1, param2) ( ) ! For Loop ◦ for (i <- collection) { … } ! For Loop ( yield ) ◦ val rtnArr = for (i <- collection) yield { … } 24 Scala val intArr = Array(1,2,3,4,5,6,7,8,9) val multiArr= for (i <- intArr; j <- intArr) yield { i*j } //multiArr 81 99 val (min,max)=getMinMax(intArr) val (_, max)=getMinMax(intArr)
  • 25. ! Tuple ◦ Tuple ◦ val v=(v1,v2,v3...) v._1, v._2, v._3… ◦ lambda ◦ lambda (_) 25 Scala val intArr = Array(1,2,3,4,5,7,8,9) val res=getMinMax(intArr) //res=(1,9)=>tuple val min=res._1 // res val max=res._2 // res val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple val intArr2=intArr.map(x=> (x._1 * x._2 * x._3)) //intArr2: Array[Int] = Array(6, 120, 504) val intArr3=intArr.filter(x=> (x._1 + x._2 > x._3)) //intArr3: Array[(Int, Int, Int)] = Array((4,5,6), (7,8,9)) val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple def getThird(x:(Int,Int,Int)): Int = { (x._3) } val intArr2=intArr.map(getThird(_)) val intArr2=intArr.map(x=>getThird(x)) // //intArr2: Array[Int] = Array(3, 6, 9)
  • 26. ! Class ◦ Scala Class JAVA Class ● private / protected public ● Class 26 Scala Scala: class Person(userID: Int, name: String) // private class Person(val userID: Int, var name: String) // public userID val person = new Person(102, John Smith)// person.userID // 102 Person class Java : public Class Person { private final int userID; private final String name; public Person(int userID, String name) { this.userID = userID; this.name = name; }}
  • 27. ! Object ◦ Scala static instance ◦ Scala Object static ● Scala Object singleton class instance ! Scala Object vs Class ◦ object utility Spark Driver Program ◦ class Entity 27 Scala Scala Object: object Utility { def isNumeric(input: String): Boolean = input.trim() .matches(s[+-]?((d+(ed+)?[lL]?)|(((d+(.d*)?)|(.d+))(ed+)?[fF]?))) def toDouble(input: String): Double = { val rtn = if (input.isEmpty() || !isNumeric(input)) Double.NaN else input.toDouble rtn }} val d = Utility.toDouble(20) // new
  • 28. ! ◦ val intArr = Array(1,2,3,4,5,7,8,9) ! ◦ val intArrExtra = intArr ++ Array(0,11,12) ! map: ! filter: ! join: Map Key Map ! sortBy reverse: ! take(N): N 28 scala val intArr = Array(1,2,3,4,5,7,8,9) val intArr2=intArr.map(_ * 2) //intArr2: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18) val intArr3=intArr.filter(_ > 5) //intArr3: Array[Int] = Array(6, 7, 8, 9) val intArr4=intArr.reverse //intArr4: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
  • 29. ! sum: ◦ val sum = Array(1,2,3,4,5,7,8,9).sum ! max: ◦ val max = Array(1,2,3,4,5,7,8,9).max ! min: ◦ val max = Array(1,2,3,4,5,7,8,9).min ! distinct: 29 scala val intArr = Array(1,2,3,4,5,7,8,9) val sum = intArr.sum //sum = 45 val max = intArr.max //max = 9 val min = intArr.min //min = 1 val disc = Array(1,1,1,2,2,2,3,3) //disc = Array(1,2,3)
  • 30. ! spark-shell ! ScalaIDE for eclipse 4.4.1 ◦ http://scala-ide.org/download/sdk.html ◦ ◦ ( ) ◦ ◦ ScalaIDE 30 (IDE)
  • 31. ! Driver Program(word complete breakpoint ) ! spark-shell jar ! ◦Eclipse 4.4.2 (Luna) ◦ Scala IDE 4.4.1 ◦ Scala 2.11.8 and Scala 2.10.6 ◦ Sbt 0.13.8 ◦ Scala Worksheet 0.4.0 ◦ Play Framework support 0.6.0 ◦ ScalaTest support 2.10.0 ◦ Scala Refactoring 0.10.0 ◦ Scala Search 0.3.0 ◦ Access to the full Scala IDE ecosystem 31 Scala IDE for eclipse
  • 32. ! Scala IDE Driver Program ◦ Scala Project ◦ Build Path ● Spark ● Scala ◦ package ● package ( ) ◦ scala object ◦ ◦ debug ◦ Jar ◦ spark-submit Spark 32 Scala IDE Driver Program
  • 33. ! Scala IDE ◦ FILE -> NEW -> Scala Project ◦ project FirstScalaProj ◦ JRE 1.8 (1.7 ) ◦ ◦ Finish 33 Scala Project
  • 34. ! Package Explorer Project Explorer FirstScalaProj Build Path -> Configure Build Path 34 Build Path [Tips]: Q: Package Project Explorer A: ! Scala perspective ! Scala perspective -> Window -> Show View
  • 35. ! Spark Driver Program Build Path ◦ jar ◦ Scala Library Container 2.11.8(IDE 2.11.8 ) ! Configure Build Path Java Build Path Libraries - >Add External JARs… ◦Spark Jar Spark /assembly/target/scala-2.11/jars/ ◦ jar ! Java Build Path Scala Library Container 2.11.8 35 Build Path
  • 36. ! Package Explorer FirstScalaProj src package ◦ src ->New->Package( Ctrl N) ◦ bikeSharing Package ! FirstScalaProj data (Folder) input 36 Package
  • 37. ! (gettysburg.txt)copy data ! bikeSharing Package Scala Object BikeCountSort ! 37 Scala Object
  • 38. package practice1 //spark lib import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.rdd._ //log import org.apache.log4j.Logger import org.apache.log4j.Level object WordCount { def main(args: Array[String]): Unit = { // Log Console Logger.getLogger(org).setLevel(Level.ERROR) //mark for MLlib INFO msg val sc = new SparkContext(new SparkConf().setAppName(WordCount).setMaster(local[*])) val rawRdd = sc.textFile(data/gettysburg.txt).flatMap { x=>x.split( ) } // (toLowerCase ) ( filter ) val txtRdd = rawRdd.map { x => x.toLowerCase.trim }.filter { x => !x.equals() } val countRdd = txtRdd.map { x => (x, 1) } // 1) Map val resultRdd = countRdd.reduceByKey { case (a, b) => a + b } // ReduceByKey val sortResRdd = resultRdd.sortBy((x => x._2), false) // sortResRdd.take(5).foreach { println } // sortResRdd.saveAsTextFile(data/wc_output) } } 38 WordCount import Library object main saveAsTextFile
  • 39. ! word complete ALT / word complete ! ( tuple ) 39 IDE
  • 40. ! debug configuration ◦ icon Debug Configurations ◦ Scala Application Debug ● Name WordCount ● Project FirstScalaProj ● Main Class practice1.WordCount ◦ Launcher 40 Debug Configuration
  • 41. ! icon Debug Configuration Debug console 41 [Tips] • data/output sortResRdd ( part-xxxx ) • Log Level console • output
  • 42. ! Spark-Submit JAR ◦ Package Explorer FirstScalaProj - >Export...->Java/JAR file-> FirstScalaProj src JAR File 42 JAR
  • 43. ! input output JAR File ◦ data JAR File 43 Spark-submit
  • 45. ! Command Line JAR File ! Spark-submit ./bin/spark-submit --class <main-class> (package scala object ) --master <master-url> ( master URL local[Worker thread num]) --deploy-mode <deploy-mode> ( Worker Cluster Client Client) --conf <key>=<value> ( Spark ) ... # other options <application-jar> (JAR ) [application-arguments] ( Driver main ) 45 Spark-submit submit JOB Spark /bin/spark-submit --class practice1.WordCount -- master local[*] WordCount.jar [Tips]: ! spark-submit JAR data ! merge output ◦ linux: cat data/output/part-* > res.txt ◦ windows: type dataoutputpart-* > res.txt
  • 46. ! Exercise wordCount Package WordCount2 Object ◦ gettysburg.txt ( ) ● ◦ ● Hint1: (index) ● val posRdd=txtRdd.zipWithIndex() ● Hint2: reduceByKey groupByKey index 46 Word Count
  • 47. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 47 Outline
  • 50. ! (Supervised learning) ◦ (Training Set) ◦ (Features) (Label) ◦ Regression Classification ( ) 50 http://en.proft.me/media/science/ml_svlw.jpg
  • 51. ! (Unsupervised learning) ◦ ( Label) ◦ ◦ Clustering ( KMeans) 51http://www.cnblogs.com/shishanyuan/p/4747761.html
  • 52. ! MLlib Machine Learning library Spark ! ◦ RDD ◦ 52 Spark MLlib http://www.cnblogs.com/shishanyuan/p/4747761.html
  • 54. ! Bike Sharing Dataset ( ) ! https://archive.ics.uci.edu/ml/datasets/ Bike+Sharing+Dataset ◦ ● hour.csv: 2011.01.01~2012.12.30 17,379 ● day.csv: hour.csv 54 Spark MLlib Let’s biking
  • 55. 55 Bike Sharing Dataset Features Label (for hour.csv only) (0 to 6) (1 to 4)
  • 56. ! ◦ (Summary Statistics): MultivariateStatisticalSummary Statistics ◦ Feature ( ) Label ( ) (correlation) Statistics ! ◦ Clustering KMeans ! ◦ Classification Decision Tree LogisticRegressionWithSGD ! ◦ Regression Decision Tree LinearRegressionWithSGD 56 Spark MLlib
  • 57. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 57 Outline
  • 58. 58 ! (Summary Statistics) ◦ ◦ ◦Spark ● 1: RDD[Double/ Float/Int] RDD stats ● 2: RDD[Vector] Statistics.colStats
  • 59. 59 ! (correlation) ◦ (Correlation ) ◦ Spark Pearson Spearman ◦ r Statistics.corr ● 0 < | r | < 0.3 ( ) ● 0.3 <= | r | < 0.7 ( ) ● 0.7 <= | r | < 1 ( ) ● r = 1 ( )
  • 60. 60 A. Scala B. Package Scala Object C. data Folder D. Library
  • 61. ! ScalaIDE Scala folder package Object ◦ SummaryStat ( ) ● src ● bike (package ) ● BikeSummary (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 61
  • 62. 62 A. import B. main Driver Program C. Log D. SparkContext
  • 63. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat.{ MultivariateStatisticalSummary, Statistics } object BikeSummary { def main(args: Array[String]): Unit = { Logger.getLogger(com).setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]")) } } ! spark-shell sparkContext sc ! Driver Program sc ◦ appName - Driver Program ◦ master - master URL 63
  • 64. 64 ! prepare ◦ input file Features Label RDD
  • 65. ! lines.map features( 3~14 ) label( 17 ) RDD ! RDD : ◦ RDD[Array] ◦ RDD[Tuple] ◦ RDD[BikeShareEntity] prepare def prepare(sc: SparkContext): RDD[???] = { val rawData=sc.textFile(data/hour.csv) //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(, ).map { x => x.trim() } } //split columns with comma val bikeData:RDD[???]=lines.map{ … } //??? depends on your impl } 65
  • 66. ! RDD[Array]: ◦ val bikeData:RDD[Array[Double]] =lines. map{x=>(x.slice(3,13).map(_.toDouble) ++ Array(x(16).toDouble))} ◦ 利弊: prepare實作容易,後面用起來痛苦(要記欄位在Array中的 index),也容易出包 ! RDD[Tuple]: ◦ val bikeData:RDD[(Double, Double, Double, …, Double)] =lines.map{case(season,yr,mnth,…,cnt)=>(season.toDouble, yr.toDouble, mnth.toDouble,…cnt.toDouble)} ◦ 利弊: prepare實作較不易,後面用起來痛苦,比較不會出包(可用較 佳的變數命名來接回傳值) ◦ 例: val features = bikeData.map{case(season,yr,mnth,…,cnt)=> (season, yr, math, …, windspeed)} 66
  • 67. ! RDD[ Class] : ◦ val bikeData:RDD[BikeShareEntity] = lines.map{ x=> BikeShareEntity(⋯)} ◦ 利弊: prepare實作痛苦,後面用起來快樂(用entity物件操作,不 用管欄位位置、抽象化),不易出包 ◦ 例: val labelRdd = bikeData.map{ ent => { ent.label }} Case Class Class case class BikeShareEntity(instant: String,dteday:String,season:Double, yr:Double,mnth:Double,hr:Double,holiday:Double,weekday:Double, workingday:Double,weathersit:Double,temp:Double, atemp:Double,hum:Double,windspeed:Double,casual:Double, registered:Double,cnt:Double) 67 map RDD[BikeShareEntity] val bikeData = rawData.map { x => BikeShareEntity(x(0), x(1), x(2).toDouble, x(3).toDouble,x(4).toDouble, x(5).toDouble, x(6).toDouble,x(7).toDouble,x(8).toDouble, x(9).toDouble,x(10).toDouble,x(11).toDouble,x(12).toDouble, x(13).toDouble,x(14).toDouble,x(15).toDouble,x(16).toDouble) }
  • 68. 68 ! (Class) ! prepare ◦ input file Features Label RDD
  • 69. Entity Class //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat. { MultivariateStatisticalSummary, Statistics } object BikeSummary { case class BikeShareEntity(⋯⋯) def main(args: Array[String]): Unit = { Logger.getLogger(com).setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]")) } } 69
  • 70. 70 ! getFeatures method ◦ picks the fields to be summarized ! printSummary method ◦ computes the summary statistics and prints them to the console ! printCorrelation method ◦ computes the correlation coefficients and prints them to the console
  • 71. printSummary def printSummary(entRdd: RDD[BikeShareEntity]) = { val dvRdd = entRdd.map { x => Vectors.dense(getFeatures(x)) } //convert to RDD[Vector] //use Statistics.colStats to compute the column summary statistics val summaryAll = Statistics.colStats(dvRdd) println("mean: " + summaryAll.mean.toArray.mkString(",")) //column means println("variance: " + summaryAll.variance.toArray.mkString(",")) //column variances } 71 getFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = { //fields to summarize val featureArr = Array(bikeData.casual, bikeData.registered,bikeData.cnt) featureArr }
  • 72. 72 printCorrelation def printCorrelation(entRdd: RDD[BikeShareEntity]) = { //extract each field as an RDD[Double] val cntRdd = entRdd.map { x => x.cnt } val yrRdd = entRdd.map { x => x.yr } //correlation between yr and cnt val yrCorr = Statistics.corr(yrRdd, cntRdd) println("correlation: %s vs %s: %f".format("yr", "cnt", yrCorr)) val seaRdd = entRdd.map { x => x.season } //season field val seaCorr = Statistics.corr(seaRdd, cntRdd) println("correlation: %s vs %s: %f".format("season", "cnt", seaCorr)) }
  • 73. A. Run the sample ◦ add BikeSummary.scala to the SummaryStat project ◦ put hour.csv in the data folder ◦ run BikeSummary (complete the TODO parts) B. Exercise ◦ modify getFeatures and printSummary ● print to the console the summary statistics of temperature (temp), humidity (hum) and wind speed (windspeed) ● for each yr and mnth, print to the console the summary statistics of (temp), (hum), (windspeed) and the rental count (cnt) 73 for (yr <- 0 to 1) for (mnth <- 1 to 12) { val yrMnRdd = entRdd.filter { ??? }.map { x => Vectors.dense(getFeatures(x)) } val summaryYrMn = Statistics.colStats( ??? ) println("====== summary yr=%d, mnth=%d ==========".format(yr,mnth)) println("mean: " + ???) println("variance: " + ???) }
  • 74. A. Exercise ◦ complete printCorrelation in BikeSummary ◦ print to the console the correlation between each hour.csv field from yr to windspeed and cnt B. Derive a new feature ◦ extend printCorrelation ● combine yr and mnth into a new feature (yrmo, where yrmo=yr*12+mnth) and print the correlation between yrmo and cnt 74
  • 75. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 75 Outline
  • 76. ! Clustering is unsupervised learning: the Training Set has no Label ! It partitions the data into groups (clusters) ! Observations in the same cluster are more similar to each other than to those in other clusters ! Typical applications ◦ e.g. grouping similar users or items 76 Clustering
  • 77. ! K-Means ! Given observations (x1,x2,...,xn), K-Means partitions the n observations into K clusters (k≤n) so as to minimize the within-cluster sum of squares (WCSS) ! Algorithm: A. pick K initial cluster centers B. assign each observation to its nearest center C. recompute each cluster center (as the mean of its members) D. repeat B and C until the assignments stop changing or the iteration limit is reached 77 One pass of B~C is an iteration; a complete training from fresh initial centers is a RUN
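For reference, the objective K-Means minimizes can be written as (where C_i is the i-th cluster and \mu_i its center):
\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2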
  • 79. ! KMeans.train returns the trained Model (KMeansModel) ◦ val model=KMeans.train(data, numClusters, maxIterations, runs) ● data: the training data (RDD[Vector]) ● numClusters: the number of clusters (K) ● maxIterations: the maximum number of iterations per run; a run stops once it converges or reaches maxIterations ● runs: how many times KMeans is trained from different initial centers; the best run's model is returned ! model.clusterCenters returns the Feature vector at the center of each cluster ! model.computeCost returns the WCSS, which can be used to evaluate the model 79 K-Means in Spark MLlib
  • 80. 80 K-Means on the BikeSharing data ! Cluster hour.csv with KMeans and print the cluster centers to the console ◦ Features: yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, cnt (note that cnt, normally the Label, is used here as a Feature) ◦ numClusters 5 (group the data into 5 clusters) ◦ maxIterations 20 (each run performs at most 20 iterations) ◦ runs 3 (train 3 Runs and keep the best model)
  • 81. 81 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 82. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Clustering (project name) ● src ● bike (package name) ● BikeShareClustering (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 82
  • 83. 83 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 84. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import KMeans library import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } object BikeShareClustering { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]")) } } ! import the KMeans Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 84
  • 85. 85 ! Define the entity (Case Class) ! prepare method ◦ reads the input file and transforms it into an RDD containing the Features and the Label ! Both can be reused from BikeSummary
  • 86. 86 ! getFeatures method ◦ picks the fields used for clustering ! Train with KMeans ◦ call KMeans.train to obtain a KMeansModel ! getDisplayString method ◦ formats a cluster center for printing
  • 87. getFeatures getDisplayString getFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.cnt, bikeData.yr, bikeData.season, bikeData.mnth, bikeData.hr, bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed, bikeData.casual, bikeData.registered) featureArr } 87 getDisplayString def getDisplayString(centers:Array[Double]): String = { val dispStr = """cnt: %.5f, yr: %.5f, season: %.5f, mnth: %.5f, hr: %.5f, holiday: %.5f, weekday: %.5f, workingday: %.5f, weathersit: %.5f, temp: %.5f, atemp: %.5f, hum: %.5f,windspeed: %.5f, casual: %.5f, registered: %.5f""" .format(centers(0), centers(1),centers(2), centers(3),centers(4), centers(5),centers(6), centers(7),centers(8), centers(9),centers(10), centers(11),centers(12), centers(13),centers(14)) dispStr }
  • 88. KMeans //convert the Features into an RDD[Vector] val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) //K=5, at most 20 Iterations, 3 Runs 88 var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => { println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray))) clusterIdx += 1 } } //cluster centers sorted by Cnt
  • 89. 89 //K-Means import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } object BikeShareClustering { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]")) println("============== preparing data ==================") val bikeData = prepare(sc) //read hour.csv into RDD[BikeShareEntity] bikeData.persist() println("============== clusting by KMeans ==================") //convert the Features into an RDD[Vector] val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) //K=5, at most 20 Iterations, 3 Runs var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => { println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray))) clusterIdx += 1 } } //cluster centers sorted by Cnt bikeData.unpersist() }
  • 90. 90 ! yr season mnth hr cnt ! weathersit cnt ( ) ! temp atemp cnt ( ) ! hum cnt ( ) ! correlation
  • 91. 91 ! Evaluate the Model: use WCSS to compare Models trained with different K ! WCSS shrinks as K grows, so choose K by looking at where the WCSS curve flattens out — tune K
  • 92. ! model.computeCost returns the WCSS of the model (the smaller the WCSS, the tighter the clusters) ! Vary numClusters (K) and compare the WCSS ! WCSS keeps decreasing as K increases 92 K-Means println("============== tuning parameters ==================") for (k <- Array(5,10,15,20, 25)) { //compare WCSS for different numClusters val iterations = 20 val tm = KMeans.train(featureRdd, k, iterations,3) println("k=%d, WCSS=%f".format(k, tm.computeCost(featureRdd))) } ============== tuning parameters ================== k=5, WCSS=89540755.504054 k=10, WCSS=36566061.126232 k=15, WCSS=23705349.962375 k=20, WCSS=18134353.720998 k=25, WCSS=14282108.404025
  • 93. A. Run the sample ◦ add BikeShareClustering.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareClustering (complete the TODO parts) B. Derive a new feature ◦ modify BikeClustering ● add yrmo to getFeatures, rerun KMeans, and observe the yrmo values of the cluster centers on the console ● try larger numClusters (ex:50,75,100) 93 K-Means exercise
  • 94. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 94 Outline
  • 95. ! Classification can be binary (Binary Classification) or multi-class (Multi-Class Classification) ! It is supervised learning: the model is trained on labeled data ! Common algorithms ◦ logistic regression, decision trees, naive Bayes 95
  • 97. ! import org.apache.spark.mllib.tree.DecisionTree ! import org.apache.spark.mllib.tree.model.DecisionTreeModel ! DecisionTree.trainClassifier returns the trained Model (DecisionTreeModel) ◦ val model=DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins) ● trainData: the training data, an RDD[LabeledPoint] ● numClasses: number of classes, 2 for binary classification ● categoricalFeaturesInfo: declares which features in trainData are categorical, as Map[feature index, number of categories]; features not listed are treated as continuous ● e.g. Map(0->2,4->10) means the 1st and 5th features are categorical with 2 and 10 distinct values ● impurity: the split criterion (Gini or Entropy) ● maxDepth: maximum tree depth; too deep tends to overfit ● maxBins: maximum number of bins used when splitting features ● maxBins must be at least the largest category count declared in categoricalFeaturesInfo 97 Decision Tree in Spark MLlib
  • 98. ! Goal: predict whether an hour is a high-rental hour ! Turn the rental count into a binary label with a threshold ◦ Features: yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed ◦ Label: 1 if cnt > 200, otherwise 0 ◦ numClasses 2 ◦ impurity gini ◦ maxDepth 5 ◦ maxBins 30 98
  • 99. 99 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 100. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Classification (project name) ● src ● bike (package name) ● BikeShareClassificationDT (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 100
  • 101. 101 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 102. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import decision tree library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareClassificationDT { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationDT").setMaster("local[*]")) } } ! import the Decision Tree Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 102
  • 103. 103 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryInfo method ◦ builds the categoryInfoMap describing the categorical features
  • 104. 104 prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity] val lpData=bikeData.map { x => { val label = if (x.cnt > 200) 1 else 0 //1 if cnt is greater than 200, otherwise 0 val features = Vectors.dense(getFeatures(x)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 105. 105 getFeatures and getCategoryInfo getFeatures method def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1, bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr } //categorical features such as season are shifted by 1 so that their values start from 0 getCategoryInfo method def getCategoryInfo(): Map[Int, Int]= { val categoryInfoMap = Map[Int, Int]( (/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4)) categoryInfoMap } //(index of the feature in featureArr, number of distinct values)
  • 106. 106 ! trainModel method ◦ calls DecisionTree.trainClassifier to train the Model ! evaluateModel method ◦ computes the AUC of the Model produced by trainModel
  • 107. 107 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int,cateInfo: Map[Int, Int]): (DecisionTreeModel, Double) = { val startTime = new DateTime() //start time val model = DecisionTree.trainClassifier(trainData, 2, cateInfo, impurity, maxDepth, maxBins) //train the Model val endTime = new DateTime() //end time val duration = new Duration(startTime, endTime) //training time //MyLogger.debug(model.toDebugString) //dump the Decision Tree structure (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the AUC } val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auc = metrics.areaUnderROC() //use areaUnderROC to get the auc auc }
  • 108. 108 ! tuneParameter method ◦ tries combinations of impurity, Max Depth and Max Bin, calling trainModel and evaluateModel for each, and compares the resulting AUC
  • 109. 109 AUC(Area under the Curve of ROC) Confusion matrix — actual Positive (Label 1): predicted Positive → true positive(TP), predicted Negative → false negative(FN); actual Negative (Label 0): predicted Positive → false positive(FP), predicted Negative → true negative(TN) ! True Positive Rate (TPR): the fraction of actual 1s that are predicted as 1 ◦ TPR=TP/(TP+FN) ! False Positive Rate (FPR): the fraction of actual 0s that are predicted as 1 ◦ FPR=FP/(FP+TN)
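A quick worked example with made-up counts (TP=40, FN=10, FP=20, TN=30), just to show how the two rates are read off the confusion matrix:
TPR = \frac{TP}{TP+FN} = \frac{40}{40+10} = 0.8 \qquad FPR = \frac{FP}{FP+TN} = \frac{20}{20+30} = 0.4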
  • 110. ! Plotting FPR on the X axis against TPR on the Y axis at different thresholds gives the ROC curve ! AUC is the area under the ROC curve 110 AUC = 1: perfect prediction (100% correct); 0.5 < AUC < 1: better than random guessing; AUC = 0.5: no better than random guessing; AUC < 0.5: worse than random guessing AUC(Area under the Curve of ROC)
  • 111. 111 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint], cateInfo: Map[Int, Int]) = { val impurityArr = Array("gini", "entropy") val depthArr = Array(3, 5, 10, 15, 20, 25) val binsArr = Array(50, 100, 200) val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield { //train a model for each parameter combination and compute its AUC val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo) val auc = evaluateModel(validateData, model) println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f" .format(impurity, maxDepth, maxBins, auc)) (impurity, maxDepth, maxBins, auc) } val bestEval = (evalArr.sortBy(_._4).reverse)(0) //the combination with the highest AUC println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 112. 112 Decision Tree //MLlib lib import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.evaluation._ import org.apache.spark.mllib.linalg.Vectors //decision tree import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareClassificationDT { case class BikeShareEntity(…) // case class def main(args: Array[String]): Unit = { MyLogger.setLogger val doTrain = (args != null && args.length > 0 && "Y".equals(args(0))) val sc = new SparkContext(new SparkConf().setAppName("ClassificationDT").setMaster("local[*]")) println("============== preparing data ==================") val (trainData, validateData) = prepare(sc) val cateInfo = getCategoryInfo() if (!doTrain) { println("============== train Model (CateInfo)==================") val (modelC, durationC) = trainModel(trainData, "gini", 5, 30, cateInfo) val aucC = evaluateModel(validateData, modelC) println("validate auc(CateInfo)=%f".format(aucC)) } else { println("============== tuning parameters(cateInfo) ==================") tuneParameter(trainData, validateData, cateInfo) } } }
  • 113. A. Run the sample ◦ add BikeShareClassificationDT.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareClassificationDT (complete the TODO parts) B. Tune the features ◦ modify BikeShareClassificationDT ● compare the AUC with and without the category info ● keep only the features with |correlation| > 0.1 and compare the Model's AUC 113 Decision Tree exercise ============== tuning parameters(cateInfo) ================== parameter: impurity=gini, maxDepth=3, maxBins=50, auc=0.835524 parameter: impurity=gini, maxDepth=3, maxBins=100, auc=0.835524 parameter: impurity=gini, maxDepth=3, maxBins=200, auc=0.835524 parameter: impurity=gini, maxDepth=5, maxBins=50, auc=0.851846 parameter: impurity=gini, maxDepth=5, maxBins=100, auc=0.851846 parameter: impurity=gini, maxDepth=5, maxBins=200, auc=0.851846
  • 114. ! Simple linear regression (y=ax+b) predicts a continuous value (y) ◦ from an explanatory variable (x) ! Logistic regression ◦ is used for classification rather than for predicting a continuous value ! It passes the linear output through an S-shaped (sigmoid) function to obtain a probability p; if p >= 0.5 predict the positive class, otherwise the negative class 114
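Written out, the sigmoid mapping mentioned above (w and b denote the learned weights and intercept):
p = \sigma(w^{T}x + b) = \frac{1}{1 + e^{-(w^{T}x + b)}}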
  • 115. ! import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel } ! LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) returns the trained Model (LogisticRegressionModel) ◦ val model=LogisticRegressionWithSGD.train(trainData,numIterations, stepSize, miniBatchFraction) ● trainData: the training data, an RDD[LabeledPoint] ● numIterations: number of gradient descent (SGD) iterations, default 100 ● stepSize: the SGD step size, default 1 ● miniBatchFraction: fraction of data used per iteration, 0~1, default 1 115 Logistic Regression in Spark http://www.csie.ntnu.edu.tw/~u91029/Optimization.html
  • 116. ! Before training LogisticRegression, each Categorical Feature should be one-of-k (one-hot) encoded ! One-of-K encoding: ◦ expand the feature into N columns (N = number of distinct values) ◦ set the column at the value's index to 1 and all the others to 0 116 Categorical Features — example weathersit: values Clear=1, Mist=2, Light Snow=3, Heavy Rain=4; index Map: weathersit 1→index 0, 2→index 1, 3→index 2, 4→index 3; Encoding: index 0→1000, index 1→0100, index 2→0010, index 3→0001
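A minimal sketch of the encoding idea on a plain Seq of weathersit values (the sample values are made up); the course's own getCategoryFeature method, shown a few slides later, does the same thing per record:
val weatherVals = Seq(1.0, 2.0, 3.0, 4.0, 2.0) // hypothetical weathersit column
val indexMap = weatherVals.distinct.zipWithIndex.toMap // value -> index, e.g. 2.0 -> 1
def oneHot(v: Double): Array[Double] = {
  val arr = Array.ofDim[Double](indexMap.size) // all zeros
  arr(indexMap(v)) = 1.0 // set the value's own slot to 1
  arr
}
println(oneHot(2.0).mkString(",")) // prints 0.0,1.0,0.0,0.0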
  • 117. 117 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 118. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Classification (project name) ● src ● bike (package name) ● BikeShareClassificationLG (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 118
  • 119. 119 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 120. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import Logistic Regression libraries import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.feature.StandardScaler import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel } object BikeShareClassificationLG { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationLG").setMaster("local[*]")) } } ! import the Logistic Regression Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 120
  • 121. 121 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryFeature method ◦ 1-of-k encodes a categorical field into an Array[Double]
  • 122. One-Of-K def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.getField("weathersit") } .distinct().collect().zipWithIndex.toMap //build the value-to-index Map val lpData=bikeData.map { x => { val label = x.getLabel() val features = Vectors.dense(x.getFeatures(weatherMap)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} … } def getFeatures (weatherMap: Map[Double, Int])= { var rtnArr: Array[Double] = Array() var weatherArray:Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0) val index = weatherMap(getField("weathersit")) //weathersit=2; index=1 weatherArray(index) = 1 //weatherArray=Array(0,1,0,0) rtnArr = rtnArr ++ weatherArray …. }
  • 123. ! Standardize each feature so that it has zero mean and unit variance: (value - mean) / standard deviation ! Use StandardScaler def prepare(sc): RDD[LabeledPoint] = { … val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) } //fit the StandardScaler on the full feature RDD val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd ) val lpData2= bikeData.map { x => { val label = x.getLabel() //standardize the features before building the LabeledPoint val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap))) new LabeledPoint(label, features) } } …
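For reference, the transformation StandardScaler applies to each feature column (\mu and \sigma are that column's mean and standard deviation):
z = \frac{x - \mu}{\sigma}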
  • 124. prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value-to-index Map //Standardize val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))} val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap) //standardize the features, including the encoded Category features val lpData = bikeData.map { x => { val label = if (x.cnt > 200) 1 else 0 //1 if cnt is greater than 200, otherwise 0 val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))) new LabeledPoint(label, features) }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 125. 125 getFeatures getFeatures method def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int], mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int], weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int], weatherMap: Map[Double, Int]): Array[Double] = { var featureArr: Array[Double] = Array() //categorical features, 1-of-k encoded featureArr ++= getCategoryFeature(bikeData.yr, yrMap) featureArr ++= getCategoryFeature(bikeData.season, seasonMap) featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap) featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap) featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap) featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap) featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap) featureArr ++= getCategoryFeature(bikeData.hr, hrMap) //continuous features featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr }
  • 126. 126 getCategoryFeature getCategoryFeature method def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = { var featureArray = Array.ofDim[Double](categoryMap.size) //all zeros val index = categoryMap(fieldVal) featureArray(index) = 1 //set the value's own slot to 1 featureArray }
  • 127. 127 ! trainModel method ◦ calls LogisticRegressionWithSGD.train to train the Model ! evaluateModel method ◦ computes the AUC of the Model produced by trainModel
  • 128. 128 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): (LogisticRegressionModel, Double) = { val startTime = new DateTime() //train with LogisticRegressionWithSGD.train val model = LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) val endTime = new DateTime() val duration = new Duration(startTime, endTime) //MyLogger.debug(model.toPMML()) //dump the model for debugging (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: LogisticRegressionModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the AUC } val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auc = metrics.areaUnderROC() //use areaUnderROC to get the auc auc }
  • 129. 129 ! tuneParameter method ◦ tries combinations of iteration, stepSize and miniBatchFraction, calling trainModel and evaluateModel for each, and compares the resulting AUC
  • 130. 130 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = { val iterationArr: Array[Int] = Array(5, 10, 20, 60,100) val stepSizeArr: Array[Double] = Array(10, 50, 100, 200) val miniBatchFractionArr: Array[Double] = Array(0.5,0.8, 1) val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr; miniBatchFraction <- miniBatchFractionArr) yield { //train a model for each parameter combination and compute its AUC val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction) val auc = evaluateModel(validateData, model) println("parameter: iteraion=%d, stepSize=%f, batchFraction=%f, auc=%f" .format(iteration, stepSize, miniBatchFraction, auc)) (iteration, stepSize, miniBatchFraction, auc) } val bestEval = (evalArr.sortBy(_._4).reverse)(0) //the combination with the highest AUC println("best parameter: iteraion=%d, stepSize=%f, batchFraction=%f, auc=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 131. A. Run the sample ◦ add BikeShareClassificationLG.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareClassificationLG (complete the TODO parts) B. Tune the features ◦ modify BikeShareClassificationLG ● compare the AUC with and without the category encoding ● keep only the features with |correlation| > 0.1 and compare the Model's AUC 131 Logistic Regression exercise ============== tuning parameters(Category) ================== parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.500000, auc=0.857073 parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.800000, auc=0.855904 parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=1.000000, auc=0.855685 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.500000, auc=0.852388 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.800000, auc=0.852901 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=1.000000, auc=0.853237 parameter: iteraion=5, stepSize=100.000000, miniBatchFraction=0.500000, auc=0.852087
  • 132. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 132 Outline
  • 133. ! Regression predicts a continuous value rather than a class ! It is supervised learning ! Common algorithms ◦ least squares (Least Squares), Lasso, ridge regression 133
  • 134. ! import org.apache.spark.mllib.tree.DecisionTree ! import org.apache.spark.mllib.tree.model.DecisionTreeModel ! DecisionTree.trainRegressor returns the trained Model (DecisionTreeModel) ◦ val model=DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity, maxDepth, maxBins) ● trainData: the training data, an RDD[LabeledPoint] ● categoricalFeaturesInfo: declares which features in trainData are categorical, as Map[feature index, number of categories]; features not listed are treated as continuous ● e.g. Map(0->2,4->10) means the 1st and 5th features are categorical with 2 and 10 distinct values ● impurity: the split criterion (only variance is supported for regression) ● maxDepth: maximum tree depth; too deep tends to overfit ● maxBins: maximum number of bins used when splitting features ● maxBins must be at least the largest category count declared in categoricalFeaturesInfo 134 Decision Tree Regression in Spark
  • 135. ! Train a Model that predicts the rental count ◦ Features: yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed ◦ Label: cnt ◦ impurity variance ◦ maxDepth 5 ◦ maxBins 30 135
  • 136. 136 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 137. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Regression (project name) ● src ● bike (package name) ● BikeShareRegressionDT (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 137
  • 138. 138 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 139. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import decision tree library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareRegressionDT { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionDT").setMaster("local[*]")) } } ! import the Decision Tree Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 139
  • 140. 140 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryInfo method ◦ builds the categoryInfoMap describing the categorical features
  • 141. 141 prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity] val lpData=bikeData.map { x => { val label = x.cnt //the prediction target is the rental count column val features = Vectors.dense(getFeatures(x)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 142. 142 getFeatures and getCategoryInfo getFeatures method def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1, bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr } //categorical features such as season are shifted by 1 so that their values start from 0 getCategoryInfo method def getCategoryInfo(): Map[Int, Int]= { val categoryInfoMap = Map[Int, Int]( (/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4)) categoryInfoMap } //(index of the feature in featureArr, number of distinct values)
  • 143. 143 ! trainModel method ◦ calls DecisionTree.trainRegressor to train the Model ! evaluateModel method ◦ computes the RMSE of the Model produced by trainModel
  • 144. ! The root-mean-square deviation, also called the root-mean-square error (RMSE), measures how far the predictions fall from the observed values ! It is the sample standard deviation of the prediction errors ! The smaller the RMSE, the better the model 144 RMSE(root-mean-square error)
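Written out, with \hat{y}_i the predicted and y_i the observed rental count over n validation samples:
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}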
  • 145. 145 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int, cateInfo: Map[Int,Int]): (DecisionTreeModel, Double) = { val startTime = new DateTime() //start time val model = DecisionTree.trainRegressor(trainData, cateInfo, impurity, maxDepth, maxBins) //train the Model val endTime = new DateTime() //end time val duration = new Duration(startTime, endTime) //training time //MyLogger.debug(model.toDebugString) //dump the Decision Tree structure (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the RMSE } val metrics = new RegressionMetrics(scoreAndLabels) val rmse = metrics.rootMeanSquaredError //use rootMeanSquaredError to get the rmse rmse }
  • 146. 146 ! tuneParameter method ◦ tries combinations of Max Depth and Max Bin, calling trainModel and evaluateModel for each, and compares the resulting RMSE
  • 147. 147 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint], cateInfo: Map[Int, Int]) = { val impurityArr = Array("variance") val depthArr = Array(3, 5, 10, 15, 20, 25) val binsArr = Array(50, 100, 200) val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield { //train a model for each parameter combination and compute its RMSE val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo) val rmse = evaluateModel(validateData, model) println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f" .format(impurity, maxDepth, maxBins, rmse)) (impurity, maxDepth, maxBins, rmse) } val bestEvalAsc = (evalArr.sortBy(_._4)) val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 148. A. Run the sample ◦ add BikeShareRegressionDT.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareRegressionDT.scala (complete the TODO parts) B. Derive a new feature ◦ modify BikeShareRegressionDT ● add a new feature dayType (Double), where ● holiday=0 and workingday=0 → dayType=0 ● holiday=1 → dayType=1 ● holiday=0 and workingday=1 → dayType=2 ● add dayType to the features and retrain the Model (update getFeatures and getCategoryInfo) ◦ remember to update the Categorical Info accordingly 148 Decision Tree exercise ============== tuning parameters(CateInfo) ================== parameter: impurity=variance, maxDepth=3, maxBins=50, rmse=118.424606 parameter: impurity=variance, maxDepth=3, maxBins=100, rmse=118.424606 parameter: impurity=variance, maxDepth=3, maxBins=200, rmse=118.424606 parameter: impurity=variance, maxDepth=5, maxBins=50, rmse=93.138794 parameter: impurity=variance, maxDepth=5, maxBins=100, rmse=93.138794 parameter: impurity=variance, maxDepth=5, maxBins=200, rmse=93.138794
  • 150. ! import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LinearRegressionModel} ! LinearRegressionWithSGD.train(trainData, numIterations, stepSize) returns the trained Model (LinearRegressionModel) ◦ val model=LinearRegressionWithSGD.train(trainData, numIterations, stepSize) ● trainData: the training data, an RDD[LabeledPoint] ● numIterations: number of gradient descent (SGD) iterations ● stepSize: the SGD step size, default 1; tune the stepSize carefully ● miniBatchFraction: fraction of data used per iteration, 0~1, default 1 150 Least Squares Regression in Spark
  • 151. 151 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 152. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Regression (project name) ● src ● bike (package name) ● BikeShareRegressionLR (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 152
  • 153. 153 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 154. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import linear regression library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.feature.StandardScaler import org.apache.spark.mllib.regression.{ LinearRegressionWithSGD, LinearRegressionModel } object BikeShareRegressionLR { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionLR").setMaster("local[*]")) } } ! import the Linear Regression Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 154
  • 155. 155 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryFeature method ◦ 1-of-k encodes a categorical field into an Array[Double]
  • 156. One-Of-K def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.getField("weathersit") } .distinct().collect().zipWithIndex.toMap //build the value-to-index Map val lpData=bikeData.map { x => { val label = x.getLabel() val features = Vectors.dense(x.getFeatures(weatherMap)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} … } def getFeatures (weatherMap: Map[Double, Int])= { var rtnArr: Array[Double] = Array() var weatherArray:Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0) val index = weatherMap(getField("weathersit")) //weathersit=2; index=1 weatherArray(index) = 1 //weatherArray=Array(0,1,0,0) rtnArr = rtnArr ++ weatherArray …. }
  • 157. ! Standardize each feature so that it has zero mean and unit variance: (value - mean) / standard deviation ! Use StandardScaler def prepare(sc): RDD[LabeledPoint] = { … val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) } //fit the StandardScaler on the full feature RDD val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd ) val lpData2= bikeData.map { x => { val label = x.getLabel() //standardize the features before building the LabeledPoint val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap))) new LabeledPoint(label, features) } } …
  • 158. prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value-to-index Map //Standardize val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))} val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap) //standardize the features, including the encoded Category features val lpData = bikeData.map { x => { val label = x.cnt //the prediction target is the rental count val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))) new LabeledPoint(label, features) }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 159. 159 getFeatures getFeatures method def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int], mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int], weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int], weatherMap: Map[Double, Int]): Array[Double] = { var featureArr: Array[Double] = Array() //categorical features, 1-of-k encoded featureArr ++= getCategoryFeature(bikeData.yr, yrMap) featureArr ++= getCategoryFeature(bikeData.season, seasonMap) featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap) featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap) featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap) featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap) featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap) featureArr ++= getCategoryFeature(bikeData.hr, hrMap) //continuous features featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr }
  • 160. 160 getCategoryFeature getCategoryFeature method def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = { var featureArray = Array.ofDim[Double](categoryMap.size) //all zeros val index = categoryMap(fieldVal) featureArray(index) = 1 //set the value's own slot to 1 featureArray }
  • 161. 161 ! trainModel method ◦ calls LinearRegressionWithSGD.train to train the Model ! evaluateModel method ◦ computes the RMSE of the Model produced by trainModel
  • 162. 162 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): (LinearRegressionModel, Double) = { val startTime = new DateTime() //train with LinearRegressionWithSGD.train val model = LinearRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) val endTime = new DateTime() val duration = new Duration(startTime, endTime) //MyLogger.debug(model.toPMML()) //dump the model for debugging (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: LinearRegressionModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the RMSE } val metrics = new RegressionMetrics(scoreAndLabels) val rmse = metrics.rootMeanSquaredError //use rootMeanSquaredError to get the rmse rmse }
  • 163. 163 ! tuneParameter method ◦ tries combinations of iteration, stepSize and miniBatchFraction, calling trainModel and evaluateModel for each, and compares the resulting RMSE
  • 164. 164 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = { val iterationArr: Array[Int] = Array(5, 10, 20, 60,100) val stepSizeArr: Array[Double] = Array(10, 50, 100, 200) val miniBatchFractionArr: Array[Double] = Array(0.5,0.8, 1) val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr; miniBatchFraction <- miniBatchFractionArr) yield { //train a model for each parameter combination and compute its RMSE val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction) val rmse = evaluateModel(validateData, model) println("parameter: iteraion=%d, stepSize=%f, batchFraction=%f, rmse=%f" .format(iteration, stepSize, miniBatchFraction, rmse)) (iteration, stepSize, miniBatchFraction, rmse) } val bestEvalAsc = (evalArr.sortBy(_._4)) val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE println("best parameter: iteraion=%d, stepSize=%f, batchFraction=%f, rmse=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 165. A. Run the sample ◦ add BikeShareRegressionLR.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareRegressionLR.scala (complete the TODO parts) B. Derive a new feature ◦ modify BikeShareRegressionLR ● add a new feature dayType (Double), where ● holiday=0 and workingday=0 → dayType=0 ● holiday=1 → dayType=1 ● holiday=0 and workingday=1 → dayType=2 ● add dayType to the features and retrain the Model (update getFeatures and getCategoryInfo) 165 Linear Regression exercise ============== tuning parameters(Category) ================== parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.500000, rmse=256.370620 parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.800000, rmse=256.376770 parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=1.000000, rmse=256.407185 parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.500000, rmse=250.037095 parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.800000, rmse=250.062817 parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=1.000000, rmse=250.126173
  • 166. ! A Random Forest is an ensemble of a multitude of Decision Trees ◦ for classification, the output is the mode of the individual trees' predictions ◦ for regression, the output is the mean of the individual trees' predictions ! Advantages ◦ less prone to overfit than a single tree ◦ tolerates missing values 166 (RandomForest)
  • 167. ! import org.apache.spark.mllib.tree.RandomForest ! import org.apache.spark.mllib.tree.model.RandomForestModel ! RandomForest.trainRegressor returns the trained Model (RandomForestModel) ◦ val model=RandomForest.trainRegressor(trainData, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins) ● trainData: the training data, an RDD[LabeledPoint] ● categoricalFeaturesInfo: declares which features in trainData are categorical, as Map[feature index, number of categories]; features not listed are treated as continuous ● e.g. Map(0->2,4->10) means the 1st and 5th features are categorical with 2 and 10 distinct values ● numTrees: number of trees in the forest ● impurity: the split criterion (only variance is supported for regression) ● featureSubsetStrategy: how many Features each tree considers ("auto" lets Spark decide) ● maxDepth: maximum tree depth; too deep tends to overfit ● maxBins: maximum number of bins used when splitting features ● maxBins must be at least the largest category count declared in categoricalFeaturesInfo 167 Random Forest Regression in Spark
  • 168. 168 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (RandomForestModel, Double) = { val startTime = new DateTime() //start time val cateInfo = BikeShareEntity.getCategoryInfo(true) //build the categoricalFeaturesInfo val model = RandomForest.trainRegressor(trainData, cateInfo, 3, "auto", impurity, maxDepth, maxBins) //train the Model with 3 trees val endTime = new DateTime() //end time val duration = new Duration(startTime, endTime) //training time //MyLogger.debug(model.toDebugString) //dump the Decision Trees (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: RandomForestModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the RMSE } val metrics = new RegressionMetrics(scoreAndLabels) val rmse = metrics.rootMeanSquaredError //use rootMeanSquaredError to get the rmse rmse }
  • 169. ! [Exercise] ◦ import Regression.zip (Package, Object, data folder, Build Path) into Scala IDE ◦ complete BikeShareRegressionRF ◦ compare the RandomForest results with the Decision Tree results 169 RandomForest Regression
  • 170. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ! 170 Outline
  • 171. ! Etu & udn Hadoop Competition 2016 
 ! ETU udn udn ( Open Data) 

  • 172. ! EHC task: given the browsing (View), purchase (Order) and member (Member) data from 2015/6 ~ 2015/10, predict which store (Storeid) and categories (Cat_id1, Cat_id2) each member will order from in 2015/11 172
  • 173. 1) Turn the raw Data into Features and build the LabeledPoint Data ◦ Feature: the View/Order records from June to September ◦ Label: whether an Order happened in October (1 if yes, 0 otherwise) ◦ Feature engineering: … 2) Split the LabeledPoint Data into a Training Set and a Validating Set (6:4 Split) 3) Train the Machine Learning Model with the Training Set and evaluate it with the Validating Set 4) Build the Testing Set ◦ Feature: the View/Order records from June to October ◦ Features: built the same way as in 1) 5) Use the Model from 3) to predict on the Testing Set 6) Output the predictions 7) Iterate over 1) ~ 6) 173
  • 174. ! Build the Features per uid-storeid-cat_id1-cat_id2 key from the View/Order records ! (RFM-style features over the June~September View/Order data) ◦ View – viewRecent, viewCnt, viewLast1MCnt, viewLast2MCnt (recency of the last view, total views over June~September, views in the last one and two months) ◦ Order – orderRecent, orderCnt, orderLast1MCnt, orderLast2MCnt (recency of the last order, total orders over June~September, orders in the last one and two months) ◦ avgDaySpan, avgViewCnt, lastViewDate, lastViewCnt (average interval between views, average view count, date and count of the last view) 174 – Features(I)
  • 175. ! Member features ◦ gender, ageScore, cityScore (gender as-is, age and city encoded as scores) ◦ ageScore: 1~11 ● EX: if (ages.equals("20")) ageScore = 1 ◦ cityScore: 1~24 ● EX: if (livecity.equals( )) cityScore = 24 ! Missing Value handling ◦ fill missing fields with default values ● Gender: 2 ● Ages: 35-39 ● City: 175 – Features(II)
  • 176. ! Weather features (from the Central Weather Bureau monthly data) ◦ http://www.cwb.gov.tw/V7/climate/monthlyData/mD.htm ◦ monthly statistics for June~October ◦ processed file: https://drive.google.com/file/d/0B-b4FvCO9SYoN2VBSVNjN3F3a0U/view?usp=sharing ! In total, 35 Features per uid-storeid-cat_id1-cat_id2 key 176 – Features(III)
  • 177. 177 – Building the LabeledPoint Data: Sort each numeric feature and Encode it into N levels EX: viewCnt (encoded into 5 levels by its sorted value: the highest counts map to viewCnt=5, then 4, 3, 2, and the lowest map to viewCnt=1)
  • 178. ! Xgboost (Extreme Gradient Boosting) ◦ Input: the LabeledPoint Data (Training Set) ● 35 Features ● Label (1/0; Label=1 means an order was placed, 0 means none) ◦ Parameters: ● max_depth: maximum Tree depth ● nround: number of boosting rounds ● objective: binary:logistic (binary classification) ◦ Implementation: 178 – Machine Learning(I) val param = List("objective" -> "binary:logistic", "max_depth" -> 6) val model = XGBoost.train(trainSet, param, nround, 2, null, null)
  • 179. ! Xgboost ◦ Evaluate (with the validating Set): ● val predictRes = model.predict(validateSet) ● score the predictions with the F_measure ◦ Parameter Tuning: ● try max_depth=(5~10) and nround=(10~25) and compare the scores ● best found: max_depth=6, nround=10 179 – Machine Learning(II) Precision = 0.16669166166766647 F1 measure = 0.15969926394341 Accuracy = 0.15065655700028824 Micro recall = 0.21370309951060 Micro precision = 0.3715258082813 Micro F1 measure = 0.271333885
  • 180. ! Performance Improvement ◦ keep only the top-N Features reported as important by the model and drop the remaining Features 180 – Machine Learning(III) run time: 90000ms -> 72000ms (local mode)
  • 181. ! Running on a cluster with the yarn resource manager ◦ submit the JOB to the Workers with spark-submit 181 spark-submit --class ehc.RecommandV4 --deploy-mode cluster --master yarn ehcFinalV4.jar ! When creating the SparkContext, do not hard-code the master URL new SparkContext(new SparkConf().setAppName("ehcFinal051").setMaster("local[4]")) ➔ remove the SetMaster call (let spark-submit supply the master)
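A minimal sketch of the adjusted SparkContext construction for cluster submission, assuming the same app name; the master URL then comes from the spark-submit --master flag instead of the code:
val conf = new SparkConf().setAppName("ehcFinal051") // no setMaster here
val sc = new SparkContext(conf) // master is supplied by spark-submit (e.g. --master yarn)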
  • 182. 182 Spark-submit Run Script Sample ###### Script that submits the Spark Driver Program to the cluster (yarn Manager) via Spark Submit ###### ###### for linux-like system ######### # delete output on hdfs first `hadoop fs -rm -R -f /user/team007/data/output` # submit spark job echo -e "processing spark job" spark-submit --deploy-mode cluster --master yarn --jars lib/jcommon-1.0.23.jar,lib/joda-time-2.2.jar --class ehc.RecommandV4 ehcFinalV4.jar Y # write to result_yyyyMMddHHmmss.txt echo -e "write to outFile" hadoop fs -cat /user/team007/data/output/part-* > 'result_'`date +%Y%m%d%H%M%S`'.txt'
  • 184. ! The Input is not handled on a Single Node ◦ results produced by different Workers must be merged ◦ and Sorted by the uid-storeid-cat_id1-cat_id2 key ! F-Measure ◦ used to evaluate the Model ◦ computed with Spark's MultilabelMetrics 184 – val scoreAndLabels: RDD[(Array[Double], Array[Double])] = … val metrics = new MultilabelMetrics(scoreAndLabels) println(s"F1 measure = ${metrics.f1Measure}")
  • 185. ! Takeaways ! With Spark MLlib the model training itself is the easy part ◦ most of the effort goes into Feature Engineering ! Spark MLlib ◦ 185
  • 186. 186