Yung-Chuan Lee
2016.12.18
1
2
Laws [Data applications] are like sausages. It is better not to see them being made.
—Otto von Bismarck
! Spark
◦
● Spark
● Scala
● RDD
! LAB
◦ ~
● Spark Scala IDE
! Spark MLlib
◦ …
● Scala + lambda + Spark MLlib
● Clustering Classification Regression
3
! github page: https://github.com/yclee0418/sparkTeach
◦ installation: Spark
◦ codeSample: Spark
● exercise -
● https://github.com/yclee0418/sparkTeach/tree/master/codeSample/exercise
● final -
● https://github.com/yclee0418/sparkTeach/tree/master/codeSample/final
4
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
5
Outline
!
◦ 2020 44ZB(IDC 2013~2014)
◦
!
◦ MapReduce by Google(2004)
◦ Hadoop HDFS MapReduce by
Yahoo!(2005)
◦ Spark
Hadoop 10~1000 by AMPLab (2013)
! [ ]Spark
Hadoop
6
–
!
AMPLab
!
! API
◦ Java Scala Python R
! One Stack to rule them all
◦ SQL Streaming
◦ RDD
7
Spark
! Cluster Manager
◦ Standalone – Spark Manager
◦ Apache Mesos
◦ Hadoop YARN
8
Spark
! [exercise]Spark
◦ JDK 1.8
◦ spark-2.0.1.tgz(http://spark.apache.org/downloads.html)
◦ Terminal (for Mac)
● cd /Users/xxxxx/Downloads ( )
● tar -xvzf spark-2.0.1.tgz ( )
● sudo mv spark-2.0.1 /usr/local/spark (spark /usr/local)
● cd /usr/local/spark
● ./build/sbt package( spark 1 )
● ./bin/spark-shell ( start the Spark shell; pwd should be /usr/local/spark)
9
Spark (2.0.1)
[Tips]
https://goo.gl/oxNbIX
./bin/run-example org.apache.spark.examples.SparkPi
! Spark Shell Spark command line
◦ Spark
! spark-shell
◦ [ ] run bin/spark-shell from the Spark install directory
!
◦ var res1: Int = 3 + 5
◦ import org.apache.spark.rdd._
◦ val intRdd: RDD[Int]=sc.parallelize(List(1,2,3,4,5))
◦ intRdd.collect
◦ val txtRdd=sc.textFile("file:///usr/local/spark/README.md")
◦ txtRdd.count
! spark-shell
◦ [ ] type :quit or press Ctrl+D
10
Spark Shell
Spark
Scala
[Tips]:
➢ var val ?
➢ intRdd txtRdd ?
➢ org. [Tab] ?
➢ http://localhost:4040
! Spark
! RDD(Resilient Distributed Dataset)
! Scala
! Spark MLlib
11
Outline
! Google
! Map Reduce
! MapReduce
◦ Map: (K1, V1) → list(K2, V2)
◦ Reduce: (K2, list(V2)) → list(K3, V3)
! ( Word Count )
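A minimal sketch of how Word Count fits this pattern (illustrative Scala, not tied to any particular MapReduce framework): K1/V1 are the line offset and line text, K2/V2 are a word and the count 1, and K3/V3 are a word and its total count.
def wcMap(offset: Long, line: String): List[(String, Int)] =
  line.split(" ").map(word => (word, 1)).toList // (K1,V1) -> list(K2,V2)
def wcReduce(word: String, counts: List[Int]): (String, Int) =
  (word, counts.sum) // (K2,list(V2)) -> (K3,V3)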
12
RDD MapReduce
! MapReduce on Hadoop
Word Count …
◦ iteration iteration ( )
…
13
Hadoop …
HDFS
! Spark – RDD(Resilient Distribute Datasets)
◦ In-Memory Data Processing and Sharing
◦ (tolerant) (efficient)
!
◦ (lineage) – RDD
◦ lineage
!
◦ Transformations: In memory lazy lineage RDD
◦ Action: return Storage
◦ Persistence: RDD
14
Spark …
: 1+2+3+4+5 = 15
Transformation Action
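A minimal spark-shell sketch of this 1+2+3+4+5 example: map is a transformation (lazy, only recorded in the lineage), while reduce is an action that actually triggers the computation.
val nums = sc.parallelize(List(1, 2, 3, 4, 5)) // RDD[Int]
val plusOne = nums.map(_ + 1)                  // transformation: nothing executes yet
val total = nums.reduce(_ + _)                 // action: runs the job, total = 15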
15
RDD
RDD Ref: http://spark.apache.org/docs/latest/programming-guide.html#transformations
! SparkContext.textFile – RDD
! map: RDD RDD
! filter: RDD RDD
! reduceByKey: RDD Key
RDD Key
! groupByKey: RDD Key RDD
! join cogroup: RDD Key
RDD
! sortBy reverse: RDD
! take(N): RDD N RDD
! saveAsTextFile: RDD
16
RDD
! count: RDD
! collect: RDD Collection(Seq
! head(N): RDD N
! mkString: Collection
17
[Tips]
•
• Transformation
! [Exercise] spark-shell
◦val intRDD = sc.parallelize(List(1,2,3,4,5,6,7,8,9,0))
◦intRDD.map(x => x + 1).collect()
◦intRDD.filter(x => x > 5).collect()
◦intRDD.stats
◦val mapRDD=intRDD.map{x=>("g"+(x%3), x)}
◦mapRDD.groupByKey.foreach{x=>println("key: %s,
vals=%s".format(x._1, x._2.mkString(",")))}
◦mapRDD.reduceByKey(_+_).foreach(println)
◦mapRDD.reduceByKey{case(a,b) => a+b}.foreach(println)
18
RDD
! [Exercise] (The Gettysburg Address)
◦ (The Gettysburg Address)(https://
docs.google.com/file/d/0B5ioqs2Bs0AnZ1Z1TWJET2NuQlU/
view) gettysburg.txt
◦ gettysburg.txt ( )
●
◦
◦
◦
19
RDD (Word Count )
sc.textFile flatMap split
toLowerCase, filter
sortBy foreach
https://github.com/yclee0418/sparkTeach/blob/master/codeSample/exercise/
WordCount_Rdd.txt
take(5) foreach
reduceByKey
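A minimal spark-shell sketch of the chain described above, assuming gettysburg.txt sits in the current working directory (the full version is in WordCount_Rdd.txt at the link above):
sc.textFile("gettysburg.txt")
  .flatMap(_.split(" "))      // split each line into words
  .map(_.toLowerCase.trim)    // normalize case
  .filter(_.nonEmpty)         // drop empty tokens
  .map(word => (word, 1))     // pair every word with 1
  .reduceByKey(_ + _)         // sum the counts per word
  .sortBy(_._2, false)        // sort by count, descending
  .take(5).foreach(println)   // print the five most frequent words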
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
20
Outline
! Scala Scalable Language ( )
! Scala
◦ lambda expression
Scala
Scala: List(1,2,3,4,5).foreach(x=>println("item %d".format(x)))
Java:
int[] intArr = new int[] {1,2,3,4,5};
for (int x: intArr) System.out.println(String.format("item %d", x));
! scala Java .NET
! ( actor model akka)
! Spark
! import
◦ import org.apache.spark.SparkContext
◦ import org.apache.spark.rdd._ ( rdd class)
◦ import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } (
clustering class)
!
◦ val int1: Int = 5 ( error)
◦ var int2: Int = 5 ( )
◦ val int = 5 ( )
! ( )
◦ def voidFunc(param1: Type, param2: Type2) = { … }
22
Scala
def setLogger = {
Logger.getLogger("com").setLevel(Level.OFF)
Logger.getLogger("io").setLevel(Level.OFF)
}
! ( )
◦ def rtnFunc1(param1: Type, param2: Type2): Type3 = {
val v1:Type3 = …
v1 //
}
! ( )
◦ def rtnFunc2(param1: Type, param2: Type2): (Type3, Type4) = {
val v1: Type3 = …
val v2: Type4= …
(v1, v2)
//
}
23
Scala
def getMinMax(intArr: Array[Int]):(Int,Int) = {
val min=intArr.min
val max=intArr.max
(min, max)
}
!
◦ val res = rtnFunc1(param1, param2) ( res
)
◦ val (res1, res2) = rtnFunc2(param1, param2) (
res1,res2 )
◦ val (_, res2) = rtnFunc2(param1, param2) (
)
! For Loop
◦ for (i <- collection) { … }
! For Loop ( yield )
◦ val rtnArr = for (i <- collection) yield { … }
24
Scala
val intArr = Array(1,2,3,4,5,6,7,8,9)
val multiArr=
for (i <- intArr; j <- intArr)
yield { i*j }
//multiArr 81 99
val (min,max)=getMinMax(intArr)
val (_, max)=getMinMax(intArr)
! Tuple
◦ Tuple
◦ val v=(v1,v2,v3...) v._1, v._2, v._3…
◦ lambda
◦ lambda (_)
25
Scala val intArr = Array(1,2,3,4,5,7,8,9)
val res=getMinMax(intArr) //res=(1,9)=>tuple
val min=res._1 // res
val max=res._2 // res
val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple
val intArr2=intArr.map(x=> (x._1 * x._2 * x._3))
//intArr2: Array[Int] = Array(6, 120, 504)
val intArr3=intArr.filter(x=> (x._1 + x._2 > x._3))
//intArr3: Array[(Int, Int, Int)] = Array((4,5,6), (7,8,9))
val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple
def getThird(x:(Int,Int,Int)): Int = { (x._3) }
val intArr2=intArr.map(getThird(_))
val intArr2=intArr.map(x=>getThird(x)) //
//intArr2: Array[Int] = Array(3, 6, 9)
! Class
◦ Scala Class JAVA Class
● private /
protected public
● Class
26
Scala
Scala:
class Person(userID: Int, name: String) // private
class Person(val userID: Int, var name: String)
// public userID
val person = new Person(102, "John Smith") //
person.userID // 102
Person class Java :
public class Person {
private final int userID;
private final String name;
public Person(int userID, String name) {
this.userID = userID;
this.name = name;
}}
! Object
◦ Scala static
instance
◦ Scala Object static
● Scala Object singleton class instance
! Scala Object vs Class
◦ object utility Spark Driver Program
◦ class Entity
27
Scala
Scala Object:
object Utility {
def isNumeric(input: String): Boolean = input.trim()
.matches("""[+-]?((\d+(e\d+)?[lL]?)|(((\d+(\.\d*)?)|(\.\d+))(e\d+)?[fF]?))""")
def toDouble(input: String): Double = {
val rtn = if (input.isEmpty() || !isNumeric(input)) Double.NaN else input.toDouble
rtn
}}
val d = Utility.toDouble("20") // called directly on the object, no new needed
!
◦ val intArr = Array(1,2,3,4,5,7,8,9)
!
◦ val intArrExtra = intArr ++ Array(0,11,12)
! map:
! filter:
! join: Map Key Map
! sortBy reverse:
! take(N): N
28
scala
val intArr = Array(1,2,3,4,5,6,7,8,9)
val intArr2=intArr.map(_ * 2)
//intArr2: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
val intArr3=intArr.filter(_ > 5)
//intArr3: Array[Int] = Array(6, 7, 8, 9)
val intArr4=intArr.reverse
//intArr4: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
! sum:
◦ val sum = Array(1,2,3,4,5,6,7,8,9).sum
! max:
◦ val max = Array(1,2,3,4,5,6,7,8,9).max
! min:
◦ val min = Array(1,2,3,4,5,6,7,8,9).min
! distinct:
29
scala
val intArr = Array(1,2,3,4,5,6,7,8,9)
val sum = intArr.sum
//sum = 45
val max = intArr.max
//max = 9
val min = intArr.min
//min = 1
val disc = Array(1,1,1,2,2,2,3,3).distinct
//disc = Array(1, 2, 3)
! spark-shell
! ScalaIDE for eclipse 4.4.1
◦ http://scala-ide.org/download/sdk.html
◦
◦ ( )
◦
◦ ScalaIDE
30
(IDE)
! Driver Program(word complete breakpoint
)
! spark-shell jar
!
◦Eclipse 4.4.2 (Luna)
◦ Scala IDE 4.4.1
◦ Scala 2.11.8 and Scala 2.10.6
◦ Sbt 0.13.8
◦ Scala Worksheet 0.4.0
◦ Play Framework support 0.6.0
◦ ScalaTest support 2.10.0
◦ Scala Refactoring 0.10.0
◦ Scala Search 0.3.0
◦ Access to the full Scala IDE ecosystem
31
Scala IDE for eclipse
! Scala IDE Driver Program
◦ Scala Project
◦ Build Path
● Spark
● Scala
◦ package
● package ( )
◦ scala object
◦
◦ debug
◦ Jar
◦ spark-submit Spark
32
Scala IDE Driver Program
! Scala IDE
◦ FILE -> NEW ->
Scala Project
◦ project
FirstScalaProj
◦ JRE 1.8 (1.7 )
◦
◦ Finish
33
Scala Project
! Package Explorer Project Explorer
FirstScalaProj Build Path ->
Configure Build Path
34
Build Path
[Tips]:
Q: Package Project Explorer
A:
! Scala perspective
! Scala perspective
-> Window -> Show View
! Spark Driver Program Build Path
◦ jar
◦ Scala Library Container 2.11.8(IDE 2.11.8 )
! Configure Build Path Java Build Path Libraries -
>Add External JARs…
◦Spark Jar Spark /assembly/target/scala-2.11/jars/
◦ jar
! Java Build Path Scala Library Container 2.11.8
35
Build Path
! Package Explorer FirstScalaProj src
package
◦ src ->New->Package( Ctrl N)
◦ bikeSharing Package
! FirstScalaProj data (Folder) input
36
Package
! (gettysburg.txt)copy data
! bikeSharing Package Scala Object
BikeCountSort
!
37
Scala Object
package practice1
//spark lib
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd._
//log
import org.apache.log4j.Logger
import org.apache.log4j.Level
object WordCount {
def main(args: Array[String]): Unit = {
// keep the console Log quiet
Logger.getLogger("org").setLevel(Level.ERROR) //mark for MLlib INFO msg
val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))
val rawRdd = sc.textFile("data/gettysburg.txt").flatMap { x => x.split(" ") }
// normalize to lower case and filter out empty tokens
val txtRdd = rawRdd.map { x => x.toLowerCase.trim }.filter { x => !x.equals("") }
val countRdd = txtRdd.map { x => (x, 1) } // map each word to (word, 1)
val resultRdd = countRdd.reduceByKey { case (a, b) => a + b } // sum counts with reduceByKey
val sortResRdd = resultRdd.sortBy((x => x._2), false) // sort by count, descending
sortResRdd.take(5).foreach { println } // print the top 5 words
sortResRdd.saveAsTextFile("data/wc_output")
}
}
38
WordCount
import
Library
object main
saveAsTextFile
! word complete: press Alt+/ to trigger word completion
! ( tuple
)
39
IDE
! debug configuration
◦ icon Debug
Configurations
◦ Scala Application
Debug
● Name WordCount
● Project FirstScalaProj
● Main Class
practice1.WordCount
◦ Launcher
40
Debug Configuration
! icon Debug Configuration
Debug console
41
[Tips]
• data/output sortResRdd ( part-xxxx )
• Log Level console
• output
! Spark-Submit JAR
◦ Package Explorer FirstScalaProj -
>Export...->Java/JAR file-> FirstScalaProj src
JAR File
42
JAR
! input output JAR File
◦ data JAR
File
43
Spark-submit
! spark-submit
44
Spark-submit
1.submit
2. launch
works
3. return status
! Command Line JAR File
! Spark-submit
./bin/spark-submit
--class <main-class> (package scala object )
--master <master-url> ( master URL local[Worker thread num])
--deploy-mode <deploy-mode> ( Worker Cluster Client Client)
--conf <key>=<value> ( Spark )
... # other options
<application-jar> (JAR )
[application-arguments] ( Driver main )
45
Spark-submit submit JOB
Spark /bin/spark-submit --class practice1.WordCount --master local[*] WordCount.jar
[Tips]:
! spark-submit JAR data
! merge output
◦ linux: cat data/output/part-* > res.txt
◦ windows: type data\output\part-* > res.txt
! Exercise wordCount Package WordCount2
Object
◦ gettysburg.txt ( )
●
◦
● Hint1: (index)
● val posRdd=txtRdd.zipWithIndex()
● Hint2: reduceByKey groupByKey
index
46
Word Count
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
47
Outline
!
◦
◦
!
◦
◦
48
Tasks Experience Performance
!
!
!
!
!
!
!
!
! DNA
!
!
49
! (Supervised learning)
◦ (Training Set)
◦ (Features)
(Label)
◦ Regression Classification (
)
50
http://en.proft.me/media/science/ml_svlw.jpg
! (Unsupervised learning)
◦ (
Label)
◦
◦ Clustering ( KMeans)
51http://www.cnblogs.com/shishanyuan/p/4747761.html
! MLlib Machine Learning library Spark
!
◦ RDD
◦
52
Spark MLlib
http://www.cnblogs.com/shishanyuan/p/4747761.html
53
Spark MLlib
https://www.safaribooksonline.com/library/view/spark-for-python/9781784399696/graphics/B03986_04_02.jpg
! Bike Sharing Dataset (
)
! https://archive.ics.uci.edu/ml/datasets/
Bike+Sharing+Dataset
◦
● hour.csv: 2011.01.01~2012.12.30
17,379
● day.csv: hour.csv
54
Spark MLlib Let’s biking
55
Bike Sharing Dataset
Features
Label
(for hour.csv only)
(0 to 6)
(1 to 4)
!
◦ (Summary Statistics):
MultivariateStatisticalSummary Statistics
◦ Feature ( ) Label ( )
(correlation) Statistics
!
◦ Clustering KMeans
!
◦ Classification Decision Tree LogisticRegressionWithSGD
!
◦ Regression Decision Tree LinearRegressionWithSGD
56
Spark MLlib
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
57
Outline
58
! (Summary Statistics)
◦
◦
◦Spark
● 1: RDD[Double/
Float/Int] RDD stats
● 2: RDD[Vector]
Statistics.colStats
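A minimal spark-shell sketch of the two approaches (the numbers are made up for illustration):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// 1: a numeric RDD exposes stats (count, mean, stdev, max, min)
val cnts = sc.parallelize(Seq(16.0, 40.0, 32.0, 13.0))
println(cnts.stats)
// 2: an RDD[Vector] is summarized column by column with Statistics.colStats
val rows = sc.parallelize(Seq(Vectors.dense(0.3, 81.0), Vectors.dense(0.5, 40.0)))
val summary = Statistics.colStats(rows)
println(summary.mean)     // per-column mean
println(summary.variance) // per-column variance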
59
! (correlation)
◦ (Correlation )
◦ Spark Pearson Spearman
◦ r Statistics.corr
● 0 < | r | < 0.3 ( )
● 0.3 <= | r | < 0.7 ( )
● 0.7 <= | r | < 1 ( )
● r = 1 ( )
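For reference, the Pearson coefficient computed by Statistics.corr by default is:
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}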
60
A. Scala
B. Package Scala Object
C. data Folder
D. Library
! ScalaIDE Scala folder package Object
◦ SummaryStat ( )
● src
● bike (package )
● BikeSummary (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
61
62
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import Statistics library
import org.apache.spark.mllib.stat.{ MultivariateStatisticalSummary,
Statistics }
object BikeSummary {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]"))
}
}
! spark-shell sparkContext sc
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
63
64
! prepare
◦ input file Features
Label
RDD
! lines.map features( 3~14 ) label( 17 ) RDD
! RDD
:
◦ RDD[Array]
◦ RDD[Tuple]
◦ RDD[BikeShareEntity]
prepare
def prepare(sc: SparkContext): RDD[???] = {
val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder
val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) =>
{ if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name)
val lines:RDD[Array[String]] = rawDataNoHead.map { x =>
x.split(",").map { x => x.trim() } } //split columns with comma
val bikeData:RDD[???]=lines.map{ … } //??? depends on your impl
}
65
! RDD[Array]:
◦ val bikeData:RDD[Array[Double]] =lines.
map{x=>(x.slice(3,13).map(_.toDouble) ++ Array(x(16).toDouble))}
◦ Pros/cons: prepare is easy to implement, but painful to use later (you must
remember each field's index in the Array) and error-prone
! RDD[Tuple]:
◦ val bikeData:RDD[(Double, Double, Double, …, Double)]
=lines.map{case(season,yr,mnth,…,cnt)=>(season.toDouble, yr.toDouble,
mnth.toDouble,…cnt.toDouble)}
◦ Pros/cons: prepare is harder to implement and still awkward to use later, but less
error-prone (well-named variables can receive the returned values)
◦ Example: val features = bikeData.map{case(season,yr,mnth,…,cnt)=> (season, yr,
mnth, …, windspeed)}
66
! RDD[Case Class]:
◦ val bikeData:RDD[BikeShareEntity] = lines.map{ x=> BikeShareEntity(⋯)}
◦ Pros/cons: prepare is painful to implement but pleasant to use later (you work with
entity objects, no field positions to track, better abstraction) and hard to get wrong
◦ Example: val labelRdd = bikeData.map{ ent => { ent.label }}
Case Class Class
case class BikeShareEntity(instant: String,dteday:String,season:Double,
yr:Double,mnth:Double,hr:Double,holiday:Double,weekday:Double,
workingday:Double,weathersit:Double,temp:Double,
atemp:Double,hum:Double,windspeed:Double,casual:Double,
registered:Double,cnt:Double)
67
map RDD[BikeShareEntity]
val bikeData = rawData.map { x =>
BikeShareEntity(x(0), x(1), x(2).toDouble, x(3).toDouble,x(4).toDouble,
x(5).toDouble, x(6).toDouble,x(7).toDouble,x(8).toDouble,
x(9).toDouble,x(10).toDouble,x(11).toDouble,x(12).toDouble,
x(13).toDouble,x(14).toDouble,x(15).toDouble,x(16).toDouble) }
68
! (Class)
! prepare
◦ input file Features
Label
RDD
Entity Class
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import Statistics library
import org.apache.spark.mllib.stat.
{ MultivariateStatisticalSummary, Statistics }
object BikeSummary {
case class BikeShareEntity(⋯⋯)
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new
SparkConf().setAppName("BikeSummary").setMaster("local[*]"))
}
}
69
70
! getFeatures
◦
! printSummary
◦ console
! printCorrelation
◦
console
printSummary
def printSummary(entRdd: RDD[BikeShareEntity]) = {
val dvRdd = entRdd.map { x => Vectors.dense(getFeatures(x)) } // convert to
RDD[Vector]
// compute the summary statistics with Statistics.colStats
val summaryAll = Statistics.colStats(dvRdd)
println("mean: " + summaryAll.mean.toArray.mkString(",")) // per-column mean
println("variance: " + summaryAll.variance.toArray.mkString(",")) // per-column variance
}
71
getFeatures
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
//
val featureArr = Array(bikeData.casual, bikeData.registered,bikeData.cnt)
featureArr
}
72
printCorrelation
def printCorrelation(entRdd: RDD[BikeShareEntity]) = {
// extract each column as an RDD[Double]
val cntRdd = entRdd.map { x => x.cnt }
val yrRdd = entRdd.map { x => x.yr }
val yrCorr = Statistics.corr(yrRdd, cntRdd) // correlation between yr and cnt
println("correlation: %s vs %s: %f".format("yr", "cnt", yrCorr))
val seaRdd = entRdd.map { x => x.season } // season column
val seaCorr = Statistics.corr(seaRdd, cntRdd)
println("correlation: %s vs %s: %f".format("season", "cnt", seaCorr))
}
A.
◦ BikeSummary.scala SummaryStat
◦ hour.csv data
◦ BikeSummary ( TODO
B.
◦ getFeatures printSummary
● console (temp) (hum) (windspeed)
● yr mnth (temp) (hum)
(windspeed) (cnt) console
73
for (yr <- 0 to 1)
for (mnth <- 1 to 12) {
val yrMnRdd = entRdd.filter { ??? }.map { x => Vectors.dense(getFeatures(x)) }
val summaryYrMn = Statistics.colStats( ??? )
println("====== summary yr=%d, mnth=%d ==========".format(yr,mnth))
println("mean: " + ???)
println("variance: " + ???)
}
A.
◦ BikeSummary printCorrelation
◦ hour.csv [yr~windspeed] cnt
console
B. feature
◦ printCorrelation
● yr mnth feature( yrmo yrmo=yr*12+mnth)
yrmo cnt
74
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
75
Outline
! Traing Set
( Label)
! (cluster)
!
!
◦
76
Clustering
!
! Given observations (x1,x2,...,xn), K-Means partitions the n observations into K clusters
(k≤n) so as to minimize the within-cluster sum of squares (WCSS)
!
A. K
B. K
C. ( )
D. B C
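The objective these iterations minimize can be written as follows, where S_j is cluster j and \mu_j its center:
\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x \in S_j} \lVert x - \mu_j \rVert^{2}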
77
K-Means
iteration
RUN
78
K-Means
ref: http://mropengate.blogspot.tw/2015/06/ai-ch16-5-k-introduction-to-clustering.html
! KMeans.train Model(KMeansModel
◦ val model=KMeans.train(data, numClusters, maxIterations,
runs)
● data (RDD[Vector])
● numClusters (K)
● maxIterations run Iteration
iteration maxIterations model
● runs KMeans run
model
! model.clusterCenters Feature
! model.computeCost WCSS model
79
K-Means in Spark MLlib
80
K-Means BikeSharing
! hour.csv KMeans
console
◦ Features yr, season, mnth, hr, holiday, weekday,
workingday, weathersit, temp, atemp, hum,
windspeed,cnt( cnt Label Feature )
◦ numClusters 5 ( 5 )
◦ maxIterations 20 ( run 20 iteration)
◦ runs 3 3 Run model)
81
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
K
! ScalaIDE Scala folder package Object
◦ Clustering ( )
● src
● bike (package )
● BikeShareClustering (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
82
83
A. import
B. main Driver Program
C. Log
D. SparkContext
Model
Model
K
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import KMeans library
import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel }
object BikeShareClustering {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]"))
}
}
! KMeans Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
84
85
! (Class)
! prepare
◦ input file Features
Label
RDD
! BikeSummary
Model
Model
K
86
! getFeatures
◦
! KMeans
◦ KMeans.train KMeansModel
! getDisplayString
◦
Model
Model
K
getFeatures getDisplayString
getFeatures
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
val featureArr = Array(bikeData.cnt, bikeData.yr, bikeData.season,
bikeData.mnth, bikeData.hr, bikeData.holiday, bikeData.weekday,
bikeData.workingday, bikeData.weathersit, bikeData.temp,
bikeData.atemp, bikeData.hum, bikeData.windspeed, bikeData.casual,
bikeData.registered)
featureArr
}
87
getDisplayString
def getDisplayString(centers:Array[Double]): String = {
val dispStr = """cnt: %.5f, yr: %.5f, season: %.5f, mnth: %.5f, hr: %.5f,
holiday: %.5f, weekday: %.5f, workingday: %.5f, weathersit: %.5f, temp:
%.5f, atemp: %.5f, hum: %.5f,windspeed: %.5f, casual: %.5f, registered:
%.5f"""
.format(centers(0), centers(1),centers(2), centers(3),centers(4),
centers(5),centers(6), centers(7),centers(8), centers(9),centers(10),
centers(11),centers(12), centers(13),centers(14))
dispStr
}
KMeans
// convert the features into an RDD[Vector]
val featureRdd = bikeData.map { x =>
Vectors.dense(getFeatures(x)) }
val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations, 3 runs
88
var clusterIdx = 0
model.clusterCenters.sortBy { x => x(0) }.foreach { x => {
println("center of cluster %d\n%s".format(clusterIdx,
getDisplayString(x.toArray)))
clusterIdx += 1
} } // print the centers sorted by cnt
89
//K-Means
import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel }
object BikeShareClustering {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
// initialize the SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]"))
println("============== preparing data ==================")
val bikeData = prepare(sc) // read hour.csv into an RDD[BikeShareEntity]
bikeData.persist()
println("============== clustering by KMeans ==================")
// convert the features into an RDD[Vector]
val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) }
val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations, 3 runs
var clusterIdx = 0
model.clusterCenters.sortBy { x => x(0) }.foreach { x => {
println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray)))
clusterIdx += 1
} } // print the centers sorted by cnt
bikeData.unpersist()
}
}
90
! yr season mnth hr cnt
! weathersit cnt ( )
! temp atemp cnt ( )
! hum cnt ( )
! correlation
91
! K Model
WCSS
! WCSS K
Model
Model
K
! model.computeCost WCSS model
(WCSS )
! numClusters WCSS (K)
! WCSS
92
K-Means
println("============== tuning parameters ==================")
for (k <- Array(5, 10, 15, 20, 25)) {
// try different numClusters values and print the WCSS for each
val iterations = 20
val tm = KMeans.train(featureRdd, k, iterations, 3)
println("k=%d, WCSS=%f".format(k, tm.computeCost(featureRdd)))
}
============== tuning parameters ==================
k=5, WCSS=89540755.504054
k=10, WCSS=36566061.126232
k=15, WCSS=23705349.962375
k=20, WCSS=18134353.720998
k=25, WCSS=14282108.404025
A.
◦ BikeShareClustering.scala Scala
◦ hour.csv data
◦ BikeShareClustering ( TODO
B. feature
◦ BikeClustering
● yrmo getFeatures KMeans
console yrmo
● numClusters (ex:50,75,100)
93
K-Means
! K-Means
! KMeans
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
94
Outline
!
(Binary Classification)
(Multi-Class Classification)
!
!
◦ (logistic regression) (decision
trees) (naive Bayes)
◦
95
!
!
(Features)
(Label)
! (Random Forest)
!
96
! import org.apache.spark.mllib.tree.DecisionTree
! import org.apache.spark.mllib.tree.model.DecisionTreeModel
! DecisionTree.trainClassifier Model(DecisionTreeModel
◦ val model=DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo,
impurity, maxDepth, maxBins)
● trainData RDD[LabeledPoint]
● numClasses 2
● categoricalFeaturesInfo declares which features of trainData are categorical, as a Map[feature index, number of categories];
features not listed are treated as continuous
● e.g. Map(0->2,4->10) marks the 1st and 5th features as categorical with 2 and 10 categories
● impurity (Gini Entropy)
● maxDepth
overfit
● maxBins
● maxBins must be at least as large as the largest category count declared in categoricalFeaturesInfo
97
Decision Tree in Spark MLlib
!
( )
!
threshold( )
◦ Features yr, season, mnth, hr, holiday, weekday,
workingday, weathersit, temp, atemp, hum, windspeed
◦ Label cnt 200 1 0
◦ numClasses 2
◦ impurity gini
◦ maxDepth 5
◦ maxBins 30
98
99
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Classification ( )
● src
● bike (package )
● BikeShareClassificationDT (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
100
101
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import decision tree library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareClassificationDT {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationDT").setMaster("local[*]"))
}
}
! Decision Tree Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
102
103
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryInfo
◦ categroyInfoMap
104
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder
val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) =>
{ if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name)
val lines:RDD[Array[String]] = rawDataNoHead.map { x =>
x.split(",").map { x => x.trim() } } //split columns with comma
val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity]
val lpData=bikeData.map { x => {
val label = if (x.cnt > 200) 1 else 0 //label is 1 when cnt > 200, otherwise 0
val features = Vectors.dense(getFeatures(x))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
105
getFeatures and getCategoryInfo
getFeatures method
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1,
bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday,
bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
} // categorical features such as season are shifted by 1 so their values start at 0
getCategoryInfo method
def getCategoryInfo(): Map[Int, Int]= {
val categoryInfoMap = Map[Int, Int](
(/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12),
(/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7),
(/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4))
categoryInfoMap
} //(index of the feature in featureArr, number of distinct values)
106
Model
Model
! trainModel
◦ DecisionTree.trainClassifier Model
! evaluateModel
◦ AUC trainModel Model
107
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
impurity: String, maxDepth: Int, maxBins: Int,cateInfo: Map[Int, Int]):
(DecisionTreeModel, Double) = {
val startTime = new DateTime() //
val model = DecisionTree.trainClassifier(trainData, 2, cateInfo, impurity,
maxDepth, maxBins) // Model
val endTime = new DateTime() //
val duration = new Duration(startTime, endTime) //
//MyLogger.debug(model.toDebugString) // Decision Tree
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double =
{
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] AUC
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auc = metrics.areaUnderROC()// areaUnderROC auc
auc
}
108
Model
Model! tuneParameter
◦ impurity Max Depth Max Bin
trainModel
evaluateModel AUC
109
AUC (Area under the Curve of ROC)
                            Predicted Positive (Label 1) | Predicted Negative (Label 0)
Actual Positive (Label 1)   true positive (TP)           | false negative (FN)
Actual Negative (Label 0)   false positive (FP)          | true negative (TN)
! TPR (True Positive Rate): of the samples whose actual label is 1, the fraction predicted as 1
◦ TPR = TP / (TP + FN)
! FPR (False Positive Rate): of the samples whose actual label is 0, the fraction predicted as 1
◦ FPR = FP / (FP + TN)
! Plotting FPR on the X axis against TPR on the Y axis gives the ROC curve
! AUC is the area under the ROC curve
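A small worked example with made-up counts: if TP=40, FN=10, FP=20, TN=30, then TPR = 40/(40+10) = 0.8 and FPR = 20/(20+30) = 0.4, giving the single ROC point (0.4, 0.8); sweeping the decision threshold traces out the full curve whose area is the AUC.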
110
AUC
AUC = 1: a perfect classifier, every prediction is correct (100%)
0.5 < AUC < 1: better than random guessing
AUC = 0.5: no better than random guessing
AUC < 0.5: worse than random guessing
AUC(Area under the Curve of ROC)
111
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint],
cateInfo: Map[Int, Int]) = {
val impurityArr = Array("gini", "entropy")
val depthArr = Array(3, 5, 10, 15, 20, 25)
val binsArr = Array(50, 100, 200)
val evalArr =
for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr)
yield { // train a model for every parameter combination and record its AUC
val (model, duration) = trainModel(trainData, impurity, maxDepth,
maxBins, cateInfo)
val auc = evaluateModel(validateData, model)
println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f"
.format(impurity, maxDepth, maxBins, auc))
(impurity, maxDepth, maxBins, auc)
}
val bestEval = (evalArr.sortBy(_._4).reverse)(0) // the combination with the highest AUC
println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
112
Decision Tree
//MLlib lib
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.linalg.Vectors
//decision tree
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareClassificationDT {
case class BikeShareEntity(…) // case class
def main(args: Array[String]): Unit = {
MyLogger.setLogger
val doTrain = (args != null && args.length > 0 && "Y".equals(args(0)))
val sc = new SparkContext(new SparkConf().setAppName("ClassificationDT").setMaster("local[*]"))
println("============== preparing data ==================")
val (trainData, validateData) = prepare(sc)
val cateInfo = getCategoryInfo()
if (!doTrain) {
println("============== train Model (CateInfo)==================")
val (modelC, durationC) = trainModel(trainData, "gini", 5, 30, cateInfo)
val aucC = evaluateModel(validateData, modelC)
println("validate auc(CateInfo)=%f".format(aucC))
} else {
println("============== tuning parameters(cateInfo) ==================")
tuneParameter(trainData, validateData, cateInfo)
}
}
}
A.
◦ BikeShareClassificationDT.scala Scala
◦ hour.csv data
◦ BikeShareClassificationDT ( TODO
B. feature
◦ BikeShareClassificationDT
● category AUC
● feature ( |correlation| > 0.1 ) Model AUC
113
Decision Tree
============== tuning parameters(cateInfo) ==================
parameter: impurity=gini, maxDepth=3, maxBins=50, auc=0.835524
parameter: impurity=gini, maxDepth=3, maxBins=100, auc=0.835524
parameter: impurity=gini, maxDepth=3, maxBins=200, auc=0.835524
parameter: impurity=gini, maxDepth=5, maxBins=50, auc=0.851846
parameter: impurity=gini, maxDepth=5, maxBins=100, auc=0.851846
parameter: impurity=gini, maxDepth=5, maxBins=200, auc=0.851846
! Simple linear regression ( y = ax + b ) predicts a continuous value (y)
◦ from an input variable (x) to an output value (y)
! Logistic regression, despite its name, is used for classification:
◦ it predicts which class a sample belongs to
! An S-shaped (sigmoid) function maps the output to a probability p;
when p exceeds 0.5 the sample is predicted as the positive class, otherwise the negative class
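The sigmoid used here, with z the linear combination of the features, is:
p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^{\top}x + b
and the sample is classified as 1 when p \geq 0.5, otherwise 0.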
114
! import org.apache.spark.mllib.classification.{
LogisticRegressionWithSGD, LogisticRegressionModel }
! LogisticRegressionWithSGD.train(trainData, numIterations,
stepSize, miniBatchFraction) Model(LogisticRegressionModel
◦val model=LogisticRegressionWithSGD.train(trainData,numIterations,
stepSize, miniBatchFraction)
● trainData RDD[LabeledPoint]
● numIterations (SGD) 100
● stepSize SGD 1
● miniBatchFraction 0~1
1
115
Logistic Regression in Spark
http://www.csie.ntnu.edu.tw/~u91029/Optimization.html
! Before training LogisticRegression, each categorical feature should be converted
with one-of-K (one-hot) encoding
! One-of-K encoding:
◦ expand the field into N positions (N = number of distinct values)
◦ set the position matching the value's index to 1 and every other position to 0
116
Categorical Features
weather: Clear=1, Mist=2, Light Snow=3, Heavy Rain=4
Index Map: weathersit value 1 -> index 0, 2 -> 1, 3 -> 2, 4 -> 3
One-hot Encoding: index 0 -> 1000, 1 -> 0100, 2 -> 0010, 3 -> 0001
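A minimal Scala sketch of this weathersit encoding (plain collections, mirroring the getCategoryFeature method shown later):
val weatherMap = Seq(1.0, 2.0, 3.0, 4.0).zipWithIndex.toMap // value -> index, e.g. 2.0 -> 1
def oneHot(v: Double): Array[Double] = {
  val arr = Array.ofDim[Double](weatherMap.size)
  arr(weatherMap(v)) = 1.0 // set the position for this value, leave the rest 0
  arr
}
// oneHot(2.0) == Array(0.0, 1.0, 0.0, 0.0)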
117
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Classification ( )
● src
● bike (package )
● BikeShareClassificationLG (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
118
119
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import Logistic library
//Logistic
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel }
object BikeShareClassificationLG {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationLG").setMaster("local[*]"))
}
}
! Logistic Regression Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
120
121
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryFeature
◦ 1-of-k encode Array[Double]
One-Of-K
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x => x.getField("weathersit") }
.distinct().collect().zipWithIndex.toMap //build the value -> index Map
val lpData=bikeData.map { x => {
val label = x.getLabel()
val features = Vectors.dense(x.getFeatures(weatherMap))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}} … }
def getFeatures (weatherMap: Map[Double, Int])= {
var rtnArr: Array[Double] = Array()
var weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size)
//weatherArray=Array(0,0,0,0)
val index = weatherMap(getField("weathersit")) //weathersit=2; index=1
weatherArray(index) = 1 //weatherArray=Array(0,1,0,0)
rtnArr = rtnArr ++ weatherArray
…. }
! Standardization rescales each feature so that fields with large values or variance
do not dominate the others
! Spark MLlib provides StandardScaler for this
StandardScaler
def prepare(sc): RDD[LabeledPoint] = { …
val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) }
//fit the StandardScaler on the full feature RDD
val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd)
val lpData2= bikeData.map { x =>
{
val label = x.getLabel()
//standardize the features before building the LabeledPoint
val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap)))
new LabeledPoint(label, features)
} }
…
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x =>
x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value -> index Map
//Standardize
val featureRddWithMap = bikeData.map { x =>
Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap,
weekdayMap, workdayMap, weatherMap))}
val stdScalerWithMap = new StandardScaler(withMean = true, withStd =
true).fit(featureRddWithMap)
//build the LabeledPoint with one-of-K encoded category features
val lpData = bikeData.map { x => {
val label = if (x.cnt > 200) 1 else 0 //label is 1 when cnt > 200, otherwise 0
val features =
stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap,
mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)))
new LabeledPoint(label, features)
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
125
getFeatures
getFeatures method
def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int],
seasonMap: Map[Double, Int], mnthMap: Map[Double, Int],
hrMap: Map[Double, Int], holidayMap: Map[Double, Int],
weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int],
weatherMap: Map[Double, Int]): Array[Double] = {
var featureArr: Array[Double] = Array()
//
featureArr ++= getCategoryFeature(bikeData.yr, yrMap)
featureArr ++= getCategoryFeature(bikeData.season, seasonMap)
featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap)
featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap)
featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap)
featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap)
featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap)
featureArr ++= getCategoryFeature(bikeData.hr, hrMap)
//
featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
}
126
getCategoryFeature
getCategoryFeature method
def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]):
Array[Double] = {
var featureArray = Array.ofDim[Double](categoryMap.size)
val index = categoryMap(fieldVal)
featureArray(index) = 1
featureArray
}
127
Model
Model
! trainModel
◦ train the Model with LogisticRegressionWithSGD.train
! evaluateModel
◦ AUC trainModel Model
128
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
numIterations: Int, stepSize: Double, miniBatchFraction: Double):
(LogisticRegressionModel, Double) = {
val startTime = new DateTime()
// LogisticRegressionWithSGD.train
val model = LogisticRegressionWithSGD.train(trainData, numIterations, stepSize,
miniBatchFraction)
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
//MyLogger.debug(model.toPMML()) // model debug
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: LogisticRegressionModel):
Double = {
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] AUC
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auc = metrics.areaUnderROC()// areaUnderROC auc
auc
}
129
Model
Model! tuneParameter
◦ iteration stepSize miniBatchFraction
trainModel evaluateModel
AUC
130
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint])
= {
val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100)
val stepSizeArr: Array[Double] = Array(10, 50, 100, 200)
val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1)
val evalArr =
for (iteration <- iterationArr; stepSize <- stepSizeArr;
miniBatchFraction <- miniBatchFractionArr)
yield { // train a model for every parameter combination and record its AUC
val (model, duration) = trainModel(trainData, iteration, stepSize,
miniBatchFraction)
val auc = evaluateModel(validateData, model)
println("parameter: iteration=%d, stepSize=%f, batchFraction=%f, auc=%f"
.format(iteration, stepSize, miniBatchFraction, auc))
(iteration, stepSize, miniBatchFraction, auc)
}
val bestEval = (evalArr.sortBy(_._4).reverse)(0) // the combination with the highest AUC
println("best parameter: iteration=%d, stepSize=%f, batchFraction=%f, auc=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A.
◦ BikeShareClassificationLG.scala Scala
◦ hour.csv data
◦ BikeShareClassificationLG ( TODO
B. feature
◦ BikeShareClassificationLG
● category AUC
● feature ( |correlation| > 0.1 ) Model AUC
131
Logistic Regression
============== tuning parameters(Category) ==================
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.500000, auc=0.857073
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.800000, auc=0.855904
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=1.000000, auc=0.855685
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.500000, auc=0.852388
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.800000, auc=0.852901
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=1.000000, auc=0.853237
parameter: iteraion=5, stepSize=100.000000, miniBatchFraction=0.500000, auc=0.852087
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
◦ (summary statistics)
◦ Clustering
◦ Classification
◦ Regression
132
Outline
!
!
!
◦ (Least Squares) Lasso
(ridge regression)
133
! import org.apache.spark.mllib.tree.DecisionTree
! import org.apache.spark.mllib.tree.model.DecisionTreeModel
! DecisionTree.trainRegressor Model(DecisionTreeModel
◦ val model=DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity,
maxDepth, maxBins)
● trainData RDD[LabeledPoint]
● categoricalFeaturesInfo declares which features of trainData are categorical, as a Map[feature index, number of categories];
features not listed are treated as continuous
● e.g. Map(0->2,4->10) marks the 1st and 5th features as categorical with 2 and 10 categories
● impurity ( variance)
● maxDepth
overfit
● maxBins
● maxBins must be at least as large as the largest category count declared in categoricalFeaturesInfo
134
Decision Tree Regression in Spark
! Model
◦ Features yr, season, mnth, hr, holiday,
weekday, workingday, weathersit, temp, atemp,
hum, windspeed
◦ Label cnt
◦ impurity variance (the only impurity supported for regression trees)
◦ maxDepth 5
◦ maxBins 30
135
136
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Regression ( )
● src
● bike (package )
● BikeShareRegressionDT (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
137
138
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import decision tree library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareRegressionDT {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionDT").setMaster("local[*]"))
}
}
! Decision Tree Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
139
140
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryInfo
◦ categroyInfoMap
141
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder
val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) =>
{ if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name)
val lines:RDD[Array[String]] = rawDataNoHead.map { x =>
x.split(",").map { x => x.trim() } } //split columns with comma
val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity]
val lpData=bikeData.map { x => {
val label = x.cnt //the prediction target is the rental count (cnt) field
val features = Vectors.dense(getFeatures(x))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
142
getFeatures and getCategoryInfo
getFeatures method
def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1,
bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday,
bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
} // categorical features such as season are shifted by 1 so their values start at 0
getCategoryInfo method
def getCategoryInfo(): Map[Int, Int]= {
val categoryInfoMap = Map[Int, Int](
(/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12),
(/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7),
(/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4))
categoryInfoMap
} //(index of the feature in featureArr, number of distinct values)
143
Model
Model
! trainModel
◦ DecisionTree.trainRegressor Model
! evaluateModel
◦ RMSE trainModel
Model
! Also written as root-mean-square deviation or root-mean-square error
! Plays a role similar to the sample standard deviation, but for the prediction errors
! The smaller the RMSE, the closer the predictions are to the actual values
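With \hat{y}_i the predicted and y_i the actual rental count over n validation samples:
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}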
144
RMSE(root-mean-square error)
145
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
impurity: String, maxDepth: Int, maxBins: Int, cateInfo: Map[Int,Int]):
(DecisionTreeModel, Double) = {
val startTime = new DateTime() //
val model = DecisionTree.trainRegressor(trainData, cateInfo, impurity, maxDepth,
maxBins) // Model
val endTime = new DateTime() //
val duration = new Duration(startTime, endTime) //
//MyLogger.debug(model.toDebugString) // Decision Tree
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double =
{
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] RMSE
}
val metrics = new RegressionMetrics(scoreAndLabels)
val rmse = metrics.rootMeanSquaredError()// rootMeanSquaredError rmse
rmse
}
146
Model
Model! tuneParameter
◦ Max Depth Max Bin
trainModel evaluateModel
RMSE
147
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint],
cateInfo: Map[Int, Int]) = {
val impurityArr = Array("variance")
val depthArr = Array(3, 5, 10, 15, 20, 25)
val binsArr = Array(50, 100, 200)
val evalArr =
for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr)
yield { // train a model for every parameter combination and record its RMSE
val (model, duration) = trainModel(trainData, impurity, maxDepth,
maxBins, cateInfo)
val rmse = evaluateModel(validateData, model)
println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f"
.format(impurity, maxDepth, maxBins, rmse))
(impurity, maxDepth, maxBins, rmse)
}
val bestEvalAsc = (evalArr.sortBy(_._4))
val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE
println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A.
◦ BikeShareRegressionDT.scala Scala
◦ hour.csv data
◦ BikeShareRegressionDT.scala ( TODO
B. feature
◦ BikeShareRegressionDT
● feature dayType(Double ) dayType
● holiday=0 and workingday=0 -> dayType=0
● holiday=1 -> dayType=1
● holiday=0 and workingday=1 -> dayType=2
● dayType feature Model( getFeatures getCategoryInfo)
◦ Categorical Info
148
Decision Tree
============== tuning parameters(CateInfo) ==================
parameter: impurity=variance, maxDepth=3, maxBins=50, rmse=118.424606
parameter: impurity=variance, maxDepth=3, maxBins=100, rmse=118.424606
parameter: impurity=variance, maxDepth=3, maxBins=200, rmse=118.424606
parameter: impurity=variance, maxDepth=5, maxBins=50, rmse=93.138794
parameter: impurity=variance, maxDepth=5, maxBins=100, rmse=93.138794
parameter: impurity=variance, maxDepth=5, maxBins=200, rmse=93.138794
! Least Squares
!
149
! import org.apache.spark.mllib.regression.{LinearRegressionWithSGD,
LinearRegressionModel}
! LinearRegressionWithSGD.train(trainData, numIterations,
stepSize) Model(LinearRegressionModel
◦ val model=LinearRegressionWithSGD.train(trainData, numIterations,
stepSize)
● trainData RDD[LabeledPoint]
● numIterations (SGD)
● stepSize SGD 1
stepSize
● miniBatchFraction 0~1
1
150
Least Squares Regression in Spark
151
Model
A. Scala
B. Package Scala Object
C. data Folder
D. Library
Model
! ScalaIDE Scala folder package Object
◦ Regression ( )
● src
● bike (package )
● BikeShareRegressionLR (scala object )
● data (folder )
● hour.csv
! Build Path
◦ Spark /assembly/target/scala-2.11/jars/
◦ Scala container 2.11.8
152
153
Model
Model
A. import
B. main Driver Program
C. Log
D. SparkContext
//import spark rdd library
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import linear regression library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{ LinearRegressionWithSGD, LinearRegressionModel }
object BikeShareRegressionLR {
def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger
//initialize SparkContext
val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionLR").setMaster("local[*]"))
}
}
! Linear Regression Library
! Driver Program sc
◦ appName - Driver Program
◦ master - master URL
154
155
Model
Model
! (Class)
◦ BikeSummary
! prepare
◦ input file Features Label
RDD[LabeledPoint]
◦ RDD[LabeledPoint]
! getFeatures
◦ Model feature
! getCategoryFeature
◦ 1-of-k encode Array[Double]
One-Of-K
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x => x.getField("weathersit") }
.distinct().collect().zipWithIndex.toMap //build the value -> index Map
val lpData=bikeData.map { x => {
val label = x.getLabel()
val features = Vectors.dense(x.getFeatures(weatherMap))
new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
}} … }
def getFeatures (weatherMap: Map[Double, Int])= {
var rtnArr: Array[Double] = Array()
var weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size)
//weatherArray=Array(0,0,0,0)
val index = weatherMap(getField("weathersit")) //weathersit=2; index=1
weatherArray(index) = 1 //weatherArray=Array(0,1,0,0)
rtnArr = rtnArr ++ weatherArray
…. }
! Standardization rescales each feature so that fields with large values or variance
do not dominate the others
! Spark MLlib provides StandardScaler for this
StandardScaler
def prepare(sc): RDD[LabeledPoint] = { …
val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) }
//fit the StandardScaler on the full feature RDD
val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd)
val lpData2= bikeData.map { x =>
{
val label = x.getLabel()
//standardize the features before building the LabeledPoint
val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap)))
new LabeledPoint(label, features)
} }
…
prepare
def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= {
…
val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity]
val weatherMap=bikeData.map { x =>
x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value -> index Map
//Standardize
val featureRddWithMap = bikeData.map { x =>
Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap,
weekdayMap, workdayMap, weatherMap))}
val stdScalerWithMap = new StandardScaler(withMean = true, withStd =
true).fit(featureRddWithMap)
//build the LabeledPoint with one-of-K encoded category features
val lpData = bikeData.map { x => {
val label = x.cnt //the prediction target is the rental count (cnt) field
val features =
stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap,
mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)))
new LabeledPoint(label, features)
}}
//randomly split the data 6:4 into a training set and a validation set
val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
(trainData, validateData)
}
159
getFeatures
getFeatures method
def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int],
seasonMap: Map[Double, Int], mnthMap: Map[Double, Int],
hrMap: Map[Double, Int], holidayMap: Map[Double, Int],
weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int],
weatherMap: Map[Double, Int]): Array[Double] = {
var featureArr: Array[Double] = Array()
//
featureArr ++= getCategoryFeature(bikeData.yr, yrMap)
featureArr ++= getCategoryFeature(bikeData.season, seasonMap)
featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap)
featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap)
featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap)
featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap)
featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap)
featureArr ++= getCategoryFeature(bikeData.hr, hrMap)
//
featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum,
bikeData.windspeed)
featureArr
}
160
getCategoryFeature
getCategoryFeature method
def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]):
Array[Double] = {
var featureArray = Array.ofDim[Double](categoryMap.size)
val index = categoryMap(fieldVal)
featureArray(index) = 1
featureArray
}
161
Model
Model
! trainModel
◦ train the Model with LinearRegressionWithSGD.train
! evaluateModel
◦ RMSE trainModel
Model
162
trainModel evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
numIterations: Int, stepSize: Double, miniBatchFraction: Double):
(LinearRegressionModel, Double) = {
val startTime = new DateTime()
// LinearRegressionWithSGD.train
val model = LinearRegressionWithSGD.train(trainData, numIterations, stepSize,
miniBatchFraction)
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
//MyLogger.debug(model.toPMML()) // model debug
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: LinearRegressionModel):
Double = {
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] RMSE
}
val metrics = new RegressionMetrics(scoreAndLabels)
val rmse = metrics.rootMeanSquaredError()// rootMeanSquaredError rmse
rmse
}
163
Model
Model! tuneParameter
◦ iteration stepSize miniBatchFraction
trainModel evaluateModel
RMSE
164
tuneParameter
def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint])
= {
val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100)
val stepSizeArr: Array[Double] = Array(10, 50, 100, 200)
val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1)
val evalArr =
for (iteration <- iterationArr; stepSize <- stepSizeArr;
miniBatchFraction <- miniBatchFractionArr)
yield { // train a model for every parameter combination and record its RMSE
val (model, duration) = trainModel(trainData, iteration, stepSize,
miniBatchFraction)
val rmse = evaluateModel(validateData, model)
println("parameter: iteration=%d, stepSize=%f, batchFraction=%f, rmse=%f"
.format(iteration, stepSize, miniBatchFraction, rmse))
(iteration, stepSize, miniBatchFraction, rmse)
}
val bestEvalAsc = (evalArr.sortBy(_._4))
val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE
println("best parameter: iteration=%d, stepSize=%f, batchFraction=%f, rmse=%f"
.format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A.
◦ BikeShareRegressionLR.scala Scala
◦ hour.csv data
◦ BikeShareRegressionLR.scala ( TODO
B. feature
◦ BikeShareRegressionLR
● feature dayType(Double ) dayType
● holiday=0 and workingday=0 -> dayType=0
● holiday=1 -> dayType=1
● holiday=0 and workingday=1 -> dayType=2
● dayType feature Model( getFeatures getCategoryInfo)
◦
165
Linear Regression
============== tuning parameters(Category) ==================
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.500000, rmse=256.370620
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.800000, rmse=256.376770
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=1.000000, rmse=256.407185
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.500000, rmse=250.037095
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.800000, rmse=250.062817
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=1.000000, rmse=250.126173
! Random Forest (multitude) (Decision
Tree)
◦ (mode)
◦ (mean)
!
◦ overfit
◦ (missing value)
◦
166
(RandomForest)
! import org.apache.spark.mllib.tree.RandomForest
! import org.apache.spark.mllib.tree.model.RandomForestModel
! RandomForest.trainRegressor Model(RandomForestModel
◦ val model=RandomForest.trainRegressor(trainData, categoricalFeaturesInfo,numTrees,
featureSubsetStrategy, impurity, maxDepth, maxBins)
● trainData RDD[LabeledPoint]
● categoricalFeaturesInfo declares which features of trainData are categorical, as a Map[feature index, number of categories];
features not listed are treated as continuous
● e.g. Map(0->2,4->10) marks the 1st and 5th features as categorical with 2 and 10 categories
● numTrees ( Model )
● impurity ( variance)
● featureSubsetStrategy Feature ( auto )
● maxDepth
● overfit
●
● maxBins
● maxBins must be at least as large as the largest category count declared in categoricalFeaturesInfo
167
Random Forest Regression in Spark
168
trainModel and evaluateModel
def trainModel(trainData: RDD[LabeledPoint],
impurity: String, maxDepth: Int, maxBins: Int): (RandomForestModel, Double) = {
val startTime = new DateTime()
val cateInfo = BikeShareEntity.getCategoryInfo(true) // categoricalFeaturesInfo
val model = RandomForest.trainRegressor(trainData, cateInfo, 3, "auto", impurity,
maxDepth, maxBins) // train a Model with 3 trees
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
//MyLogger.debug(model.toDebugString) // dump the trained trees
(model, duration.getMillis)
}
def evaluateModel(validateData: RDD[LabeledPoint], model: RandomForestModel): Double =
{
val scoreAndLabels = validateData.map { data =>
var predict = model.predict(data.features)
(predict, data.label) // RDD[( )] RMSE
}
val metrics = new RegressionMetrics(scoreAndLabels)
val rmse = metrics.rootMeanSquaredError()// rootMeanSquaredError rmse
rmse
}
! [Exercise]
◦ Regression.zip Package Object
data Build Path Scala
IDE
◦ BikeShareRegressionRF
◦ RandomForest Decision Tree
169
RandomForest Regression
! Spark
! RDD(Resilient Distributed Datasets)
! Scala
! Spark MLlib
!
170
Outline
! Etu & udn Hadoop Competition 2016 

! ETU udn udn
(
Open Data)


! EHC 2015/6 ~ 2015/10
(View) (Order)
(Member) 2015/11
(Storeid) (Cat_id1) (Cat_id2)
172
1) Data Feature LabeledPoint Data
◦ Feature : 6~9 View/Order
◦ Label : 10 Order ( 1 0)
◦ Feature : …
2) LabeledPoint Data Training Set Validating Set( 6:4
Split)
3) Training Set Validating Set Machine Learning Model
4) Testing Set
◦ Feature : 6~10 View/Order
◦ Features : 1)
5) 3) Model Testing Set
6)
7) 1) ~ 6)
173
! View/Order uid-storeid-cat_id1-cat_id2
Features
! ( RFM 6~9 View/Order )
◦ View – viewRecent, viewCnt, viewLast1MCnt,
viewLast2MCnt( ,6~9 , ,
)
◦ Order – orderRecent, orderCnt, orderLast1MCnt,
orderLast2MCnt( ,6~9 , ,
)
◦ avgDaySpan, avgViewCnt, lastViewDate, lastViewCnt (
, , ,
)
174
– Features(I)
! Member features (a sketch follows after this slide)
◦ gender, ageScore, cityScore (gender, ordinal encoding of the age range,
ordinal encoding of the city)
◦ ageScore: encoded 1~11
● EX: if (ages.equals(20 )) ageScore = 1
◦ cityScore: encoded 1~24
● EX: if (livecity.equals( )) cityScore = 24
! Missing Value handling
◦ Fill missing values with a default value
● Gender: 2 (unknown)
● Ages: 35-39
● City:
175
– Features(II)
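A hypothetical sketch of the ordinal encoding and missing-value defaults described above (the lookup tables and default scores are illustrative, not the actual competition values):
// Map age ranges / cities onto ordinal scores; unseen or missing values get the defaults above
val ageTable = Map("20-" -> 1.0, "20-24" -> 2.0 /* ... up to the oldest range -> 11.0 */)
val cityTable = Map("cityA" -> 1.0 /* ... */, "cityX" -> 24.0)
def ageScore(ages: String): Double = ageTable.getOrElse(ages, 4.0)   // missing -> the "35-39" bucket
def cityScore(city: String): Double = cityTable.getOrElse(city, 1.0) // missing -> default city score
def genderScore(g: String): Double = if (g == null || g.isEmpty) 2.0 else g.toDouble // missing -> 2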
! Weather features (Central Weather Bureau Open Data)
◦ http://www.cwb.gov.tw/V7/climate/monthlyData/mD.htm
◦ Monthly statistics for months 6~10
◦
◦
◦ Pre-processed file: https://drive.google.com/file/d/0B-
b4FvCO9SYoN2VBSVNjN3F3a0U/view?usp=sharing
! 35 Features in total (keyed by uid-storeid-cat_id1-cat_id2)
176
– Features(III)
177
– LabeledPoint Data
Sort each feature's values and encode them into N buckets
EX: viewCnt encoded into 5 levels, from viewCnt=5 for the top bucket down to viewCnt=1 for the lowest (a sketch follows)
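One plausible reading of the bucket encoding above, as a rank-based sketch (bucket boundaries are illustrative, not the actual cut-offs used in the competition):
import org.apache.spark.rdd.RDD
// Rank the values of one numeric feature and map them onto `levels` ordinal buckets
def bucketEncode(values: RDD[Double], levels: Int = 5): RDD[(Double, Double)] = {
  val n = values.count().toDouble
  values.sortBy(identity).zipWithIndex().map { case (v, idx) =>
    (v, (idx * levels / n).toInt + 1.0) // smallest values -> 1, largest -> `levels`
  }
}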
! Xgboost (Extreme Gradient Boosting)
◦ Input: LabeledPoint Data (Training Set)
● 35 Features
● Label (1/0; Label=1 means an order was placed, otherwise 0)
◦ Parameter:
● max_depth: maximum Tree depth
● nround: number of boosting rounds
● Objective: binary:logistic (binary classification)
◦ Implement:
178
– Machine Learning(I)
val param = List("objective" -> "binary:logistic", "max_depth" -> 6)
val model = XGBoost.train(trainSet, param, nround, 2, null, null)
! Xgboost evaluation and tuning
◦ Evaluate (with validating Set):
● val predictRes = model.predict(validateSet)
● Score the predictions with the F_measure
◦ Parameter Tuning:
● Sweep max_depth=(5~10) and nround=(10~25) and keep the best
combination (see the sketch after the metrics below)
● Best result: max_depth=6, nround=10
179
– Machine Learning(II)
Precision = 0.16669166166766647 F1 measure = 0.15969926394341
Accuracy = 0.15065655700028824 Micro recall = 0.21370309951060
Micro precision = 0.3715258082813 Micro F1 measure = 0.271333885
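A hedged sketch of that parameter sweep, reusing only the Xgboost calls shown on these slides; evaluateFMeasure is an assumed helper that turns predictRes and the validating Set into an F-measure:
// Hypothetical grid search over max_depth and nround, scored on the validating Set
for (maxDepth <- 5 to 10; nround <- Array(10, 15, 20, 25)) {
  val param = List("objective" -> "binary:logistic", "max_depth" -> maxDepth)
  val model = XGBoost.train(trainSet, param, nround, 2, null, null)
  val predictRes = model.predict(validateSet)
  println("max_depth=%d, nround=%d, F1=%f".format(maxDepth, nround, evaluateFMeasure(predictRes, validateSet)))
}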
! Performance Improvement
◦ Use the trained model to pick the top-N most important Features and re-train with only those Features
180
– Machine Learning(III)
Run time: 90000ms -> 72000ms (local mode)
! Run on the cluster with yarn as the resource manager
◦ Use spark-submit to dispatch the JOB to the Workers
181
spark-submit --class ehc.RecommandV4 --deploy-mode cluster --
master yarn ehcFinalV4.jar
! Do not hard-code the master URL when you new the SparkContext
new SparkContext(new
SparkConf().setAppName("ehcFinal051").setMaster("local[4]"))
➔ Remove setMaster and let spark-submit (--master) decide, as in the sketch below
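A minimal sketch of the cluster-friendly version (only setMaster is removed; the app name comes from the snippet above):
// The master now comes from spark-submit's --master option instead of being hard-coded
val sc = new SparkContext(new SparkConf().setAppName("ehcFinal051"))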
182
Spark-submit Run Script Sample
###### Script to submit the Driver Program to Spark (with yarn as the cluster Manager) via Spark Submit ######
###### for linux-like system #########
# delete output on hdfs first
`hadoop fs -rm -R -f /user/team007/data/output`
# submit spark job
echo -e "processing spark job"
spark-submit --deploy-mode cluster --master yarn --jars lib/jcommon-1.0.23.jar,lib/
joda-time-2.2.jar --class ehc.RecommandV4 ehcFinalV4.jar Y
# write to result_yyyyMMddHHmmss.txt
echo -e "write to outFile"
hadoop fs -cat /user/team007/data/output/part-* > 'result_'`date +%Y%m%d%H%M%S`'.txt'
! Feature
! Feature
◦
183
–
! Input Single Node
◦ Worker merge
◦ uid-storeid-cat_id1-cat_id2 Sort
! F-Measure
◦ Used to evaluate and compare Models
◦ Computed with Spark's MultilabelMetrics (see the snippet below)
!
◦
184
–
val scoreAndLabels: RDD[(Array[Double], Array[Double])] = …
val metrics = new MultilabelMetrics(scoreAndLabels)
println(s"F1 measure = ${metrics.f1Measure}")
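For context, a hypothetical way the (predictions, labels) pairs could be assembled before the snippet above; predictedByUid and actualByUid are assumed RDDs keyed by member, holding arrays of encoded storeid-cat_id1-cat_id2 label codes:
// Join each member's predicted label codes with the labels actually ordered in the target month
val scoreAndLabels: RDD[(Array[Double], Array[Double])] =
  predictedByUid.join(actualByUid).map { case (_, (pred, actual)) => (pred, actual) }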
!
! Spark MLlib
◦ Feature
Engineering
! Spark MLlib
◦
185
186

[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探

  • 2. 2 Law[Data applications] are like sausages. It is better not to see them being made. —Otto von Bismarck
  • 3. ! Spark ◦ ● Spark ● Scala ● RDD ! LAB ◦ ~ ● Spark Scala IDE ! Spark MLlib ◦ … ● Scala + lambda + Spark MLlib ● Clustering Classification Regression 3
  • 4. ! github page: https://github.com/yclee0418/sparkTeach ◦ installation: Spark ◦ codeSample: Spark ● exercise - ● https://github.com/yclee0418/sparkTeach/tree/master/ codeSample/exercise ● final - ● https://github.com/yclee0418/sparkTeach/tree/master/ codeSample/final 4
  • 5. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib 5 Outline
  • 6. ! ◦ 2020 44ZB(IDC 2013~2014) ◦ ! ◦ MapReduce by Google(2004) ◦ Hadoop HDFS MapReduce by Yahoo!(2005) ◦ Spark Hadoop 10~1000 by AMPLab (2013) ! [ ]Spark Hadoop 6 –
  • 7. ! AMPLab ! ! API ◦ Java Scala Python R ! One Stack to rule them all ◦ SQL Streaming ◦ RDD 7 Spark
  • 8. ! Cluster Manager ◦ Standalone – Spark Manager ◦ Apache Mesos ◦ Hadoop YARN 8 Spark
  • 9. ! [exercise]Spark ◦ JDK 1.8 ◦ spark-2.0.1.tgz(http://spark.apache.org/downloads.html) ◦ Terminal (for Mac) ● cd /Users/xxxxx/Downloads ( ) ● tar -xvzf spark-2.0.1.tgz ( ) ● sudo mv spark-2.0.1 /usr/local/spark (spark /usr/local) ● cd /usr/local/spark ● ./build/sbt package( spark 1 ) ● ./bin/spark-shell ( Spark shell pwd / usr/local/spark) 9 Spark (2.0.1) [Tips] https://goo.gl/oxNbIX ./bin/run-example org.apache.spark.examples.SparkPi
  • 10. ! Spark Shell Spark command line ◦ Spark ! spark-shell ◦ [ ] Spark binspark-shell ! ◦ var res1: Int = 3 + 5 ◦ import org.apache.spark.rdd._ ◦ val intRdd: RDD[Int]=sc.parallelize(List(1,2,3,4,5)) ◦ intRdd.collect ◦ val txtRdd=sc.textFile(file:///Spark /README.md) ◦ txtRdd.count ! spark-shell ◦ [ ] :quit Ctrl D 10 Spark Shell Spark Scala [Tips]: ➢ var val ? ➢ intRdd txtRdd ? ➢ org. [Tab] ? ➢ http://localhost:4040
  • 11. ! Spark ! RDD(Resilient Distributed Dataset) ! Scala ! Spark MLlib 11 Outline
  • 12. ! Google ! Map Reduce ! MapReduce ◦ Map (K1, V1) ! list(K2, V2) ◦ Reduce (K2, list(V2))!list(K3, V3) ! ( Word Count ) 12 RDD MapReduce
  • 13. ! MapReduce on Hadoop Word Count … ◦ iteration iteration ( ) … 13 Hadoop … HDFS
  • 14. ! Spark – RDD(Resilient Distribute Datasets) ◦ In-Memory Data Processing and Sharing ◦ (tolerant) (efficient) ! ◦ (lineage) – RDD ◦ lineage ! ◦ Transformations: In memory lazy lineage RDD ◦ Action: return Storage ◦ Persistence: RDD 14 Spark … : 1+2+3+4+5 = 15 Transformation Action
  • 16. ! SparkContext.textFile – RDD ! map: RDD RDD ! filter: RDD RDD ! reduceByKey: RDD Key RDD Key ! groupByKey: RDD Key RDD ! join cogroup: RDD Key RDD ! sortBy reverse: RDD ! take(N): RDD N RDD ! saveAsTextFile: RDD 16 RDD
  • 17. ! count: RDD ! collect: RDD Collection(Seq ! head(N): RDD N ! mkString: Collection 17 [Tips] • • Transformation
  • 18. ! [Exercise] spark-shell ◦val intRDD = sc.parallelize(List(1,2,3,4,5,6,7,8,9,0)) ◦intRDD.map(x => x + 1).collect() ◦intRDD.filter(x => x > 5).collect() ◦intRDD.stats ◦val mapRDD=intRDD.map{x=>(g+(x%3), x)} ◦mapRDD.groupByKey.foreach{x=>println(key: %s, vals=%s.format(x._1, x._2.mkString(,)))} ◦mapRDD.reduceByKey(_+_).foreach(println) ◦mapRDD.reduceByKey{case(a,b) => a+b}.foreach(println) 18 RDD
  • 19. ! [Exercise] (The Gettysburg Address) ◦ (The Gettysburg Address)(https:// docs.google.com/file/d/0B5ioqs2Bs0AnZ1Z1TWJET2NuQlU/ view) gettysburg.txt ◦ gettysburg.txt ( ) ● ◦ ◦ ◦ 19 RDD (Word Count ) sc.textFile flatMap split toLowerCase, filter sortBy foreach https://github.com/yclee0418/sparkTeach/blob/master/codeSample/exercise/ WordCount_Rdd.txt take(5) foreach reduceByKey
  • 20. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib 20 Outline
  • 21. ! Scala Scalable Language ( ) ! Scala ◦ lambda expression Scala Scala: List(1,2,3,4,5).foreach(x=>println(item %d.format(x))) Java: Int[] intArr = new Array[] {1,2,3,4,5}; for (int x: intArr) println(String.format(item %d, x)); ! scala Java .NET ! ( actor model akka) ! Spark
  • 22. ! import ◦ import org.apache.spark.SparkContext ◦ import org.apache.spark.rdd._ ( rdd class) ◦ import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } ( clustering class) ! ◦ val int1: Int = 5 ( error) ◦ var int2: Int = 5 ( ) ◦ val int = 5 ( ) ! ( ) ◦ def voidFunc(param1: Type, param2: Type2) = { … } 22 Scala def setLogger = { Logger.getLogger(com).setLevel(Level.OFF) Logger.getLogger(io).setLevel(Level.OFF) }
  • 23. ! ( ) ◦ def rtnFunc1(param1: Type, param2: Type2): Type3 = { val v1:Type3 = … v1 // } ! ( ) ◦ def rtnFunc2(param1: Type, param2: Type2): (Type3, Type4) = { val v1: Type3 = … val v2: Type4= … (v1, v2) // } 23 Scala def getMinMax(intArr: Array[Int]):(Int,Int) = { val min=intArr.min val max=intArr.max (min, max) }
  • 24. ! ◦ val res = rtnFunc1(param1, param2) ( res ) ◦ val (res1, res2) = rtnFunc2(param1, param2) ( res1,res2 ) ◦ val (_, res2) = rtnFunc2(param1, param2) ( ) ! For Loop ◦ for (i <- collection) { … } ! For Loop ( yield ) ◦ val rtnArr = for (i <- collection) yield { … } 24 Scala val intArr = Array(1,2,3,4,5,6,7,8,9) val multiArr= for (i <- intArr; j <- intArr) yield { i*j } //multiArr 81 99 val (min,max)=getMinMax(intArr) val (_, max)=getMinMax(intArr)
  • 25. ! Tuple ◦ Tuple ◦ val v=(v1,v2,v3...) v._1, v._2, v._3… ◦ lambda ◦ lambda (_) 25 Scala val intArr = Array(1,2,3,4,5,7,8,9) val res=getMinMax(intArr) //res=(1,9)=>tuple val min=res._1 // res val max=res._2 // res val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple val intArr2=intArr.map(x=> (x._1 * x._2 * x._3)) //intArr2: Array[Int] = Array(6, 120, 504) val intArr3=intArr.filter(x=> (x._1 + x._2 > x._3)) //intArr3: Array[(Int, Int, Int)] = Array((4,5,6), (7,8,9)) val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple def getThird(x:(Int,Int,Int)): Int = { (x._3) } val intArr2=intArr.map(getThird(_)) val intArr2=intArr.map(x=>getThird(x)) // //intArr2: Array[Int] = Array(3, 6, 9)
  • 26. ! Class ◦ Scala Class JAVA Class ● private / protected public ● Class 26 Scala Scala: class Person(userID: Int, name: String) // private class Person(val userID: Int, var name: String) // public userID val person = new Person(102, John Smith)// person.userID // 102 Person class Java : public Class Person { private final int userID; private final String name; public Person(int userID, String name) { this.userID = userID; this.name = name; }}
  • 27. ! Object ◦ Scala static instance ◦ Scala Object static ● Scala Object singleton class instance ! Scala Object vs Class ◦ object utility Spark Driver Program ◦ class Entity 27 Scala Scala Object: object Utility { def isNumeric(input: String): Boolean = input.trim() .matches(s[+-]?((d+(ed+)?[lL]?)|(((d+(.d*)?)|(.d+))(ed+)?[fF]?))) def toDouble(input: String): Double = { val rtn = if (input.isEmpty() || !isNumeric(input)) Double.NaN else input.toDouble rtn }} val d = Utility.toDouble(20) // new
  • 28. ! ◦ val intArr = Array(1,2,3,4,5,7,8,9) ! ◦ val intArrExtra = intArr ++ Array(0,11,12) ! map: ! filter: ! join: Map Key Map ! sortBy reverse: ! take(N): N 28 scala val intArr = Array(1,2,3,4,5,7,8,9) val intArr2=intArr.map(_ * 2) //intArr2: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18) val intArr3=intArr.filter(_ > 5) //intArr3: Array[Int] = Array(6, 7, 8, 9) val intArr4=intArr.reverse //intArr4: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
  • 29. ! sum: ◦ val sum = Array(1,2,3,4,5,7,8,9).sum ! max: ◦ val max = Array(1,2,3,4,5,7,8,9).max ! min: ◦ val max = Array(1,2,3,4,5,7,8,9).min ! distinct: 29 scala val intArr = Array(1,2,3,4,5,7,8,9) val sum = intArr.sum //sum = 45 val max = intArr.max //max = 9 val min = intArr.min //min = 1 val disc = Array(1,1,1,2,2,2,3,3) //disc = Array(1,2,3)
  • 30. ! spark-shell ! ScalaIDE for eclipse 4.4.1 ◦ http://scala-ide.org/download/sdk.html ◦ ◦ ( ) ◦ ◦ ScalaIDE 30 (IDE)
  • 31. ! Driver Program(word complete breakpoint ) ! spark-shell jar ! ◦Eclipse 4.4.2 (Luna) ◦ Scala IDE 4.4.1 ◦ Scala 2.11.8 and Scala 2.10.6 ◦ Sbt 0.13.8 ◦ Scala Worksheet 0.4.0 ◦ Play Framework support 0.6.0 ◦ ScalaTest support 2.10.0 ◦ Scala Refactoring 0.10.0 ◦ Scala Search 0.3.0 ◦ Access to the full Scala IDE ecosystem 31 Scala IDE for eclipse
  • 32. ! Scala IDE Driver Program ◦ Scala Project ◦ Build Path ● Spark ● Scala ◦ package ● package ( ) ◦ scala object ◦ ◦ debug ◦ Jar ◦ spark-submit Spark 32 Scala IDE Driver Program
  • 33. ! Scala IDE ◦ FILE -> NEW -> Scala Project ◦ project FirstScalaProj ◦ JRE 1.8 (1.7 ) ◦ ◦ Finish 33 Scala Project
  • 34. ! Package Explorer Project Explorer FirstScalaProj Build Path -> Configure Build Path 34 Build Path [Tips]: Q: Package Project Explorer A: ! Scala perspective ! Scala perspective -> Window -> Show View
  • 35. ! Spark Driver Program Build Path ◦ jar ◦ Scala Library Container 2.11.8(IDE 2.11.8 ) ! Configure Build Path Java Build Path Libraries - >Add External JARs… ◦Spark Jar Spark /assembly/target/scala-2.11/jars/ ◦ jar ! Java Build Path Scala Library Container 2.11.8 35 Build Path
  • 36. ! Package Explorer FirstScalaProj src package ◦ src ->New->Package( Ctrl N) ◦ bikeSharing Package ! FirstScalaProj data (Folder) input 36 Package
  • 37. ! (gettysburg.txt)copy data ! bikeSharing Package Scala Object BikeCountSort ! 37 Scala Object
  • 38. package practice1 //spark lib import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.rdd._ //log import org.apache.log4j.Logger import org.apache.log4j.Level object WordCount { def main(args: Array[String]): Unit = { // Log Console Logger.getLogger(org).setLevel(Level.ERROR) //mark for MLlib INFO msg val sc = new SparkContext(new SparkConf().setAppName(WordCount).setMaster(local[*])) val rawRdd = sc.textFile(data/gettysburg.txt).flatMap { x=>x.split( ) } // (toLowerCase ) ( filter ) val txtRdd = rawRdd.map { x => x.toLowerCase.trim }.filter { x => !x.equals() } val countRdd = txtRdd.map { x => (x, 1) } // 1) Map val resultRdd = countRdd.reduceByKey { case (a, b) => a + b } // ReduceByKey val sortResRdd = resultRdd.sortBy((x => x._2), false) // sortResRdd.take(5).foreach { println } // sortResRdd.saveAsTextFile(data/wc_output) } } 38 WordCount import Library object main saveAsTextFile
  • 39. ! word complete ALT / word complete ! ( tuple ) 39 IDE
  • 40. ! debug configuration ◦ icon Debug Configurations ◦ Scala Application Debug ● Name WordCount ● Project FirstScalaProj ● Main Class practice1.WordCount ◦ Launcher 40 Debug Configuration
  • 41. ! icon Debug Configuration Debug console 41 [Tips] • data/output sortResRdd ( part-xxxx ) • Log Level console • output
  • 42. ! Spark-Submit JAR ◦ Package Explorer FirstScalaProj - >Export...->Java/JAR file-> FirstScalaProj src JAR File 42 JAR
  • 43. ! input output JAR File ◦ data JAR File 43 Spark-submit
  • 45. ! Command Line JAR File ! Spark-submit ./bin/spark-submit --class <main-class> (package scala object ) --master <master-url> ( master URL local[Worker thread num]) --deploy-mode <deploy-mode> ( Worker Cluster Client Client) --conf <key>=<value> ( Spark ) ... # other options <application-jar> (JAR ) [application-arguments] ( Driver main ) 45 Spark-submit submit JOB Spark /bin/spark-submit --class practice1.WordCount -- master local[*] WordCount.jar [Tips]: ! spark-submit JAR data ! merge output ◦ linux: cat data/output/part-* > res.txt ◦ windows: type dataoutputpart-* > res.txt
  • 46. ! Exercise wordCount Package WordCount2 Object ◦ gettysburg.txt ( ) ● ◦ ● Hint1: (index) ● val posRdd=txtRdd.zipWithIndex() ● Hint2: reduceByKey groupByKey index 46 Word Count
  • 47. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 47 Outline
  • 50. ! (Supervised learning) ◦ (Training Set) ◦ (Features) (Label) ◦ Regression Classification ( ) 50 http://en.proft.me/media/science/ml_svlw.jpg
  • 51. ! (Unsupervised learning) ◦ ( Label) ◦ ◦ Clustering ( KMeans) 51http://www.cnblogs.com/shishanyuan/p/4747761.html
  • 52. ! MLlib Machine Learning library Spark ! ◦ RDD ◦ 52 Spark MLlib http://www.cnblogs.com/shishanyuan/p/4747761.html
  • 54. ! Bike Sharing Dataset ( ) ! https://archive.ics.uci.edu/ml/datasets/ Bike+Sharing+Dataset ◦ ● hour.csv: 2011.01.01~2012.12.30 17,379 ● day.csv: hour.csv 54 Spark MLlib Let’s biking
  • 55. 55 Bike Sharing Dataset Features Label (for hour.csv only) (0 to 6) (1 to 4)
  • 56. ! ◦ (Summary Statistics): MultivariateStatisticalSummary Statistics ◦ Feature ( ) Label ( ) (correlation) Statistics ! ◦ Clustering KMeans ! ◦ Classification Decision Tree LogisticRegressionWithSGD ! ◦ Regression Decision Tree LinearRegressionWithSGD 56 Spark MLlib
  • 57. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 57 Outline
  • 58. 58 ! (Summary Statistics) ◦ ◦ ◦Spark ● 1: RDD[Double/ Float/Int] RDD stats ● 2: RDD[Vector] Statistics.colStats
  • 59. 59 ! (correlation) ◦ (Correlation ) ◦ Spark Pearson Spearman ◦ r Statistics.corr ● 0 < | r | < 0.3 ( ) ● 0.3 <= | r | < 0.7 ( ) ● 0.7 <= | r | < 1 ( ) ● r = 1 ( )
  • 60. 60 A. Scala B. Package Scala Object C. data Folder D. Library
  • 61. ! ScalaIDE Scala folder package Object ◦ SummaryStat ( ) ● src ● bike (package ) ● BikeSummary (scala object ) ● data (folder ) ● hour.csv ! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8 61
  • 62. 62 A. import B. main Driver Program C. Log D. SparkContext
  • 63. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat.{ MultivariateStatisticalSummary, Statistics } object BikeSummary { def main(args: Array[String]): Unit = { Logger.getLogger(com).setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]")) } } ! spark-shell sparkContext sc ! Driver Program sc ◦ appName - Driver Program ◦ master - master URL 63
  • 64. 64 ! prepare ◦ input file Features Label RDD
  • 65. ! lines.map features( 3~14 ) label( 17 ) RDD ! RDD : ◦ RDD[Array] ◦ RDD[Tuple] ◦ RDD[BikeShareEntity] prepare def prepare(sc: SparkContext): RDD[???] = { val rawData=sc.textFile(data/hour.csv) //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(, ).map { x => x.trim() } } //split columns with comma val bikeData:RDD[???]=lines.map{ … } //??? depends on your impl } 65
  • 66. ! RDD[Array]: ◦ val bikeData:RDD[Array[Double]] =lines. map{x=>(x.slice(3,13).map(_.toDouble) ++ Array(x(16).toDouble))} ◦ 利弊: prepare實作容易,後面用起來痛苦(要記欄位在Array中的 index),也容易出包 ! RDD[Tuple]: ◦ val bikeData:RDD[(Double, Double, Double, …, Double)] =lines.map{case(season,yr,mnth,…,cnt)=>(season.toDouble, yr.toDouble, mnth.toDouble,…cnt.toDouble)} ◦ 利弊: prepare實作較不易,後面用起來痛苦,比較不會出包(可用較 佳的變數命名來接回傳值) ◦ 例: val features = bikeData.map{case(season,yr,mnth,…,cnt)=> (season, yr, math, …, windspeed)} 66
  • 67. ! RDD[ Class] : ◦ val bikeData:RDD[BikeShareEntity] = lines.map{ x=> BikeShareEntity(⋯)} ◦ 利弊: prepare實作痛苦,後面用起來快樂(用entity物件操作,不 用管欄位位置、抽象化),不易出包 ◦ 例: val labelRdd = bikeData.map{ ent => { ent.label }} Case Class Class case class BikeShareEntity(instant: String,dteday:String,season:Double, yr:Double,mnth:Double,hr:Double,holiday:Double,weekday:Double, workingday:Double,weathersit:Double,temp:Double, atemp:Double,hum:Double,windspeed:Double,casual:Double, registered:Double,cnt:Double) 67 map RDD[BikeShareEntity] val bikeData = rawData.map { x => BikeShareEntity(x(0), x(1), x(2).toDouble, x(3).toDouble,x(4).toDouble, x(5).toDouble, x(6).toDouble,x(7).toDouble,x(8).toDouble, x(9).toDouble,x(10).toDouble,x(11).toDouble,x(12).toDouble, x(13).toDouble,x(14).toDouble,x(15).toDouble,x(16).toDouble) }
  • 68. 68 ! (Class) ! prepare ◦ input file Features Label RDD
  • 69. Entity Class //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat. { MultivariateStatisticalSummary, Statistics } object BikeSummary { case class BikeShareEntity(⋯⋯) def main(args: Array[String]): Unit = { Logger.getLogger(com).setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]")) } } 69
  • 70. 70 ! getFeatures method ◦ picks the fields to be summarized ! printSummary method ◦ computes the summary statistics and prints them to the console ! printCorrelation method ◦ computes the correlation coefficients and prints them to the console
  • 71. printSummary def printSummary(entRdd: RDD[BikeShareEntity]) = { val dvRdd = entRdd.map { x => Vectors.dense(getFeatures(x)) } //convert to RDD[Vector] //use Statistics.colStats to compute the column summary statistics val summaryAll = Statistics.colStats(dvRdd) println("mean: " + summaryAll.mean.toArray.mkString(",")) //column means println("variance: " + summaryAll.variance.toArray.mkString(",")) //column variances } 71 getFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = { //fields to summarize val featureArr = Array(bikeData.casual, bikeData.registered,bikeData.cnt) featureArr }
  • 72. 72 printCorrelation def printCorrelation(entRdd: RDD[BikeShareEntity]) = { //extract each field as an RDD[Double] val cntRdd = entRdd.map { x => x.cnt } val yrRdd = entRdd.map { x => x.yr } //correlation between yr and cnt val yrCorr = Statistics.corr(yrRdd, cntRdd) println("correlation: %s vs %s: %f".format("yr", "cnt", yrCorr)) val seaRdd = entRdd.map { x => x.season } //season field val seaCorr = Statistics.corr(seaRdd, cntRdd) println("correlation: %s vs %s: %f".format("season", "cnt", seaCorr)) }
  • 73. A. Run the sample ◦ add BikeSummary.scala to the SummaryStat project ◦ put hour.csv in the data folder ◦ run BikeSummary (complete the TODO parts) B. Exercise ◦ modify getFeatures and printSummary ● print to the console the summary statistics of temperature (temp), humidity (hum) and wind speed (windspeed) ● for each yr and mnth, print to the console the summary statistics of (temp), (hum), (windspeed) and the rental count (cnt) 73 for (yr <- 0 to 1) for (mnth <- 1 to 12) { val yrMnRdd = entRdd.filter { ??? }.map { x => Vectors.dense(getFeatures(x)) } val summaryYrMn = Statistics.colStats( ??? ) println("====== summary yr=%d, mnth=%d ==========".format(yr,mnth)) println("mean: " + ???) println("variance: " + ???) }
  • 74. A. Exercise ◦ complete printCorrelation in BikeSummary ◦ print to the console the correlation between each hour.csv field from yr to windspeed and cnt B. Derive a new feature ◦ extend printCorrelation ● combine yr and mnth into a new feature (yrmo, where yrmo=yr*12+mnth) and print the correlation between yrmo and cnt 74
  • 75. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 75 Outline
  • 76. ! Clustering is unsupervised learning: the Training Set has no Label ! It partitions the data into groups (clusters) ! Observations in the same cluster are more similar to each other than to those in other clusters ! Typical applications ◦ e.g. grouping similar users or items 76 Clustering
  • 77. ! K-Means ! Given observations (x1,x2,...,xn), K-Means partitions the n observations into K clusters (k≤n) so as to minimize the within-cluster sum of squares (WCSS) ! Algorithm: A. pick K initial cluster centers B. assign each observation to its nearest center C. recompute each cluster center (as the mean of its members) D. repeat B and C until the assignments stop changing or the iteration limit is reached 77 One pass of B~C is an iteration; a complete training from fresh initial centers is a RUN
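For reference, the objective K-Means minimizes can be written as (where C_i is the i-th cluster and \mu_i its center):
\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2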
  • 79. ! KMeans.train returns the trained Model (KMeansModel) ◦ val model=KMeans.train(data, numClusters, maxIterations, runs) ● data: the training data (RDD[Vector]) ● numClusters: the number of clusters (K) ● maxIterations: the maximum number of iterations per run; a run stops once it converges or reaches maxIterations ● runs: how many times KMeans is trained from different initial centers; the best run's model is returned ! model.clusterCenters returns the Feature vector at the center of each cluster ! model.computeCost returns the WCSS, which can be used to evaluate the model 79 K-Means in Spark MLlib
  • 80. 80 K-Means on the BikeSharing data ! Cluster hour.csv with KMeans and print the cluster centers to the console ◦ Features: yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, cnt (note that cnt, normally the Label, is used here as a Feature) ◦ numClusters 5 (group the data into 5 clusters) ◦ maxIterations 20 (each run performs at most 20 iterations) ◦ runs 3 (train 3 Runs and keep the best model)
  • 81. 81 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 82. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Clustering (project name) ● src ● bike (package name) ● BikeShareClustering (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 82
  • 83. 83 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 84. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import KMeans library import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } object BikeShareClustering { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]")) } } ! import the KMeans Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 84
  • 85. 85 ! Define the entity (Case Class) ! prepare method ◦ reads the input file and transforms it into an RDD containing the Features and the Label ! Both can be reused from BikeSummary
  • 86. 86 ! getFeatures method ◦ picks the fields used for clustering ! Train with KMeans ◦ call KMeans.train to obtain a KMeansModel ! getDisplayString method ◦ formats a cluster center for printing
  • 87. getFeatures getDisplayString getFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.cnt, bikeData.yr, bikeData.season, bikeData.mnth, bikeData.hr, bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed, bikeData.casual, bikeData.registered) featureArr } 87 getDisplayString def getDisplayString(centers:Array[Double]): String = { val dispStr = """cnt: %.5f, yr: %.5f, season: %.5f, mnth: %.5f, hr: %.5f, holiday: %.5f, weekday: %.5f, workingday: %.5f, weathersit: %.5f, temp: %.5f, atemp: %.5f, hum: %.5f,windspeed: %.5f, casual: %.5f, registered: %.5f""" .format(centers(0), centers(1),centers(2), centers(3),centers(4), centers(5),centers(6), centers(7),centers(8), centers(9),centers(10), centers(11),centers(12), centers(13),centers(14)) dispStr }
  • 88. KMeans //convert the Features into an RDD[Vector] val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) //K=5, at most 20 Iterations, 3 Runs 88 var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => { println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray))) clusterIdx += 1 } } //cluster centers sorted by Cnt
  • 89. 89 //K-Means import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } object BikeShareClustering { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]")) println("============== preparing data ==================") val bikeData = prepare(sc) //read hour.csv into RDD[BikeShareEntity] bikeData.persist() println("============== clusting by KMeans ==================") //convert the Features into an RDD[Vector] val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) //K=5, at most 20 Iterations, 3 Runs var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => { println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray))) clusterIdx += 1 } } //cluster centers sorted by Cnt bikeData.unpersist() }
  • 90. 90 ! yr season mnth hr cnt ! weathersit cnt ( ) ! temp atemp cnt ( ) ! hum cnt ( ) ! correlation
  • 91. 91 ! Evaluate the Model: use WCSS to compare Models trained with different K ! WCSS shrinks as K grows, so choose K by looking at where the WCSS curve flattens out — tune K
  • 92. ! model.computeCost returns the WCSS of the model (the smaller the WCSS, the tighter the clusters) ! Vary numClusters (K) and compare the WCSS ! WCSS keeps decreasing as K increases 92 K-Means println("============== tuning parameters ==================") for (k <- Array(5,10,15,20, 25)) { //compare WCSS for different numClusters val iterations = 20 val tm = KMeans.train(featureRdd, k, iterations,3) println("k=%d, WCSS=%f".format(k, tm.computeCost(featureRdd))) } ============== tuning parameters ================== k=5, WCSS=89540755.504054 k=10, WCSS=36566061.126232 k=15, WCSS=23705349.962375 k=20, WCSS=18134353.720998 k=25, WCSS=14282108.404025
  • 93. A. Run the sample ◦ add BikeShareClustering.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareClustering (complete the TODO parts) B. Derive a new feature ◦ modify BikeClustering ● add yrmo to getFeatures, rerun KMeans, and observe the yrmo values of the cluster centers on the console ● try larger numClusters (ex:50,75,100) 93 K-Means exercise
  • 94. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 94 Outline
  • 95. ! Classification can be binary (Binary Classification) or multi-class (Multi-Class Classification) ! It is supervised learning: the model is trained on labeled data ! Common algorithms ◦ logistic regression, decision trees, naive Bayes 95
  • 97. ! import org.apache.spark.mllib.tree.DecisionTree ! import org.apache.spark.mllib.tree.model.DecisionTreeModel ! DecisionTree.trainClassifier returns the trained Model (DecisionTreeModel) ◦ val model=DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins) ● trainData: the training data, an RDD[LabeledPoint] ● numClasses: number of classes, 2 for binary classification ● categoricalFeaturesInfo: declares which features in trainData are categorical, as Map[feature index, number of categories]; features not listed are treated as continuous ● e.g. Map(0->2,4->10) means the 1st and 5th features are categorical with 2 and 10 distinct values ● impurity: the split criterion (Gini or Entropy) ● maxDepth: maximum tree depth; too deep tends to overfit ● maxBins: maximum number of bins used when splitting features ● maxBins must be at least the largest category count declared in categoricalFeaturesInfo 97 Decision Tree in Spark MLlib
  • 98. ! Goal: predict whether an hour is a high-rental hour ! Turn the rental count into a binary label with a threshold ◦ Features: yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed ◦ Label: 1 if cnt > 200, otherwise 0 ◦ numClasses 2 ◦ impurity gini ◦ maxDepth 5 ◦ maxBins 30 98
  • 99. 99 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 100. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Classification (project name) ● src ● bike (package name) ● BikeShareClassificationDT (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 100
  • 101. 101 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 102. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import decision tree library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareClassificationDT { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationDT").setMaster("local[*]")) } } ! import the Decision Tree Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 102
  • 103. 103 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryInfo method ◦ builds the categoryInfoMap describing the categorical features
  • 104. 104 prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity] val lpData=bikeData.map { x => { val label = if (x.cnt > 200) 1 else 0 //1 if cnt is greater than 200, otherwise 0 val features = Vectors.dense(getFeatures(x)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 105. 105 getFeatures and getCategoryInfo getFeatures method def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1, bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr } //categorical features such as season are shifted by 1 so that their values start from 0 getCategoryInfo method def getCategoryInfo(): Map[Int, Int]= { val categoryInfoMap = Map[Int, Int]( (/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4)) categoryInfoMap } //(index of the feature in featureArr, number of distinct values)
  • 106. 106 ! trainModel method ◦ calls DecisionTree.trainClassifier to train the Model ! evaluateModel method ◦ computes the AUC of the Model produced by trainModel
  • 107. 107 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int,cateInfo: Map[Int, Int]): (DecisionTreeModel, Double) = { val startTime = new DateTime() //start time val model = DecisionTree.trainClassifier(trainData, 2, cateInfo, impurity, maxDepth, maxBins) //train the Model val endTime = new DateTime() //end time val duration = new Duration(startTime, endTime) //training time //MyLogger.debug(model.toDebugString) //dump the Decision Tree structure (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the AUC } val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auc = metrics.areaUnderROC() //use areaUnderROC to get the auc auc }
  • 108. 108 ! tuneParameter method ◦ tries combinations of impurity, Max Depth and Max Bin, calling trainModel and evaluateModel for each, and compares the resulting AUC
  • 109. 109 AUC(Area under the Curve of ROC) Confusion matrix — actual Positive (Label 1): predicted Positive → true positive(TP), predicted Negative → false negative(FN); actual Negative (Label 0): predicted Positive → false positive(FP), predicted Negative → true negative(TN) ! True Positive Rate (TPR): the fraction of actual 1s that are predicted as 1 ◦ TPR=TP/(TP+FN) ! False Positive Rate (FPR): the fraction of actual 0s that are predicted as 1 ◦ FPR=FP/(FP+TN)
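A quick worked example with made-up counts (TP=40, FN=10, FP=20, TN=30), just to show how the two rates are read off the confusion matrix:
TPR = \frac{TP}{TP+FN} = \frac{40}{40+10} = 0.8 \qquad FPR = \frac{FP}{FP+TN} = \frac{20}{20+30} = 0.4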
  • 110. ! Plotting FPR on the X axis against TPR on the Y axis at different thresholds gives the ROC curve ! AUC is the area under the ROC curve 110 AUC = 1: perfect prediction (100% correct); 0.5 < AUC < 1: better than random guessing; AUC = 0.5: no better than random guessing; AUC < 0.5: worse than random guessing AUC(Area under the Curve of ROC)
  • 111. 111 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint], cateInfo: Map[Int, Int]) = { val impurityArr = Array("gini", "entropy") val depthArr = Array(3, 5, 10, 15, 20, 25) val binsArr = Array(50, 100, 200) val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield { //train a model for each parameter combination and compute its AUC val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo) val auc = evaluateModel(validateData, model) println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f" .format(impurity, maxDepth, maxBins, auc)) (impurity, maxDepth, maxBins, auc) } val bestEval = (evalArr.sortBy(_._4).reverse)(0) //the combination with the highest AUC println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 112. 112 Decision Tree //MLlib lib import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.evaluation._ import org.apache.spark.mllib.linalg.Vectors //decision tree import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareClassificationDT { case class BikeShareEntity(…) // case class def main(args: Array[String]): Unit = { MyLogger.setLogger val doTrain = (args != null && args.length > 0 && "Y".equals(args(0))) val sc = new SparkContext(new SparkConf().setAppName("ClassificationDT").setMaster("local[*]")) println("============== preparing data ==================") val (trainData, validateData) = prepare(sc) val cateInfo = getCategoryInfo() if (!doTrain) { println("============== train Model (CateInfo)==================") val (modelC, durationC) = trainModel(trainData, "gini", 5, 30, cateInfo) val aucC = evaluateModel(validateData, modelC) println("validate auc(CateInfo)=%f".format(aucC)) } else { println("============== tuning parameters(cateInfo) ==================") tuneParameter(trainData, validateData, cateInfo) } } }
  • 113. A. Run the sample ◦ add BikeShareClassificationDT.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareClassificationDT (complete the TODO parts) B. Tune the features ◦ modify BikeShareClassificationDT ● compare the AUC with and without the category info ● keep only the features with |correlation| > 0.1 and compare the Model's AUC 113 Decision Tree exercise ============== tuning parameters(cateInfo) ================== parameter: impurity=gini, maxDepth=3, maxBins=50, auc=0.835524 parameter: impurity=gini, maxDepth=3, maxBins=100, auc=0.835524 parameter: impurity=gini, maxDepth=3, maxBins=200, auc=0.835524 parameter: impurity=gini, maxDepth=5, maxBins=50, auc=0.851846 parameter: impurity=gini, maxDepth=5, maxBins=100, auc=0.851846 parameter: impurity=gini, maxDepth=5, maxBins=200, auc=0.851846
  • 114. ! Simple linear regression (y=ax+b) predicts a continuous value (y) ◦ from an explanatory variable (x) ! Logistic regression ◦ is used for classification rather than for predicting a continuous value ! It passes the linear output through an S-shaped (sigmoid) function to obtain a probability p; if p >= 0.5 predict the positive class, otherwise the negative class 114
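Written out, the sigmoid mapping mentioned above (w and b denote the learned weights and intercept):
p = \sigma(w^{T}x + b) = \frac{1}{1 + e^{-(w^{T}x + b)}}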
  • 115. ! import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel } ! LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) returns the trained Model (LogisticRegressionModel) ◦ val model=LogisticRegressionWithSGD.train(trainData,numIterations, stepSize, miniBatchFraction) ● trainData: the training data, an RDD[LabeledPoint] ● numIterations: number of gradient descent (SGD) iterations, default 100 ● stepSize: the SGD step size, default 1 ● miniBatchFraction: fraction of data used per iteration, 0~1, default 1 115 Logistic Regression in Spark http://www.csie.ntnu.edu.tw/~u91029/Optimization.html
  • 116. ! Before training LogisticRegression, each Categorical Feature should be one-of-k (one-hot) encoded ! One-of-K encoding: ◦ expand the feature into N columns (N = number of distinct values) ◦ set the column at the value's index to 1 and all the others to 0 116 Categorical Features — example weathersit: values Clear=1, Mist=2, Light Snow=3, Heavy Rain=4; index Map: weathersit 1→index 0, 2→index 1, 3→index 2, 4→index 3; Encoding: index 0→1000, index 1→0100, index 2→0010, index 3→0001
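A minimal sketch of the encoding idea on a plain Seq of weathersit values (the sample values are made up); the course's own getCategoryFeature method, shown a few slides later, does the same thing per record:
val weatherVals = Seq(1.0, 2.0, 3.0, 4.0, 2.0) // hypothetical weathersit column
val indexMap = weatherVals.distinct.zipWithIndex.toMap // value -> index, e.g. 2.0 -> 1
def oneHot(v: Double): Array[Double] = {
  val arr = Array.ofDim[Double](indexMap.size) // all zeros
  arr(indexMap(v)) = 1.0 // set the value's own slot to 1
  arr
}
println(oneHot(2.0).mkString(",")) // prints 0.0,1.0,0.0,0.0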
  • 117. 117 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 118. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Classification (project name) ● src ● bike (package name) ● BikeShareClassificationLG (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 118
  • 119. 119 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 120. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import Logistic Regression libraries import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.feature.StandardScaler import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel } object BikeShareClassificationLG { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationLG").setMaster("local[*]")) } } ! import the Logistic Regression Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 120
  • 121. 121 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryFeature method ◦ 1-of-k encodes a categorical field into an Array[Double]
  • 122. One-Of-K def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.getField("weathersit") } .distinct().collect().zipWithIndex.toMap //build the value-to-index Map val lpData=bikeData.map { x => { val label = x.getLabel() val features = Vectors.dense(x.getFeatures(weatherMap)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} … } def getFeatures (weatherMap: Map[Double, Int])= { var rtnArr: Array[Double] = Array() var weatherArray:Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0) val index = weatherMap(getField("weathersit")) //weathersit=2; index=1 weatherArray(index) = 1 //weatherArray=Array(0,1,0,0) rtnArr = rtnArr ++ weatherArray …. }
  • 123. ! Standardize each feature so that it has zero mean and unit variance: (value - mean) / standard deviation ! Use StandardScaler def prepare(sc): RDD[LabeledPoint] = { … val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) } //fit the StandardScaler on the full feature RDD val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd ) val lpData2= bikeData.map { x => { val label = x.getLabel() //standardize the features before building the LabeledPoint val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap))) new LabeledPoint(label, features) } } …
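For reference, the transformation StandardScaler applies to each feature column (\mu and \sigma are that column's mean and standard deviation):
z = \frac{x - \mu}{\sigma}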
  • 124. prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value-to-index Map //Standardize val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))} val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap) //standardize the features, including the encoded Category features val lpData = bikeData.map { x => { val label = if (x.cnt > 200) 1 else 0 //1 if cnt is greater than 200, otherwise 0 val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))) new LabeledPoint(label, features) }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 125. 125 getFeatures getFeatures method def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int], mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int], weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int], weatherMap: Map[Double, Int]): Array[Double] = { var featureArr: Array[Double] = Array() //categorical features, 1-of-k encoded featureArr ++= getCategoryFeature(bikeData.yr, yrMap) featureArr ++= getCategoryFeature(bikeData.season, seasonMap) featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap) featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap) featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap) featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap) featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap) featureArr ++= getCategoryFeature(bikeData.hr, hrMap) //continuous features featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr }
  • 126. 126 getCategoryFeature getCategoryFeature method def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = { var featureArray = Array.ofDim[Double](categoryMap.size) //all zeros val index = categoryMap(fieldVal) featureArray(index) = 1 //set the value's own slot to 1 featureArray }
  • 127. 127 ! trainModel method ◦ calls LogisticRegressionWithSGD.train to train the Model ! evaluateModel method ◦ computes the AUC of the Model produced by trainModel
  • 128. 128 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): (LogisticRegressionModel, Double) = { val startTime = new DateTime() //train with LogisticRegressionWithSGD.train val model = LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) val endTime = new DateTime() val duration = new Duration(startTime, endTime) //MyLogger.debug(model.toPMML()) //dump the model for debugging (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: LogisticRegressionModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the AUC } val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auc = metrics.areaUnderROC() //use areaUnderROC to get the auc auc }
  • 129. 129 ! tuneParameter method ◦ tries combinations of iteration, stepSize and miniBatchFraction, calling trainModel and evaluateModel for each, and compares the resulting AUC
  • 130. 130 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = { val iterationArr: Array[Int] = Array(5, 10, 20, 60,100) val stepSizeArr: Array[Double] = Array(10, 50, 100, 200) val miniBatchFractionArr: Array[Double] = Array(0.5,0.8, 1) val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr; miniBatchFraction <- miniBatchFractionArr) yield { //train a model for each parameter combination and compute its AUC val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction) val auc = evaluateModel(validateData, model) println("parameter: iteraion=%d, stepSize=%f, batchFraction=%f, auc=%f" .format(iteration, stepSize, miniBatchFraction, auc)) (iteration, stepSize, miniBatchFraction, auc) } val bestEval = (evalArr.sortBy(_._4).reverse)(0) //the combination with the highest AUC println("best parameter: iteraion=%d, stepSize=%f, batchFraction=%f, auc=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 131. A. Run the sample ◦ add BikeShareClassificationLG.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareClassificationLG (complete the TODO parts) B. Tune the features ◦ modify BikeShareClassificationLG ● compare the AUC with and without the category encoding ● keep only the features with |correlation| > 0.1 and compare the Model's AUC 131 Logistic Regression exercise ============== tuning parameters(Category) ================== parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.500000, auc=0.857073 parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.800000, auc=0.855904 parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=1.000000, auc=0.855685 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.500000, auc=0.852388 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.800000, auc=0.852901 parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=1.000000, auc=0.853237 parameter: iteraion=5, stepSize=100.000000, miniBatchFraction=0.500000, auc=0.852087
  • 132. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics) ◦ Clustering ◦ Classification ◦ Regression 132 Outline
  • 133. ! Regression predicts a continuous value rather than a class ! It is supervised learning ! Common algorithms ◦ least squares (Least Squares), Lasso, ridge regression 133
  • 134. ! import org.apache.spark.mllib.tree.DecisionTree ! import org.apache.spark.mllib.tree.model.DecisionTreeModel ! DecisionTree.trainRegressor returns the trained Model (DecisionTreeModel) ◦ val model=DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity, maxDepth, maxBins) ● trainData: the training data, an RDD[LabeledPoint] ● categoricalFeaturesInfo: declares which features in trainData are categorical, as Map[feature index, number of categories]; features not listed are treated as continuous ● e.g. Map(0->2,4->10) means the 1st and 5th features are categorical with 2 and 10 distinct values ● impurity: the split criterion (only variance is supported for regression) ● maxDepth: maximum tree depth; too deep tends to overfit ● maxBins: maximum number of bins used when splitting features ● maxBins must be at least the largest category count declared in categoricalFeaturesInfo 134 Decision Tree Regression in Spark
  • 135. ! Train a Model that predicts the rental count ◦ Features: yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed ◦ Label: cnt ◦ impurity variance ◦ maxDepth 5 ◦ maxBins 30 135
  • 136. 136 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 137. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Regression (project name) ● src ● bike (package name) ● BikeShareRegressionDT (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 137
  • 138. 138 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 139. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import decision tree library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel object BikeShareRegressionDT { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionDT").setMaster("local[*]")) } } ! import the Decision Tree Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 139
  • 140. 140 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryInfo method ◦ builds the categoryInfoMap describing the categorical features
  • 141. 141 prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity] val lpData=bikeData.map { x => { val label = x.cnt //the prediction target is the rental count column val features = Vectors.dense(getFeatures(x)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 142. 142 getFeatures and getCategoryInfo getFeatures method def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1, bikeData.hr,bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr } //categorical features such as season are shifted by 1 so that their values start from 0 getCategoryInfo method def getCategoryInfo(): Map[Int, Int]= { val categoryInfoMap = Map[Int, Int]( (/*"yr"*/ 0, 2), (/*season*/ 1, 4),(/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4)) categoryInfoMap } //(index of the feature in featureArr, number of distinct values)
  • 143. 143 ! trainModel method ◦ calls DecisionTree.trainRegressor to train the Model ! evaluateModel method ◦ computes the RMSE of the Model produced by trainModel
  • 144. ! The root-mean-square deviation, also called the root-mean-square error (RMSE), measures how far the predictions fall from the observed values ! It is the sample standard deviation of the prediction errors ! The smaller the RMSE, the better the model 144 RMSE(root-mean-square error)
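Written out, with \hat{y}_i the predicted and y_i the observed rental count over n validation samples:
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}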
  • 145. 145 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int, cateInfo: Map[Int,Int]): (DecisionTreeModel, Double) = { val startTime = new DateTime() //start time val model = DecisionTree.trainRegressor(trainData, cateInfo, impurity, maxDepth, maxBins) //train the Model val endTime = new DateTime() //end time val duration = new Duration(startTime, endTime) //training time //MyLogger.debug(model.toDebugString) //dump the Decision Tree structure (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the RMSE } val metrics = new RegressionMetrics(scoreAndLabels) val rmse = metrics.rootMeanSquaredError //use rootMeanSquaredError to get the rmse rmse }
  • 146. 146 ! tuneParameter method ◦ tries combinations of Max Depth and Max Bin, calling trainModel and evaluateModel for each, and compares the resulting RMSE
  • 147. 147 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint], cateInfo: Map[Int, Int]) = { val impurityArr = Array("variance") val depthArr = Array(3, 5, 10, 15, 20, 25) val binsArr = Array(50, 100, 200) val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield { //train a model for each parameter combination and compute its RMSE val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo) val rmse = evaluateModel(validateData, model) println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f" .format(impurity, maxDepth, maxBins, rmse)) (impurity, maxDepth, maxBins, rmse) } val bestEvalAsc = (evalArr.sortBy(_._4)) val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 148. A. Run the sample ◦ add BikeShareRegressionDT.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareRegressionDT.scala (complete the TODO parts) B. Derive a new feature ◦ modify BikeShareRegressionDT ● add a new feature dayType (Double), where ● holiday=0 and workingday=0 → dayType=0 ● holiday=1 → dayType=1 ● holiday=0 and workingday=1 → dayType=2 ● add dayType to the features and retrain the Model (update getFeatures and getCategoryInfo) ◦ remember to update the Categorical Info accordingly 148 Decision Tree exercise ============== tuning parameters(CateInfo) ================== parameter: impurity=variance, maxDepth=3, maxBins=50, rmse=118.424606 parameter: impurity=variance, maxDepth=3, maxBins=100, rmse=118.424606 parameter: impurity=variance, maxDepth=3, maxBins=200, rmse=118.424606 parameter: impurity=variance, maxDepth=5, maxBins=50, rmse=93.138794 parameter: impurity=variance, maxDepth=5, maxBins=100, rmse=93.138794 parameter: impurity=variance, maxDepth=5, maxBins=200, rmse=93.138794
  • 150. ! import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LinearRegressionModel} ! LinearRegressionWithSGD.train(trainData, numIterations, stepSize) returns the trained Model (LinearRegressionModel) ◦ val model=LinearRegressionWithSGD.train(trainData, numIterations, stepSize) ● trainData: the training data, an RDD[LabeledPoint] ● numIterations: number of gradient descent (SGD) iterations ● stepSize: the SGD step size, default 1; tune the stepSize carefully ● miniBatchFraction: fraction of data used per iteration, 0~1, default 1 150 Least Squares Regression in Spark
  • 151. 151 Project setup steps: A. Create a Scala project B. Add a Package and a Scala Object C. Create a data Folder D. Add the Spark Library
  • 152. ! In ScalaIDE, create the Scala project, folder, package, and Object ◦ Regression (project name) ● src ● bike (package name) ● BikeShareRegressionLR (scala object name) ● data (folder for the input data) ● hour.csv ! Build Path ◦ Add the jars under the Spark install path /assembly/target/scala-2.11/jars/ ◦ Set the Scala container to 2.11.8 152
  • 153. 153 A. import the required libraries B. write the main method (Driver Program) C. set the Log level D. initialize the SparkContext
  • 154. //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import log4j import org.apache.log4j.{ Level, Logger } //import linear regression library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.feature.StandardScaler import org.apache.spark.mllib.regression.{ LinearRegressionWithSGD, LinearRegressionModel } object BikeShareRegressionLR { def main(args: Array[String]): Unit = { Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionLR").setMaster("local[*]")) } } ! import the Linear Regression Library ! In the Driver Program, create sc yourself ◦ appName - the name of the Driver Program ◦ master - the master URL to connect to 154
  • 155. 155 ! Define the entity (Case Class) ◦ reuse the one from BikeSummary ! prepare method ◦ reads the input file and turns the Features and Label into RDD[LabeledPoint] ◦ splits the RDD[LabeledPoint] into training and validation sets ! getFeatures method ◦ picks the features used to train the Model ! getCategoryFeature method ◦ 1-of-k encodes a categorical field into an Array[Double]
  • 156. One-Of-K def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => new BikeShareEntity(x) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.getField("weathersit") } .distinct().collect().zipWithIndex.toMap //build the value-to-index Map val lpData=bikeData.map { x => { val label = x.getLabel() val features = Vectors.dense(x.getFeatures(weatherMap)) new LabeledPoint(label, features) //a LabeledPoint is made of a label and a feature Vector }} … } def getFeatures (weatherMap: Map[Double, Int])= { var rtnArr: Array[Double] = Array() var weatherArray:Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0) val index = weatherMap(getField("weathersit")) //weathersit=2; index=1 weatherArray(index) = 1 //weatherArray=Array(0,1,0,0) rtnArr = rtnArr ++ weatherArray …. }
  • 157. ! Standardize each feature so that it has zero mean and unit variance: (value - mean) / standard deviation ! Use StandardScaler def prepare(sc): RDD[LabeledPoint] = { … val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) } //fit the StandardScaler on the full feature RDD val stdScaler = new StandardScaler(withMean=true, withStd=true).fit(featureRdd ) val lpData2= bikeData.map { x => { val label = x.getLabel() //standardize the features before building the LabeledPoint val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap))) new LabeledPoint(label, features) } } …
  • 158. prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint])= { … val bikeData = lines.map{ x => BikeShareEntity( ) }//RDD[BikeShareEntity] val weatherMap=bikeData.map { x => x.weathersit }.distinct().collect().zipWithIndex.toMap //build the value-to-index Map //Standardize val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))} val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap) //standardize the features, including the encoded Category features val lpData = bikeData.map { x => { val label = x.cnt //the prediction target is the rental count val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap, mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap))) new LabeledPoint(label, features) }} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData) }
  • 159. 159 getFeatures getFeatures method def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int], mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int], weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int], weatherMap: Map[Double, Int]): Array[Double] = { var featureArr: Array[Double] = Array() //categorical features, 1-of-k encoded featureArr ++= getCategoryFeature(bikeData.yr, yrMap) featureArr ++= getCategoryFeature(bikeData.season, seasonMap) featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap) featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap) featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap) featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap) featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap) featureArr ++= getCategoryFeature(bikeData.hr, hrMap) //continuous features featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed) featureArr }
  • 160. 160 getCategoryFeature getCategoryFeature method def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = { var featureArray = Array.ofDim[Double](categoryMap.size) //all zeros val index = categoryMap(fieldVal) featureArray(index) = 1 //set the value's own slot to 1 featureArray }
  • 161. 161 ! trainModel method ◦ calls LinearRegressionWithSGD.train to train the Model ! evaluateModel method ◦ computes the RMSE of the Model produced by trainModel
  • 162. 162 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): (LinearRegressionModel, Double) = { val startTime = new DateTime() //train with LinearRegressionWithSGD.train val model = LinearRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) val endTime = new DateTime() val duration = new Duration(startTime, endTime) //MyLogger.debug(model.toPMML()) //dump the model for debugging (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: LinearRegressionModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the RMSE } val metrics = new RegressionMetrics(scoreAndLabels) val rmse = metrics.rootMeanSquaredError //use rootMeanSquaredError to get the rmse rmse }
  • 163. 163 ! tuneParameter method ◦ tries combinations of iteration, stepSize and miniBatchFraction, calling trainModel and evaluateModel for each, and compares the resulting RMSE
  • 164. 164 tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = { val iterationArr: Array[Int] = Array(5, 10, 20, 60,100) val stepSizeArr: Array[Double] = Array(10, 50, 100, 200) val miniBatchFractionArr: Array[Double] = Array(0.5,0.8, 1) val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr; miniBatchFraction <- miniBatchFractionArr) yield { //train a model for each parameter combination and compute its RMSE val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction) val rmse = evaluateModel(validateData, model) println("parameter: iteraion=%d, stepSize=%f, batchFraction=%f, rmse=%f" .format(iteration, stepSize, miniBatchFraction, rmse)) (iteration, stepSize, miniBatchFraction, rmse) } val bestEvalAsc = (evalArr.sortBy(_._4)) val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE println("best parameter: iteraion=%d, stepSize=%f, batchFraction=%f, rmse=%f" .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
  • 165. A. Run the sample ◦ add BikeShareRegressionLR.scala to the Scala project ◦ put hour.csv in the data folder ◦ run BikeShareRegressionLR.scala (complete the TODO parts) B. Derive a new feature ◦ modify BikeShareRegressionLR ● add a new feature dayType (Double), where ● holiday=0 and workingday=0 → dayType=0 ● holiday=1 → dayType=1 ● holiday=0 and workingday=1 → dayType=2 ● add dayType to the features and retrain the Model (update getFeatures and getCategoryInfo) 165 Linear Regression exercise ============== tuning parameters(Category) ================== parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.500000, rmse=256.370620 parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.800000, rmse=256.376770 parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=1.000000, rmse=256.407185 parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.500000, rmse=250.037095 parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.800000, rmse=250.062817 parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=1.000000, rmse=250.126173
  • 166. ! A Random Forest is an ensemble of a multitude of Decision Trees ◦ for classification, the output is the mode of the individual trees' predictions ◦ for regression, the output is the mean of the individual trees' predictions ! Advantages ◦ less prone to overfit than a single tree ◦ tolerates missing values 166 (RandomForest)
  • 167. ! import org.apache.spark.mllib.tree.RandomForest ! import org.apache.spark.mllib.tree.model.RandomForestModel ! RandomForest.trainRegressor returns the trained Model (RandomForestModel) ◦ val model=RandomForest.trainRegressor(trainData, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins) ● trainData: the training data, an RDD[LabeledPoint] ● categoricalFeaturesInfo: declares which features in trainData are categorical, as Map[feature index, number of categories]; features not listed are treated as continuous ● e.g. Map(0->2,4->10) means the 1st and 5th features are categorical with 2 and 10 distinct values ● numTrees: number of trees in the forest ● impurity: the split criterion (only variance is supported for regression) ● featureSubsetStrategy: how many Features each tree considers ("auto" lets Spark decide) ● maxDepth: maximum tree depth; too deep tends to overfit ● maxBins: maximum number of bins used when splitting features ● maxBins must be at least the largest category count declared in categoricalFeaturesInfo 167 Random Forest Regression in Spark
  • 168. 168 trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (RandomForestModel, Double) = { val startTime = new DateTime() //start time val cateInfo = BikeShareEntity.getCategoryInfo(true) //build the categoricalFeaturesInfo val model = RandomForest.trainRegressor(trainData, cateInfo, 3, "auto", impurity, maxDepth, maxBins) //train the Model with 3 trees val endTime = new DateTime() //end time val duration = new Duration(startTime, endTime) //training time //MyLogger.debug(model.toDebugString) //dump the Decision Trees (model, duration.getMillis) } def evaluateModel(validateData: RDD[LabeledPoint], model: RandomForestModel): Double = { val scoreAndLabels = validateData.map { data => var predict = model.predict(data.features) (predict, data.label) //RDD[(prediction, label)] used to compute the RMSE } val metrics = new RegressionMetrics(scoreAndLabels) val rmse = metrics.rootMeanSquaredError //use rootMeanSquaredError to get the rmse rmse }
  • 169. ! [Exercise] ◦ import Regression.zip (Package, Object, data folder, Build Path) into Scala IDE ◦ complete BikeShareRegressionRF ◦ compare the RandomForest results with the Decision Tree results 169 RandomForest Regression
  • 170. ! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ! 170 Outline
  • 171. ! Etu & udn Hadoop Competition 2016 
 ! ETU udn udn ( Open Data) 

  • 172. ! EHC task: given the browsing (View), purchase (Order) and member (Member) data from 2015/6 ~ 2015/10, predict which store (Storeid) and categories (Cat_id1, Cat_id2) each member will order from in 2015/11 172
  • 173. 1) Turn the raw Data into Features and build the LabeledPoint Data ◦ Feature: the View/Order records from June to September ◦ Label: whether an Order happened in October (1 if yes, 0 otherwise) ◦ Feature engineering: … 2) Split the LabeledPoint Data into a Training Set and a Validating Set (6:4 Split) 3) Train the Machine Learning Model with the Training Set and evaluate it with the Validating Set 4) Build the Testing Set ◦ Feature: the View/Order records from June to October ◦ Features: built the same way as in 1) 5) Use the Model from 3) to predict on the Testing Set 6) Output the predictions 7) Iterate over 1) ~ 6) 173
  • 174. ! Build the Features per uid-storeid-cat_id1-cat_id2 key from the View/Order records ! (RFM-style features over the June~September View/Order data) ◦ View – viewRecent, viewCnt, viewLast1MCnt, viewLast2MCnt (recency of the last view, total views over June~September, views in the last one and two months) ◦ Order – orderRecent, orderCnt, orderLast1MCnt, orderLast2MCnt (recency of the last order, total orders over June~September, orders in the last one and two months) ◦ avgDaySpan, avgViewCnt, lastViewDate, lastViewCnt (average interval between views, average view count, date and count of the last view) 174 – Features(I)
  • 175. ! Member features ◦ gender, ageScore, cityScore (gender as-is, age and city encoded as scores) ◦ ageScore: 1~11 ● EX: if (ages.equals("20")) ageScore = 1 ◦ cityScore: 1~24 ● EX: if (livecity.equals( )) cityScore = 24 ! Missing Value handling ◦ fill missing fields with default values ● Gender: 2 ● Ages: 35-39 ● City: 175 – Features(II)
  • 176. ! Weather features (from the Central Weather Bureau monthly data) ◦ http://www.cwb.gov.tw/V7/climate/monthlyData/mD.htm ◦ monthly statistics for June~October ◦ processed file: https://drive.google.com/file/d/0B-b4FvCO9SYoN2VBSVNjN3F3a0U/view?usp=sharing ! In total, 35 Features per uid-storeid-cat_id1-cat_id2 key 176 – Features(III)
  • 177. 177 – Building the LabeledPoint Data: Sort each numeric feature and Encode it into N levels EX: viewCnt (encoded into 5 levels by its sorted value: the highest counts map to viewCnt=5, then 4, 3, 2, and the lowest map to viewCnt=1)
  • 178. ! Xgboost (Extreme Gradient Boosting) ◦ Input: the LabeledPoint Data (Training Set) ● 35 Features ● Label (1/0; Label=1 means an order was placed, 0 means none) ◦ Parameters: ● max_depth: maximum Tree depth ● nround: number of boosting rounds ● objective: binary:logistic (binary classification) ◦ Implementation: 178 – Machine Learning(I) val param = List("objective" -> "binary:logistic", "max_depth" -> 6) val model = XGBoost.train(trainSet, param, nround, 2, null, null)
  • 179. ! Xgboost ◦ Evaluate (with the validating Set): ● val predictRes = model.predict(validateSet) ● score the predictions with the F_measure ◦ Parameter Tuning: ● try max_depth=(5~10) and nround=(10~25) and compare the scores ● best found: max_depth=6, nround=10 179 – Machine Learning(II) Precision = 0.16669166166766647 F1 measure = 0.15969926394341 Accuracy = 0.15065655700028824 Micro recall = 0.21370309951060 Micro precision = 0.3715258082813 Micro F1 measure = 0.271333885
  • 180. ! Performance Improvement ◦ keep only the top-N Features reported as important by the model and drop the remaining Features 180 – Machine Learning(III) run time: 90000ms -> 72000ms (local mode)
  • 181. ! Running on a cluster with the yarn resource manager ◦ submit the JOB to the Workers with spark-submit 181 spark-submit --class ehc.RecommandV4 --deploy-mode cluster --master yarn ehcFinalV4.jar ! When creating the SparkContext, do not hard-code the master URL new SparkContext(new SparkConf().setAppName("ehcFinal051").setMaster("local[4]")) ➔ remove the SetMaster call (let spark-submit supply the master)
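A minimal sketch of the adjusted SparkContext construction for cluster submission, assuming the same app name; the master URL then comes from the spark-submit --master flag instead of the code:
val conf = new SparkConf().setAppName("ehcFinal051") // no setMaster here
val sc = new SparkContext(conf) // master is supplied by spark-submit (e.g. --master yarn)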
  • 182. 182 Spark-submit Run Script Sample ###### Script that submits the Spark Driver Program to the cluster (yarn Manager) via Spark Submit ###### ###### for linux-like system ######### # delete output on hdfs first `hadoop fs -rm -R -f /user/team007/data/output` # submit spark job echo -e "processing spark job" spark-submit --deploy-mode cluster --master yarn --jars lib/jcommon-1.0.23.jar,lib/joda-time-2.2.jar --class ehc.RecommandV4 ehcFinalV4.jar Y # write to result_yyyyMMddHHmmss.txt echo -e "write to outFile" hadoop fs -cat /user/team007/data/output/part-* > 'result_'`date +%Y%m%d%H%M%S`'.txt'
  • 184. ! The Input is not handled on a Single Node ◦ results produced by different Workers must be merged ◦ and Sorted by the uid-storeid-cat_id1-cat_id2 key ! F-Measure ◦ used to evaluate the Model ◦ computed with Spark's MultilabelMetrics 184 – val scoreAndLabels: RDD[(Array[Double], Array[Double])] = … val metrics = new MultilabelMetrics(scoreAndLabels) println(s"F1 measure = ${metrics.f1Measure}")
  • 185. ! Takeaways ! With Spark MLlib the model training itself is the easy part ◦ most of the effort goes into Feature Engineering ! Spark MLlib ◦ 185
  • 186. 186