Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Senior Data Scientist, Cloudera
1. © Cloudera, Inc. All rights reserved.
Introduction to Apache Spark & Spark MLlib
Juliet Hougland
4.
Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory storage
5.
Apache Spark: Ecosystem
• DataFrames
• MLlib
• Streaming
6.
Spark in CDH
[Diagram: the CDH stack. Components shown: HDFS, HBase; YARN; MapReduce2, Hive, Pig, Impala, Search; Spark with PySpark, SQL, Spark Streaming, GraphX, MLlib.]
7.
Spark Execution Model
sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()
8.
What do you mean, lazy evaluation?
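One way to picture lazy evaluation is with Python generators: building the pipeline only describes the work, and nothing runs until a result is requested. The sketch below is a plain-Python stand-in, not the Spark API; `to_series`, `has_outlier`, the sample lines, and the threshold of 100 are hypothetical choices made for illustration.

```python
# Plain-Python stand-in for lazy evaluation; this is not the Spark API.
calls = []  # records every time a transformation function actually runs


def to_series(line):
    # Hypothetical helper: parse a comma-separated line into floats.
    calls.append("to_series")
    return [float(x) for x in line.split(",")]


def has_outlier(series):
    # Hypothetical helper: flag any value above an arbitrary threshold of 100.
    calls.append("has_outlier")
    return max(series) > 100.0


lines = ["1,2,3", "4,500,6", "7,8,9"]

# "Transformations": generator expressions only describe the work.
mapped = (to_series(line) for line in lines)
filtered = (s for s in mapped if has_outlier(s))
calls_before_action = len(calls)  # still 0, nothing has run yet

# "Action": consuming the pipeline forces the computation.
count = sum(1 for _ in filtered)
calls_after_action = len(calls)  # 3 to_series calls + 3 has_outlier calls

print(calls_before_action, count, calls_after_action)  # 0 1 6
```

Spark behaves analogously: transformations such as map and filter only extend the execution plan, and an action such as count triggers the job.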
9.
Transformations
• map
• flatMap
• filter
• distinct
• sample
• union
• intersection
• subtract
• cartesian
10.
Actions
• collect()
• count()
• take(num)
• takeOrdered(num)(ordering)
• reduce(function)
• aggregate(zeroValue)(seqOp, combOp)
• foreach(function)
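Of the actions listed, aggregate is the least self-explanatory. The slide shows the Scala curried form; in PySpark the call is `rdd.aggregate(zeroValue, seqOp, combOp)`. The sketch below imitates its semantics in plain Python rather than calling Spark: `seq_op` folds the elements of one partition into an accumulator, and `comb_op` merges the per-partition accumulators. The `aggregate` helper and the sample partitions are hypothetical.

```python
from functools import reduce


def aggregate(partitions, zero_value, seq_op, comb_op):
    # Fold each partition on its own, starting from zero_value ...
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # ... then merge the per-partition results.
    return reduce(comb_op, partials, zero_value)


def seq_op(acc, x):
    # Accumulator is a (running_sum, running_count) pair.
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    # Merge two accumulators coming from different partitions.
    return (a[0] + b[0], a[1] + b[1])


# Hypothetical data, already split into three partitions.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

total, n = aggregate(partitions, (0.0, 0), seq_op, comb_op)
mean = total / n
print(total, n, mean)  # 21.0 6 3.5
```

Because the accumulator type (a pair) differs from the element type (a float), this computes a mean in one pass, which reduce alone cannot express directly.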
11.
RDDs
sc.textFile("hdfs://…", 4)
  .map(to_series)
  .filter(has_outlier)
  .count()
[Diagram, built up over slides 11 to 15: HDFS → RDD (Partitions 1 to 4) → RDD (Partitions 1 to 4) → RDD (Partitions 1 to 4) → Count]
Thanks: Kostas Sakellis
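The partition picture on these slides can be imitated in plain Python: each step produces a new collection with the same four partitions, and count sums one partial count per partition. This is a sketch and not Spark code; the bodies of `to_series` and `has_outlier`, the round-robin split, and the sample lines are all assumptions, since the slides give only the function names.

```python
def to_series(line):
    # Hypothetical: parse "1,2,3" into [1.0, 2.0, 3.0].
    return [float(x) for x in line.split(",")]


def has_outlier(series):
    # Hypothetical rule: any value more than 10x the series median.
    ordered = sorted(series)
    median = ordered[len(ordered) // 2]
    return any(x > 10 * median for x in series)


def split_into_partitions(records, num_partitions):
    # Round-robin split for illustration; Spark derives its splits from the input.
    return [records[i::num_partitions] for i in range(num_partitions)]


lines = ["1,2,3", "4,500,6", "7,8,9", "2,2,90",
         "5,5,5", "1,1,1", "3,4,5", "9,9,1000"]
partitions = split_into_partitions(lines, 4)

# map and filter each yield a new list per partition; no data crosses partitions.
mapped = [[to_series(line) for line in part] for part in partitions]
filtered = [[s for s in part if has_outlier(s)] for part in mapped]

# count: one partial count per partition, summed at the end.
partial_counts = [len(part) for part in filtered]
total = sum(partial_counts)
print(partial_counts, total)  # [0, 1, 0, 2] 3
```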
16.
Complex, In-Memory Processing
[Diagram: a graph of datasets labeled A through F connected by map, groupBy, join, filter, and take operations.]
17.
Predicting Churn: Machine Learning with Spark MLlib
20.
The Dataset
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
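Before any model can be trained, each raw line has to become a typed record. The sketch below parses one line in plain Python. The column names are assumptions, since the slide shows only raw values, and the choice to keep state, area code, and phone as text is likewise illustrative.

```python
# Column names are assumptions; the slide shows only raw values.
COLUMNS = [
    "state", "account_length", "area_code", "phone", "intl_plan",
    "vmail_plan", "vmail_messages", "day_mins", "day_calls", "day_charge",
    "eve_mins", "eve_calls", "eve_charge", "night_mins", "night_calls",
    "night_charge", "intl_mins", "intl_calls", "intl_charge",
    "custserv_calls", "churned",
]
TEXT_COLUMNS = {"state", "area_code", "phone"}
YES_NO_COLUMNS = {"intl_plan", "vmail_plan"}


def parse_record(line):
    # Split on commas and trim the space that follows each comma.
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(COLUMNS):
        raise ValueError(
            "expected %d fields, got %d" % (len(COLUMNS), len(values))
        )
    record = {}
    for name, value in zip(COLUMNS, values):
        if name in TEXT_COLUMNS:
            record[name] = value
        elif name in YES_NO_COLUMNS:
            record[name] = value == "yes"
        elif name == "churned":
            # Some rows end the label with a period ("False."), some do not.
            record[name] = value.rstrip(".") == "True"
        else:
            record[name] = float(value)
    return record


row = parse_record(
    "KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, "
    "16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False."
)
print(row["state"], row["vmail_plan"], row["day_mins"], row["churned"])
# KS True 265.1 False
```

In a Spark job, a function like this would be the argument to map on the RDD of input lines.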
30.
Find the entire example at github.com/jhlch/ds-for-telco
31.
Thank You
Juliet Hougland
@j_houg
github.com/jhlch/ds-for-telco
Editor's Notes
Each of these records is a person.