1. Big Data Platform Adopting Spark and Use Cases with Open Data
Symposium on the High-Performance Big Data Analysis Platform 2016
Seoul, Korea
April 28, 2016
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles (CSULA)
2. Contents
Myself
Introduction To Big Data
Introduction To Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training
3. Myself
Name: 우종욱, Jongwook Woo
Experience:
Since 2002, Professor at California State Univ Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present : Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
4. Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Summer 2013, Igloo Security:
– Collect, search, and analyze security log files, 30 GB – 100 GB / day
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introduced Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
5. Experience in Big Data
Collaboration
Council Member of IBM Spark Technology Center
City of Los Angeles for OpenHub and Open Data
Startup Companies in Los Angeles
External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
Grants
Research and education grants from IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
Partnership
Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
7. Data Issues
Large-scale data
Terabytes (10^12), petabytes (10^15)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
Too expensive
Need new systems
Inexpensive
8. Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Its own supercomputers
Published papers in 2003 and 2004
9. What is Hadoop?
Hadoop founder: Doug Cutting
– Apache committer: Lucene, Nutch, …
10. Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store and process your applications
– In your university labs, small companies, research centers
16. Definition: Big Data
Once again
Big Data
Predicting future value from data
– No!
• That is just one application of Big Data, something we have always been doing
– with existing computers, data warehouses, databases, etc.
Big Data is a new approach that makes use of a supercomputer called Hadoop
18. Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
20 to 100 times faster than network- and disk-based processing
– MapReduce
19. Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
HDFS
Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New Programming with faster data sharing
Good in complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
20. Spark
[Diagram: the Spark stack]
Spark Core: RDDs, transformations, and actions
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs, DataFrames
MLlib (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices
21. Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Communicate with Spark workers
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–Development and Test
22. RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
– that can be cached in memory
Immutable
– RDD, DStream, SchemaRDD, PairRDD
Lineage
– History of the objects
– Automatically and efficiently re-compute lost data
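The lineage idea can be sketched in a few lines of plain Python. This is a toy model, not the Spark API: the class `ToyRDD` and its methods `compute` and `lose_partition` are hypothetical names invented for this illustration. Each derived dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable and aware of its lineage."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions   # set only for the root dataset
        self.parent = parent        # lineage: the dataset this one came from
        self.fn = fn                # lineage: how each element was derived
        self._cache = {}            # partition index -> computed list

    def map(self, fn):
        # Returns a NEW dataset; the current one is never modified.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self._source)
        return self.parent.num_partitions()

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if i in self._cache:
            return self._cache[i]
        if self.parent is None:
            data = list(self._source[i])
        else:
            data = [self.fn(x) for x in self.parent.compute(i)]
        self._cache[i] = data
        return data

    def lose_partition(self, i):
        # Simulate a worker failure that drops one cached partition.
        self._cache.pop(i, None)


base = ToyRDD(partitions=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

before = squared.compute(1)
squared.lose_partition(1)
after = squared.compute(1)  # rebuilt from the parent via the recorded function
```

Only the lost partition is recomputed; partition 0 is never touched. Real Spark tracks lineage per RDD in a similar spirit, with far more machinery.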
23. RDD Operations
Transformation
Define new RDDs from the current one
– Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), saveAsTextFile()
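The split between lazy transformations and eager actions can be illustrated with a small pure-Python sketch. It is an analogy only, not Spark code: `LazyDataset` is a hypothetical class written for this example. Transformations just record a step; nothing runs until an action is called.

```python
from itertools import islice


class LazyDataset:
    """Hypothetical toy: transformations are recorded, actions execute them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of ("map" | "filter", function)

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []


def double(x):
    calls.append(x)  # record every real invocation
    return x * 2


ds = LazyDataset(range(1, 6)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has been computed yet
result = ds.collect()
```

Because nothing is cached in this toy, each action re-runs the whole pipeline, which is the cost that caching an RDD in memory avoids.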
24. Programming in Spark
Scala
Functional programming
– The fundamental unit of programming is the function
• Inputs and outputs are functions
No side effects
– No state
Python
Legacy code, large libraries
Java
25. Spark
Spark SQL
Turning an RDD into a relation
Querying using SQL
Spark Streaming
DStream
– RDD in streaming
– Windows
• To select DStreams from streaming data
MLlib
Sparse vector support, decision trees, linear/logistic regression
SVD and PCA
26. Scheduling Process
RDD Objects (Optimizer): build operator DAG
– e.g. rdd1.join(rdd2).groupBy(…).filter(…)
DAGScheduler: split graph into stages of tasks; submit each stage as ready
– agnostic to operators!
TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks
– doesn’t know about stages
Worker: execute tasks (threads); store and serve blocks (block manager)
[Diagram: RDD Objects → (DAG) → DAGScheduler → (TaskSet) → TaskScheduler → (Task) → Worker, with a "stage failed" signal passed back]
28. Spark
Spark
File system: Tachyon
Resource manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into a Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems
29. Spark Components
Your program:
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
.count()
...
Spark Driver/Client (app master): RDD graph, scheduler, block tracker, shuffle tracker
Cluster manager
Spark worker(s): task threads, block manager
Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
30. Spark with Hadoop YARN
Spark Client
Master Node
ResourceManager (RM), per cluster
– Creates the Spark AM and allocates containers for the Spark AM
Slave Nodes
NodeManager (NM), per node
– Spark workers
ApplicationMaster (AM), per application
– Containers for Spark Executors
[Diagram: Resource Manager on the master node; Node Managers on the slave nodes, hosting containers for the Spark AM and the Spark Executors]
33. Open Data
US government
Federal, state, and city governments
Expose data to the public
US business
Twitter, Yelp, …
Expose data to the public through APIs
– Some restrictions on downloads
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with geo info
34. Data from Industry: Twitter Data
Systems
Azure HDInsight Spark
8 nodes
– 40 cores: 2.4 GHz Intel Xeon
– Memory per node: 28 GB
Data Source
Keyword ‘alphago’ from Twitter via Apache NiFi
Data Size
63,193 tweets
Real-time data collection period
03/12 – 03/17/2016
– No data collected on 03/13
36. Top 10 Countries
# of tweets per country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
Netherlands, Spain, Ukraine: > 600
38. Top 10 Countries
Most tweeted countries
All countries show more positive than negative tweets
– Korea, Japan, USA
Country   Positive   Negative
USA       5070       3567
Japan     8118       217
…
Korea     1053       407
…
39. Daily Tweets in 03/12 – 03/17/2016
[Chart: number of tweets per day, 3/12/2016 to 3/17/2016; y-axis 0 to 50,000]
AlphaGo vs Lee Sedol
Game 3: Mar 12
Game 4: Mar 13 (Lee Se-Dol win)
Game 5: Mar 15
40. N-gram Words
Three consecutive words right after the Go champion’s name, “sedol” or “se-dol”
3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
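A minimal sketch of this kind of count, in plain Python rather than Spark. The function name `trigrams_after`, the tokenizing regex, and the sample tweets are all assumptions made for illustration; the slides do not show the actual tokenization or code used.

```python
import re
from collections import Counter


def trigrams_after(tweets, names=("sedol", "se-dol")):
    """Count the three words that directly follow any of the given names."""
    counts = Counter()
    for text in tweets:
        # Lowercase, then keep runs of letters, digits, apostrophes and hyphens.
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        for i, token in enumerate(tokens):
            following = tokens[i + 1:i + 4]
            if token in names and len(following) == 3:
                counts["-".join(following)] += 1
    return counts


# Made-up sample tweets, not taken from the collected data set.
sample = [
    "Lee Sedol again to win against AlphaGo",
    "lee se-dol again to win",
    "Sedol in go tournament today",
    "AlphaGo beats Sedol",  # fewer than three words follow, so it is skipped
]

top = trigrams_after(sample).most_common(2)
```

On a cluster the same logic would typically be expressed as a flat-map over tweets followed by a count per key.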
42. Sentiment Map of Lee Se-Dol vs AlphaGo
YouTube video: “alphago sentiment” by Google
The sentiment of the world by geography and time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
43. Federal Government: Airline Data Set
Government Open Data
Airline data set for 2012 – 2014
– US Dept of Transportation
Cluster by Nillohit at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of data nodes: 4
– CPU: 4 cores; memory: 7 GB
– Windows Server 2012 R2 Datacenter
47. City Government: Crime Data Set
Open Data in City of Los Angeles
Crime data set for 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of data nodes: 4
– CPU: 4 cores; memory: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to the last 10 years of data
48. Crime Data, Los Angeles 2014
[Pie chart: total occurrences of each crime]
Categories: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT
Slice shares: 2%, 8%, 9%, 12%, 17%, 19%, 33%
49. Total No. of Crimes in 2014
[Bar chart: no. of crimes per month; y-axis 0 to 25,000]
Values for months 1 to 12, in chart order: 19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355
51. Map of Crimes That Occurred within 5 Miles of CSULA
52. Map of Crimes That Occurred within 5 Miles of UCLA
53. Map of Crimes That Occurred within 5 Miles of USC
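A radius filter like the one behind these maps could be sketched as follows. The slides do not say how distance was computed, so the haversine formula here is an assumption, and the center point, record layout (`lat`, `lon`, `id`), and function names are hypothetical placeholders rather than real campus coordinates or the project's code.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def within_radius(records, center, radius_miles=5.0):
    """Keep the records whose location lies within radius_miles of center."""
    lat0, lon0 = center
    return [r for r in records
            if haversine_miles(lat0, lon0, r["lat"], r["lon"]) <= radius_miles]


# Placeholder center and made-up records, for illustration only.
center = (34.0, -118.0)
records = [
    {"id": "a", "lat": 34.0, "lon": -118.0},   # at the center
    {"id": "b", "lat": 34.05, "lon": -118.0},  # roughly 3.5 miles north
    {"id": "c", "lat": 34.5, "lon": -118.0},   # roughly 34.5 miles north
]

nearby = within_radius(records, center)
```

In Hive or Spark SQL the same test would normally be a filter predicate over the latitude and longitude columns.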
54. No. of Crimes within 5 Miles of CSULA, UCLA, and USC by Crime Type
[Bar chart: number of crimes by crime type for CSULA, UCLA, and USC; y-axis 0 to 30,000]
55. Hydrogen Gas Power Plant Prediction Model
The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
56. Hydrogen Gas Power Plant Prediction Model
The station
– Produces hydrogen for hydrogen vehicles
Cal State L.A. Hydrogen Research and Fueling Facility
– The first station in the nation to sell hydrogen fuel to the public
– Hyundai, Toyota
57. Hydrogen Gas Power Plant Prediction Model
Workflow
58. Hydrogen Gas Power Plant Prediction Model
Model by Manvi Chandra
59. Hydrogen Gas Power Plant Prediction Model
Results and observations
60. Hydrogen Gas Power Plant Prediction Model
Results and observations
Can predict vehicle pressure
– Pressure of the hydrogen gas within the vehicle’s hydrogen storage system
– Using our model in Azure Machine Learning Studio
– Building a Spark ML model
Decision forest regression
– Constructs a multitude of decision trees at training time and outputs
• the mode of the classes (classification), or
• the mean prediction (regression) of the individual trees
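The aggregation step of a decision forest can be sketched in plain Python. This shows only how individual tree outputs are combined (mean for regression, mode for classification); it does no training, and the hand-written "trees", thresholds, and function names are hypothetical. It is not the Azure or Spark ML implementation.

```python
from collections import Counter
from statistics import mean


def forest_regress(trees, x):
    """Regression: average the predictions of the individual trees."""
    return mean(tree(x) for tree in trees)


def forest_classify(trees, x):
    """Classification: return the most common class among the trees (the mode)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hand-written one-split "trees"; a real forest would learn these.
regression_trees = [
    lambda x: 10.0 if x < 5 else 20.0,
    lambda x: 12.0 if x < 3 else 18.0,
    lambda x: 11.0 if x < 7 else 25.0,
]
classification_trees = [
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 2 else "low",
    lambda x: "high" if x > 9 else "low",
]

predicted_value = forest_regress(regression_trees, 4)      # mean of 10.0, 18.0, 11.0
predicted_class = forest_classify(classification_trees, 4)  # votes: low, high, low
```

Averaging many trees trained on different samples is what reduces the variance of any single tree.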
61. Collaboration with City of Los Angeles
Wellness and Safety Analysis
How to improve the wellness and safety of the city
– Expand information sharing and performance metrics
– Promote and improve City employee health and wellness
– Develop and carry out the City’s safety training and injury prevention strategy
62. Collaboration with City of Los Angeles (Cont’d)
Procurement Analysis
How to improve the city’s procurement
–Pricing trends
–Supplier diversity
–Cost Optimization
–Invoicing/Billing/Payment Trends
–Material Optimization
–Resource/process efficiencies
69. New Technology Development and Training
Hadoop, Spark
A new supercomputer for R&D and value creation
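70. Why Is Hadoop and Spark Training Needed?
Needed for creating new value and for R&D
Led by the US, engineering, science, and business recognize the importance of Hadoop/Spark Big Data training
– Not only data mining and analytics, but every field that has large-scale data
Even small and medium-sized businesses can own a Hadoop cluster
– An inexpensive supercomputer
However,
nobody teaches Hadoop and Spark
Who will you learn from?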
71. How to Start Hadoop Training?
Limits of engineers’ self-study
Time limits: more than a year to become an expert
Don’t know the details
Miss many important topics
In 2014, we live in an era of experts and international competition
– This is not a 1980s university classroom
Saving on training costs?
Reduced corporate productivity
Think USA!
– Training, Training, Training…
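72. How to Start Hadoop Training? (Cont’d)
Need to recognize the limits of self-education in IT
Industry, such as Silicon Valley, leads IT technology
Saving on training costs means falling behind in the Big Data industry
Industry training programs
Korea’s particular circumstances
– Government-led
• Training provided through government, experts, and companies is urgently needed
• Public institutions and schools: government-led
• Industry: self-driven
Hands-on equipment and code examples are needed to train practitioners, not “theory guys”
Encourage high-quality training, even if expensive, rather than low-cost training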
73. Spark Training
California State University Los Angeles (Prof Jongwook Woo)
Supported by Databricks and its cloud computing services
UC Berkeley edX (MOOC)
UC Berkeley AMP Camp
Stanford
Cloudera, Hortonworks, DataStax training courses
IBM Big Data University
75. Training Hadoop and Spark
Cloudera visited to interview Jongwook Woo
76. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
77. Big Data Forum at UKC 2016
Forum Chair: jwoo5@calstatela.edu
Amazon AWS, Hortonworks, Couchbase, Qlik…
81. Dependency Types
“Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
– groupByKey
– join with inputs not co-partitioned
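The difference between the two dependency types can be shown with a toy model in plain Python, where a dataset is just a list of partitions. The function names and the modulo partitioner are assumptions for illustration, not Spark internals. A narrow operation reads one input partition per output partition; a wide one may read from all of them.

```python
from collections import defaultdict


def narrow_map(partitions, fn):
    """Narrow dependency: output partition i depends only on input partition i."""
    return [[fn(x) for x in part] for part in partitions]


def wide_group_by_key(partitions, num_out):
    """Wide dependency: every output partition may need data from every input."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:                       # all inputs are read ...
        for key, value in part:
            out[key % num_out][key].append(value)  # ... and shuffled by key
    return [dict(d) for d in out]


# Integer keys keep the toy partitioner (key % num_out) deterministic.
parts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

mapped = narrow_map(parts, lambda kv: (kv[0], kv[1].upper()))
grouped = wide_group_by_key(mapped, num_out=2)
```

Key 1 appears in both input partitions, so its values can only be brought together by moving data between partitions, which is why a shuffle marks a stage boundary.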
83. Scheduler Optimizations
Pipelines within Stage 2
– map, union
Stage 3:
– join algorithms based on partitioning (minimize shuffles)
[Diagram: RDDs A to G across Stage 1, Stage 2, and Stage 3, connected by groupBy, map, union, and join; shaded boxes = previously computed partitions; each box is a task]