Big Data Platform adopting Spark and Use Cases with Open Data

Jongwook Woo
HiPIC
CSULA
Symposium on the High-Performance
Big Data Analysis Platform 2016
Seoul, Korea
April 28 2016
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles

Contents
 Myself
 Introduction To Big Data
 Introduction To Spark
 Spark and Hadoop
 Open Data and Use Cases
 Hadoop Spark Training

Myself
Name: 우종욱, Jongwook Woo
Experience:
 Since 2002, Professor at California State Univ Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
 Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
 Since 2007: Exposed to Big Data at CitySearch.com
 2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors

Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Summer 2013: Igloo Security
– Collect, search, and analyze security log files of 30 GB – 100 GB / day
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Korea: Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg

Experience in Big Data
 Collaboration
 Council Member of IBM Spark Technology Center
 City of Los Angeles for OpenHub and Open Data
 Startup Companies in Los Angeles
 External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
 Grants
 IBM Bluemix, Microsoft Windows Azure, and Amazon AWS Research and Education Grants
 Partnership
 Academic Education Partnership with Databricks, Tableau, Qlik,
Cloudera, Hortonworks, SAS, Teradata

Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Non-/Semi-structured data
Too expensive
Need new systems
Inexpensive

Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Its own supercomputers
Published papers in 2003, 2004
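The MapReduce idea named above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Google's or Hadoop's implementation; the function names and the sample lines are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster, each line (or block) would be mapped on a different node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())

# Hypothetical sample input, for illustration only.
sample_lines = ["big data on spark", "big data on hadoop"]
counts = word_count(sample_lines)
```

Only the map and reduce functions are specific to the job; the grouping step in between is what the framework distributes across inexpensive machines.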

What is Hadoop?
 Hadoop Founder:
o Doug Cutting
 Apache Committer:
Lucene, Nutch, …

Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
–Inexpensive supercomputer
–More accessible to the public than traditional supercomputers
• You can store your data and run your applications
– In your university labs, small companies, research centers

Hadoop Cluster: Logical Diagram
[Figure: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); the cluster monitor talks to an agent on each Hadoop node; the nodes store data in HDFS and run services such as Hive, ZooKeeper, and Impala.]

The Emergence of New Tools
Battle of Nagashino
– Three-rank volley fire

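Definition: Big Data
Once again: Big Data
Predicting future value from data?
– No!
• That is only one application of Big Data, and something we have always done
– with existing computers, DWs, DBs, etc.
Big Data is a new approach that makes use of a supercomputer called Hadoop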

Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
 In-memory storage for intermediate data
 20 to 100 times faster than the network and disk access
– that MapReduce relies on

Spark
In-memory data computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystem
HDFS
 Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming model with faster data sharing
Good for complex multi-stage applications
– Iterative graph algorithms, machine learning
Interactive queries

Spark
[Figure: the Spark stack. Spark Core (RDDs, transformations, and actions) sits beneath Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib (machine learning; RDD-based matrices), GraphX (graph; RDD-based matrices), and SparkR (RDD-based matrices).]

Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Communicate with Spark workers
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–Development and Test

RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
Immutable
–RDD, DStream, SchemaRDD, PairRDD
Lineage
–History of the objects
–Automatically and efficiently re-compute lost data

RDD Operations
Transformations
Define new RDDs from the current one
–Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), saveAsTextFile()
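The difference between lazy transformations and actions, and the role of lineage, can be illustrated without Spark. The `MiniRDD` class below is a hypothetical toy written for this example, not Spark's RDD API: transformations only record how to compute the data, and each action recomputes the result from that recorded history.

```python
from itertools import islice

class MiniRDD:
    """Toy stand-in for an RDD: it stores how to compute data, not the data."""

    def __init__(self, source, lineage=("source",)):
        self._source = source    # zero-argument function that yields the elements
        self.lineage = lineage   # history of the operations that define this dataset

    # Transformations: return a new MiniRDD and compute nothing yet (lazy).
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()),
                       self.lineage + ("map",))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)),
                       self.lineage + ("filter",))

    # Actions: run the recorded computation and return values.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def take(self, n):
        return list(islice(self._source(), n))

calls = []  # records every element that square() actually processes

def square(x):
    calls.append(x)
    return x * x

numbers = MiniRDD(lambda: iter(range(1, 6)))
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has been computed at this point
result = evens.collect()          # the action triggers the computation
recomputed = evens.collect()      # a lost result can be rebuilt from the lineage
```

Because the toy class keeps no cached data, every action replays the lineage from the source. Spark can additionally keep the dataset in memory so that repeated actions avoid that recomputation.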

Programming in Spark
Scala
Functional programming
–The function is the fundamental unit of programming
• Inputs and outputs are functions
No side effects
–No state
Python
Legacy, large libraries
Java

Spark
Spark SQL
Turning an RDD into a relation
Querying using SQL
Spark Streaming
DStream
– RDDs in streaming
– Windows
• To select a range of a DStream from the streaming data
MLlib
Sparse vector support, decision trees, linear/logistic regression
SVD and PCA
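The window idea mentioned for DStreams above can be sketched in plain Python. A DStream is treated here as a sequence of micro-batches, and a window covers the most recent few batches. The function name, parameters, and sample batches are assumptions for illustration; this is not the Spark Streaming API.

```python
from collections import deque

def windowed_counts(batches, window_length, slide_interval):
    """Count the elements in each window over a sequence of micro-batches.

    window_length: how many of the most recent batches one window covers.
    slide_interval: a window result is emitted every slide_interval batches.
    """
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            results.append(sum(len(b) for b in window))
    return results

# Hypothetical micro-batches of events, for illustration only.
batches = [["a"], ["b", "c"], [], ["d", "e", "f"]]
every_batch = windowed_counts(batches, window_length=2, slide_interval=1)
every_other = windowed_counts(batches, window_length=2, slide_interval=2)
```

With a window of two batches, each result summarizes the current batch and the one before it, so recent data is counted without rescanning the whole stream.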

Scheduling Process
RDD Objects
– rdd1.join(rdd2).groupBy(…).filter(…)
– Optimizer: build operator DAG
DAGScheduler (receives the DAG)
– split graph into stages of tasks
– submit each stage as ready
– agnostic to operators!
TaskScheduler (receives a TaskSet)
– launch tasks via cluster manager
– retry failed or straggling tasks
– doesn’t know about stages
Worker (threads, block manager; receives tasks)
– execute tasks
– store and serve blocks
[Figure: pipeline from RDD Objects through the DAGScheduler and TaskScheduler, via the cluster manager, to the Worker, with a "stage failed" signal fed back up the pipeline.]

Spark
Spark
File Systems: Tachyon
Resource Manager: Mesos
However, Hadoop has been dominating the market
Integrating Spark into a Hadoop cluster
Cloud computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystem

Spark Components
Your program
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
.count()
...
[Figure: the Spark driver/client (app master) holds the RDD graph, scheduler, block tracker, and shuffle tracker. Through the cluster manager it drives the Spark worker(s), each running task threads and a block manager, over data in HDFS, HBase, Amazon S3, Couchbase, Cassandra, …]

Spark with Hadoop YARN
 ResourceManager (RM): per cluster
 Creates the Spark AM
 Allocates containers for the Spark AM
 NodeManager (NM): per node
 Spark workers
 ApplicationMaster (AM): per application
 Containers for Spark Executors
[Figure: a Spark client submits to the Resource Manager on the master node; each slave node runs a Node Manager that hosts containers for the Spark AM and the Spark Executors.]

Databricks cluster at CalStateLA

Contents
 Myself
 Introduction To Big Data
 Introduction To Spark
 Spark and Hadoop
 Open Data and Use Cases
 Use Cases
 Hadoop Spark Training

Open Data
US government
Federal, state, and city governments
Expose data to the public
US business
Twitter, Yelp, …
Expose data to the public through APIs
– Some restrictions on downloads
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with geo info

Data from Industry: Twitter Data
 Systems
Azure HDInsight Spark
8 nodes
– 40 cores: 2.4 GHz Intel Xeon
– Memory per node: 28 GB
 Data Source
Keyword ‘alphago’ from Twitter via Apache NiFi
 Data Size
 63,193 tweets
 Real-Time Data Collection Period
03/12 – 03/17/2016
– No data collected on 03/13

Top 10 Countries That Tweet “Alphago”

Top 10 Countries
 # of Tweets per Country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
Netherlands, Spain, Ukraine: > 600

Top 10 Countries Sentiment
[Figure: positive and negative tweets for the top 10 countries.]

Top 10 Countries
Most Tweeted Countries
 All countries show more positive tweets
–Korea, Japan, USA
Country   Positive   Negative
USA       5070       3567
Japan     8118       217
…
Korea     1053       407
…
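The positive/negative counts listed above can be turned into a positive share per country with a small calculation. The sketch below uses only the three rows shown on the slide; the variable and function names are illustrative assumptions.

```python
# (positive, negative) tweet counts copied from the rows shown above.
sentiment_counts = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}

def positive_share(positive, negative):
    """Fraction of sentiment-labeled tweets that are positive."""
    return positive / (positive + negative)

shares = {country: positive_share(pos, neg)
          for country, (pos, neg) in sentiment_counts.items()}

# Countries ordered from the most to the least positive share.
ranked = sorted(shares, key=shares.get, reverse=True)
```

For these rows the share is above one half in every case, which matches the observation that each country shows more positive than negative tweets.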

Daily Tweets in 03/12 – 03/17/2016
[Figure: daily tweet counts from 3/12/2016 to 3/17/2016 (y-axis from 0 to 50,000), annotated with the Alphago vs Lee Sedol matches: Game 3 on Mar 12, Game 4 on Mar 13 (Lee Se-Dol win), and Game 5 on Mar 15.]

N-gram Words
 Three words in a row right after the Go champion’s name: “sedol” and “se-dol”
3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
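One way to count the three words that follow a keyword is sketched below. It is a minimal illustration under simple assumptions (lowercasing and whitespace tokenization), not the analysis code behind the slide, and the sample tweets are invented for the example.

```python
from collections import Counter

def ngrams_after(tokens, keyword, n=3):
    """Yield the n tokens that directly follow each occurrence of keyword."""
    for i, token in enumerate(tokens):
        following = tokens[i + 1:i + 1 + n]
        if token == keyword and len(following) == n:
            yield "-".join(following)

def count_ngrams(tweets, keywords, n=3):
    """Count n-grams that follow any of the keywords across all tweets."""
    counts = Counter()
    for text in tweets:
        tokens = text.lower().split()  # assumption: plain whitespace tokenization
        for keyword in keywords:
            counts.update(ngrams_after(tokens, keyword, n))
    return counts

# Hypothetical tweets, invented for illustration; not from the collected data.
sample_tweets = [
    "Lee Sedol again to win game four",
    "can sedol again to win tomorrow",
    "Se-dol in go tournament today",
]
ngram_counts = count_ngrams(sample_tweets, ["sedol", "se-dol"])
```

Real tweets would need more careful tokenization (punctuation, hashtags, mentions) before the counts are comparable with the frequencies above.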

Sentiment Map of Alphago
[Figure: sentiment map; legend: Positive, Negative.]

Sentiment Map of Lee Se-Dol vs Alphago
 YouTube video: “alphago sentiment” by Google
 The sentiment of the world by geography and time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a

Federal Government: Airline Data Set
Government Open Data
Airline Data Set in 2012 – 2014
– US Department of Transportation
Cluster by Nillohit at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
 Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 7 GB
– Windows Server 2012 R2 Datacenter

Airline Data Set

City Government: Crime Data Set
Open Data in City of Los Angeles
Crime Data Set in 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 Cores; MEMORY: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to the last 10 years of data

Crime Data
Los Angeles 2014
Total occurrences of each crime
[Figure: pie chart. Slice sizes as extracted: 2%, 8%, 9%, 12%, 17%, 19%, 33%. Legend as extracted: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT.]

Total No. of Crimes in 2014
No. of Crimes per Month
Month    1      2      3      4      5      6      7      8      9      10     11     12
Crimes   19169  17384  19730  19413  20645  20494  21480  21280  21287  21669  19844  21355
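The monthly figures above can be summarized with a few lines of Python. The sketch assumes that the twelve values correspond to months 1 through 12 in the order listed; the variable names are illustrative.

```python
# Monthly crime counts for 2014, in the order listed above (assumed Jan..Dec).
monthly_crimes = [19169, 17384, 19730, 19413, 20645, 20494,
                  21480, 21280, 21287, 21669, 19844, 21355]

total = sum(monthly_crimes)
average = total / len(monthly_crimes)

# Month numbers (1-based) with the highest and the lowest counts.
peak_month = max(range(1, 13), key=lambda m: monthly_crimes[m - 1])
lowest_month = min(range(1, 13), key=lambda m: monthly_crimes[m - 1])
```

Under that ordering assumption, the yearly total is 243,750, with the highest count in month 10 and the lowest in month 2.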

Raw Data Projection on Map

Map of Crimes That Occurred within 5 Miles of CSULA

5 Miles from UCLA

5 Miles from USC

No. of Crimes within 5 Miles of CSULA, UCLA, and USC by Crime Type
[Figure: bar chart by crime type for csula, ucla, and usc; y-axis from 0 to 30,000.]
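Selecting records within 5 miles of a campus requires a distance between two latitude/longitude points. One common choice is the haversine great-circle formula, sketched below; the slides do not state which method was used, and the coordinates and records here are hypothetical values for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(records, center, radius_miles=5.0):
    """Keep the records whose location lies within radius_miles of center."""
    lat0, lon0 = center
    return [r for r in records
            if haversine_miles(r["lat"], r["lon"], lat0, lon0) <= radius_miles]

# Hypothetical center point and records, for illustration only.
center = (34.0, -118.0)
records = [
    {"id": 1, "lat": 34.0, "lon": -118.0},   # at the center
    {"id": 2, "lat": 34.05, "lon": -118.0},  # roughly 3.5 miles north
    {"id": 3, "lat": 34.2, "lon": -118.0},   # roughly 13.8 miles north
]
nearby = within_radius(records, center)
```

The same predicate could be applied per campus and then grouped by crime type to produce counts like those in the chart.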

Hydrogen Gas Power Plant
Prediction Model
The Cal State L.A. Hydrogen Research and Fueling
Facility (H2 Station)
opened on May 7, 2014.

Prediction Model
The station
Produces hydrogen for hydrogen vehicles
Cal State L.A. Hydrogen Research and Fueling Facility
The first station in the nation to sell hydrogen fuel to the public
Hyundai, Toyota

Prediction Model
Workflow

Prediction Model
Model by Manvi Chandra

Prediction Model
Results and observations
 Can predict Vehicle Pressure
– Pressure of hydrogen gas within the vehicle Hydrogen Storage System
– Using our model in Azure Machine Learning Studio
– Building a Spark ML model
Decision Forest Regression
– Constructs a multitude of decision trees at training time and outputs
• the mode of the classes (classification), or
• the mean prediction (regression) of the individual trees
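How a decision forest combines its trees can be shown with a small sketch: the mean of the tree outputs for regression and the most common label for classification. The trees below are hand-written stand-ins, and the feature names and numbers are invented for illustration; they are not the station's data or the trained model.

```python
from collections import Counter
from statistics import mean

def forest_regress(trees, features):
    """Regression: average the predictions of the individual trees."""
    return mean(tree(features) for tree in trees)

def forest_classify(trees, features):
    """Classification: return the most common class among the trees' votes."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical one-split "trees" predicting a pressure value (invented numbers).
regression_trees = [
    lambda f: 350.0 if f["fill_time_s"] > 100 else 300.0,
    lambda f: 360.0 if f["ambient_temp_c"] > 20 else 320.0,
    lambda f: 340.0,
]

# Hypothetical classifiers: label each regression tree's output as high or low.
classification_trees = [
    (lambda f, t=tree: "high" if t(f) > 330.0 else "low")
    for tree in regression_trees
]

sample = {"fill_time_s": 120, "ambient_temp_c": 15}  # invented feature values
predicted_pressure = forest_regress(regression_trees, sample)
predicted_class = forest_classify(classification_trees, sample)
```

In a trained forest each tree is fitted on a different random sample of the training data, which is what makes averaging their outputs more stable than relying on any single tree.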

Collaboration with City of Los Angeles
Wellness and Safety Analysis
How to improve the wellness and safety of the city
–Expand information sharing and performance metrics
–Promote and improve City employee health and wellness
–Develop and carry out the City’s safety training and injury prevention strategy

Collaboration
with City of Los Angeles (Cont’d)
Procurement Analysis
How to improve procurement of the city
–Pricing trends
–Supplier diversity
–Cost Optimization
–Invoicing/Billing/Payment Trends
–Material Optimization
–Resource/process efficiencies

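King Gwanghaegun and the Qing

Battle of Sarhu
Illustration of the Battle of Sarhu from the Manchu Veritable Records (만주실록): a battle scene between the Later Jin and Ming armies

Gang Hong-rip and the Battle of Bucha (Fuca)
Manchu Veritable Records: Manchu cavalry attacking the vanguard of Ming general Liu Ting's army, part of the Joseon-Ming allied forces

Organization of the Joseon Army
Illustration of the Joseon army from the Joseon source Chungnyeollok (충렬록, 1770-1790), fourth printed edition (정사4간본): archers with bows and musketeers with matchlock guns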

New Technology Development and Training
 Hadoop, Spark
A new supercomputer for R&D and value creation

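Why Is Hadoop and Spark Training Needed?
Needed for creating new value and for R&D
Led by the US, engineering, science, and business recognize the importance of Hadoop and Spark Big Data training
–Not only in data mining and analytics, but in every field with large-scale data
Even small and medium-sized companies can own a Hadoop cluster
–An inexpensive supercomputer
However,
nobody teaches Hadoop and Spark
Who will you learn from?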

How to Start Hadoop Training?
Limits of self-study for engineers
Time limit: it takes more than a year to become an expert
Details are not learned
Many important topics are missed
In 2014, we live in an era of experts and international competition
– This is not a 1980s university lecture room
Saving on training costs?
Company productivity decreases
Think USA!
– Training, Training, Training…

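How to Start Hadoop Training? (Cont’d)
Need to recognize the limits of self-training in the IT field
Industry, such as Silicon Valley, leads IT technology
Saving on training costs means falling behind in the Big Data industry
Industry training programs
Particular circumstances of Korea
– Government-led
• Training provided through government, experts, and companies is urgently needed
• Public institutions and schools: government-led
• Industry: self-driven
Hands-on equipment and code examples are needed to train practitioners, not theory guys
Encourage high-quality, higher-priced training rather than low-cost training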

Spark Training
California State University Los Angeles (Prof Jongwook Woo)
Supported by Databricks and its cloud computing services
UC Berkeley edX (MOOC)
UC Berkeley AMPLab AMP Camp
Stanford
Cloudera, Hortonworks, DataStax training courses
IBM Big Data University

Databricks Partners

Training Hadoop and Spark
Cloudera visits to interview Jongwook Woo

Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles

Big Data Forum at UKC 2016
Forum Chair: jwoo5@calstatela.edu
Amazon AWS, Hortonworks, Couchbase, Qlik…

Question?

References
Hadoop, http://hadoop.apache.org
Apache Spark, Word Count Example (http://spark.apache.org)
Databricks (http://www.databricks.com)
“Market Basket Analysis using Spark”, Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp. 207-209, ISSN 2225-7217, ARPN
https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
Introduction to Big Data with Apache Spark, Databricks
Stanford Spark Class (http://stanford.edu/~rezab)
Cornell University, CS5304
DS320: DataStax Enterprise Analytics with Spark
Cloudera, http://www.cloudera.com
Hortonworks, http://www.hortonworks.com
Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/

Dependency Types
“Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
– groupByKey
– join with inputs not co-partitioned
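The rule that wide (shuffle) dependencies form stage boundaries can be sketched for a simple linear chain of operations. This is a deliberate simplification: the real scheduler works on a graph of RDDs rather than a list, and a join is treated here as wide on the assumption that its inputs are not co-partitioned. The function and set names are hypothetical.

```python
NARROW = {"map", "filter", "union"}
WIDE = {"groupByKey", "join"}  # assumption: join inputs are not co-partitioned

def split_into_stages(operations):
    """Split a linear chain of operations into stages at each wide dependency."""
    stages = [[]]
    for op in operations:
        if op in WIDE:
            # A shuffle is needed, so this operation starts a new stage.
            stages.append([op])
        elif op in NARROW:
            # Narrow operations are pipelined into the current stage.
            stages[-1].append(op)
        else:
            raise ValueError(f"unknown operation: {op}")
    return [stage for stage in stages if stage]

pipeline = ["map", "filter", "groupByKey", "map", "join", "filter"]
stages = split_into_stages(pipeline)
```

Each narrow operation joins the stage it follows, so a chain such as map then filter runs as one pipelined pass over each partition without moving data between nodes.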

Scheduler Optimizations
Pipelines within a stage
– Stage 2: map, union
Stage 3: join algorithms based on partitioning (minimize shuffles)
Conceptually
Stage 1: 3 tasks
Stage 2: 4 tasks
Stage 3: 3 tasks
Total: 3 stages, 10 tasks
[Figure: RDDs A to G grouped into Stage 1, Stage 2, and Stage 3 and connected by groupBy, map, union, and join; legend: previously computed partition, Task.]
