Jongwook Woo
HiPIC
CSULA
Big Data Platform adopting
Spark and Use Cases with
Open Data
Symposium on the High-Performance
Big Data Analysis Platform 2016
Seoul, Korea
April 28 2016
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
➢ Myself
➢ Introduction To Big Data
➢ Introduction To Spark
➢ Spark and Hadoop
➢ Open Data and Use Cases
➢ Hadoop Spark Training
Myself
Name: 우종욱, Jongwook Woo
Experience:
➢ Since 2002: Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
➢ Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
➢ Since 2007: Exposed to Big Data at CitySearch.com
➢ 2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Summer 2013: Igloo Security
– Collect, search, and analyze security log files of 30GB – 100GB / day
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introduce Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
Experience in Big Data
➢ Collaboration
➢ Council Member of IBM Spark Technology Center
➢ City of Los Angeles for OpenHub and Open Data
➢ Startup companies in Los Angeles
➢ External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
➢ Grants
➢ IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
➢ Partnership
➢ Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
Introduction To Big Data
Data Issues
Large-scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games…
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Its own supercomputers
Published papers in 2003 and 2004
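To make the "how to compute" half concrete, here is a minimal single-process sketch of the MapReduce idea (map, shuffle, reduce) applied to word counting. It is plain Python, not Hadoop code; the function names and the sample input are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical sample input, only for illustration.
counts = word_count(["big data", "Big Data platform"])
```

Because each mapper looks at one line and each reducer at one key, both phases can be spread over inexpensive machines, which is the point the slide makes.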
What is Hadoop?
➢ Hadoop Founder:
o Doug Cutting
➢ Apache Committer:
Lucene, Nutch, …
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– More accessible to the public than traditional supercomputers
• You can store and process your applications
– In your university labs, small companies, research centers
Hadoop Cluster: Logical Diagram
[Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); the cluster monitor talks to an agent on each Hadoop node; the nodes run HDFS, along with Hive, ZooKeeper, and Impala.]
The Emergence of New Tools
Battle of Nagashino
Three-rank volley fire
Definition: Big Data
Once again
Big Data
Predicting future value from data
– No!
• That is just one application of Big Data, something we have always been doing
– With existing computers, DW, DB, etc.
Big Data is a new approach that makes use of the supercomputer called Hadoop
Introduction To Spark
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
➢ In-memory storage for intermediate data
➢ 20 ~ 100 times faster than network and disk
– MapReduce
Spark
In-memory data computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
HDFS
➢ Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming model with faster data sharing
Good for complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query
Spark
[Diagram: the Spark stack. Spark Core (RDDs, transformations, and actions) sits underneath Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib (machine learning; RDD-based matrices), GraphX (graph; RDD-based matrices), and SparkR (RDD-based matrices).]
Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Communicate with Spark workers
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–Development and Test
RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
–that can be cached in memory
Immutable
–RDD, DStream, SchemaRDD, PairRDD
Lineage
–History of the objects
–Automatically and efficiently re-compute lost
data
RDD Operations
Transformation
Define new RDDs from the current
–Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), saveAsTextFile()
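The split between lazy transformations and eager actions can be sketched without a cluster. The toy class below is plain Python and only mimics the behaviour described above; `MiniRDD` and its methods are hypothetical names, not the Spark API. Each transformation returns a new immutable object that remembers its lineage, and nothing is computed until an action is called.

```python
from itertools import islice

class MiniRDD:
    """Toy stand-in for an RDD: immutable, lazily evaluated, with recorded lineage."""

    def __init__(self, source, lineage):
        self._source = source          # callable that re-creates this dataset's items
        self.lineage = tuple(lineage)  # names of the steps that produced this dataset

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items), ("parallelize",))

    # Transformations: build a new MiniRDD; no element is processed yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()), self.lineage + ("map",))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)), self.lineage + ("filter",))

    # Actions: replay the lineage and return values to the caller.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def take(self, n):
        return list(islice(self._source(), n))

calls = []

def traced_square(x):
    calls.append(x)   # records when the function is actually applied
    return x * x

squares = MiniRDD.parallelize(range(5)).map(traced_square)  # transformation only
calls_before_action = list(calls)                           # nothing has run yet
evens = squares.filter(lambda v: v % 2 == 0)
result = evens.collect()                                    # the action triggers the work
```

Because every action replays the recorded steps from the source, lost results can be rebuilt from lineage alone, which is the recovery idea described on the RDD slide; real Spark adds partitioning and in-memory caching on top.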
Programming in Spark
Scala
Functional programming
– The fundamental unit of programming is the function
• Inputs and outputs are functions
No side effects
– No state
Python
Legacy, large libraries
Java
Spark
Spark SQL
Turning an RDD into a relation
Querying using SQL
Spark Streaming
DStream
– RDD in streaming
– Windows
• To select DStreams from streaming data
MLlib
Sparse vector support, decision trees, linear/logistic regression
SVD and PCA
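The "Windows" bullet under Spark Streaming can be illustrated with a small sketch. It assumes a stream that has already been cut into micro-batches (one list per batch interval) and shows how a sliding window selects the most recent batches. This is plain Python rather than the Spark Streaming API; the function name and parameters are hypothetical.

```python
def windowed(batches, window_length, slide_interval):
    """For every slide, yield the items of the last `window_length` micro-batches."""
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        yield [item for batch in batches[start:end] for item in batch]

# Four hypothetical micro-batches; the third one happens to be empty.
batches = [["a"], ["b", "c"], [], ["d"]]
windows = list(windowed(batches, window_length=3, slide_interval=2))
```

Here the window length and slide interval are counted in batches, so consecutive windows overlap whenever the window is longer than the slide.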
Scheduling Process
rdd1.join(rdd2)
.groupBy(…)
.filter(…)
[Diagram: RDD Objects → DAGScheduler → TaskScheduler → Worker]
RDD Objects (Optimizer): build operator DAG
DAGScheduler: split graph into stages of tasks; submit each stage as ready; agnostic to operators
TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks; doesn’t know about stages
Worker: execute tasks; store and serve blocks (threads, block manager)
Passed along the pipeline: DAG, TaskSet, Task; a failed stage is reported back
Spark and Hadoop
Spark
Spark
File system: Tachyon
Resource manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into a Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems
Spark Components
Your program:
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
.count()
...
[Diagram: the Spark Driver/Client (app master) holds the RDD graph, scheduler, block tracker, and shuffle tracker; it works through the cluster manager with the Spark worker(s); each worker runs task threads and a block manager and reads from HDFS, HBase, Amazon S3, Couchbase, Cassandra, …]
Spark with Hadoop YARN
➢ ResourceManager (RM): per cluster
– Creates the Spark AM and allocates containers for the Spark AM
➢ NodeManager (NM): per node
– Spark workers
➢ ApplicationMaster (AM): per application
– Containers for Spark Executors
[Diagram: a Spark client, a master node running the ResourceManager, and slave nodes whose NodeManagers host containers for the Spark AM and the Spark Executors.]
Databricks cluster at CalStateLA
Open Data and Use Cases
Open Data
USA government
Federal, State, City governments
Expose data to the public
USA business
Twitter, Yelp, …
Expose data to the public with APIs
– Some restrictions on downloads
City government
New York
– Taxi, Uber, …
Los Angeles
– Open Data, Open Hub with geo info
Data from Industry: Twitter Data
➢ Systems
Azure HDInsight Spark
8 nodes
– 40 cores: 2.4GHz Intel Xeon
– Memory per node: 28 GB
➢ Data Source
Keyword ‘alphago’ from Twitter via Apache NiFi
➢ Data Size
➢ 63,193 tweets
➢ Real-time data collection period
03/12 – 03/17/2016
– No data collected on 03/13
Top 10 Countries that Tweet “Alphago”
Top 10 Countries
➢ # of Tweets per Country
USA: > 11,000
Japan: > 9,000
Korea: > 1,900
Russia, UK: > 1,600
Thailand, France: > 1,000
➢ Netherlands, Spain, Ukraine: > 600
Top 10 Countries Sentiment
Positive Negative
Top 10 Countries
Most Tweeted Countries
➢ All countries show more positive tweets
– Korea, Japan, USA
Country   Positive   Negative
USA       5070       3567
Japan     8118       217
…
Korea     1053       407
…
Daily Tweets in 03/12 – 03/17/2016
[Chart: number of tweets per day from 3/12/2016 to 3/17/2016; y-axis from 0 to 50,000.]
Alphago vs Lee Sedol
Game 3: Mar 12
Game 4: Mar 13 (Lee Se-Dol win)
Game 5: Mar 15
N-gram Words
➢ 3 words in a row right after the Go champion’s name, “sedol” or “se-dol”
sedol
➢ se-dol
3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
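The 3-gram counts above come from looking at the three words that follow the champion's name in each tweet. A minimal sketch of that extraction is shown below, assuming simple lowercase tokenization. The sample tweets are invented for illustration, the function names are hypothetical, and the original analysis ran on Spark rather than in plain Python.

```python
import re
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}  # the two spellings named on the slide

def tokenize(text):
    """Lowercase the text and keep word-like tokens, including inner hyphens and apostrophes."""
    return re.findall(r"[a-z0-9][a-z0-9'-]*", text.lower())

def trigrams_after_keyword(tweets, keywords=KEYWORDS):
    """Count the 3 words that immediately follow any keyword occurrence."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        for i, token in enumerate(tokens):
            following = tokens[i + 1:i + 4]
            if token in keywords and len(following) == 3:
                counts["-".join(following)] += 1
    return counts

# Invented sample tweets, not taken from the collected data set.
sample = [
    "Lee Sedol again to win against AlphaGo",
    "Se-dol again to win?",
    "sedol in go tournament today",
]
top = trigrams_after_keyword(sample).most_common(2)
```

On a cluster the same per-tweet function would typically be applied in a map step, with the counts merged in a reduce step.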
Sentiment Map of Alphago
Positive
Negative
Sentiment Map of Lee Se-Dol vs Alphago
➢ YouTube video: “alphago sentiment” by Google
➢ The sentiment of the world in geo and time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
Federal Government: Airline Data Set
Government Open Data
Airline Data Set in 2012 – 2014
– US Dept of Transportation
Cluster by Nillohit at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
➢ Number of Data Nodes: 4
– CPU: 4 cores; Memory: 7 GB
– Windows Server 2012 R2 Datacenter
Airline Data Set
City Government: Crime Data Set
Open Data in City of Los Angeles
Crime Data Set in 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
Microsoft Azure using Hive and Spark SQL
Number of Data Nodes: 4
– CPU: 4 cores; Memory: 14 GB
– Windows Server 2012 R2 Datacenter
– Extending to the last 10 years of the data set
Crime Data
Los Angeles 2014
[Pie chart: total occurrences of each crime. Categories: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT. Shares shown: 2%, 8%, 9%, 12%, 17%, 19%, 33%.]
Total No. of Crimes in 2014
[Bar chart: No. of crimes per month, months 1 to 12; y-axis from 0 to 25,000.]
Data labels, in the order shown: 19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355
Raw Data Projection on Map
Mapping of Crimes Occurred within 5 miles of CSULA
Mapping of Crimes Occurred within 5 miles of UCLA
Mapping of Crimes Occurred within 5 miles of USC
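One way to select the crimes within 5 miles of a campus is a great-circle (haversine) distance filter, sketched below in plain Python. The slides do not say how the radius filter was implemented, so this is an assumed approach; the campus coordinate is an approximate placeholder and the sample records are invented.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def crimes_within(crimes, center, radius_miles=5.0):
    """Keep only the records whose location lies within the radius of the center point."""
    lat0, lon0 = center
    return [c for c in crimes
            if haversine_miles(lat0, lon0, c["lat"], c["lon"]) <= radius_miles]

# Approximate, assumed campus coordinate; not taken from the slides.
CSULA = (34.066, -118.168)

# Invented sample records: one close to the campus, one far to the west.
sample_crimes = [
    {"id": "a", "lat": 34.070, "lon": -118.170},
    {"id": "b", "lat": 34.070, "lon": -118.445},
]
nearby = crimes_within(sample_crimes, CSULA)
```

The same predicate could be expressed as a filter over the crime table in Hive or Spark SQL, with one center point per campus.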
No. of crimes within 5 miles of CSULA, UCLA, and USC by crime type
[Bar chart: crime counts by type for csula, ucla, and usc; y-axis from 0 to 30,000.]
Hydrogen Gas Power Plant Prediction Model
The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
The station
Produces hydrogen for hydrogen vehicles
Cal State L.A. Hydrogen Research and Fueling Facility
The first station in the nation to sell hydrogen fuel to the public.
Hyundai, Toyota
Workflow
Model by Manvi Chandra
Results and observations
➢ Can predict Vehicle Pressure
– Pressure of hydrogen gas within the vehicle Hydrogen Storage System
– Using our model in Azure Visual Studio ML
– Building Spark ML
Decision Forest Regression
– Constructs a multitude of decision trees at training time
– Outputs
• the mode of the classes (classification)
• the mean prediction (regression) of the individual trees.
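The aggregation step of a decision forest (mean of the trees for regression, mode of the votes for classification) can be sketched as follows. Training is omitted: the "trees" are hand-written stand-ins, and the feature names `temp` and `flow` are hypothetical, not fields of the hydrogen station data set.

```python
from collections import Counter
from statistics import mean

def forest_regress(trees, x):
    """Regression: average the predictions of the individual trees."""
    return mean(tree(x) for tree in trees)

def forest_classify(trees, x):
    """Classification: return the class predicted by the most trees (the mode)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hand-written depth-1 "trees", standing in for trained ones.
regression_trees = [
    lambda x: 10.0 if x["temp"] < 20 else 30.0,
    lambda x: 20.0 if x["flow"] < 5 else 40.0,
    lambda x: 40.0,
]
classification_trees = [
    lambda x: "low" if x["temp"] < 20 else "high",
    lambda x: "low" if x["flow"] < 5 else "high",
    lambda x: "high",
]

reading = {"temp": 25, "flow": 3}   # hypothetical input record
predicted_value = forest_regress(regression_trees, reading)
predicted_class = forest_classify(classification_trees, reading)
```

In a trained forest each tree is fit on a different sample of the data, so averaging or voting smooths out the errors of any single tree.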
Collaboration with City of Los Angeles
Wellness and Safety Analysis
How to improve wellness and the safety of the city
– Expand information sharing and performance metrics
– Promote and improve City employee health and wellness
– Develop and carry out the City’s safety training and injury prevention strategy
Collaboration
with City of Los Angeles (Cont’d)
Procurement Analysis
How to improve procurement of the city
–Pricing trends
–Supplier diversity
–Cost Optimization
–Invoicing/Billing/Payment Trends
–Material Optimization
–Resource/process efficiencies
Hadoop Spark Training
Gwanghaegun and the Qing
Battle of Sarhu
Illustration of the Battle of Sarhu from the Manchu Veritable Records (만주실록): a battle scene between the Later Jin and the Ming army
Gang Hong-rip and the Battle of Bucha (Fucha)
Manchu Veritable Records (만주실록): Manchu cavalry attacking the vanguard of the Ming general Liu Ting’s army in the Joseon-Ming allied force
Organization of the Joseon Army
Illustration of the Joseon army from the Joseon source Chungnyeollok (충렬록, 1770-1790; 정사4간본): archers carrying bows and gunners carrying matchlocks
New Technology Development and Education
➢ Hadoop, Spark
A new supercomputer for R&D and value creation
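Why Is Hadoop and Spark Training Needed?
Needed for creating new value and for R&D
Led by the US, engineering, science, and business recognize the importance of Hadoop and Spark Big Data training
– Not only data mining and analytics, but every field that has large-scale data
Even small and medium-sized companies can own a Hadoop cluster
– An inexpensive supercomputer
However,
nobody teaches Hadoop and Spark
Who should you get trained by?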
How to Start Hadoop Training?
Limits of self-study for engineers
Time limits: more than a year to become an expert
Do not know the details
Miss many important topics
In 2014, we live in an era of experts and international competition
– This is not a 1980s university classroom
Saving on training costs?
Reduces corporate productivity
Think USA!
– Training, Training, Training…..
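How to Start Hadoop Training? (Cont’d)
Need to recognize the limits of learning on your own in IT
Industry, such as Silicon Valley, leads IT technology
Saving on training costs means falling behind in the Big Data industry
Industry training programs
Korea’s particular circumstances
– Government-led
• Training provided through government, experts, and companies is urgently needed
• Public institutions and schools: government-led
• Industry: self-driven
Hands-on equipment and code examples are needed to train practitioners, not “theory guys”
Encourage high-priced, high-quality training rather than low-cost training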
Spark Training
California State University Los Angeles (Prof Jongwook Woo)
Supported by Databricks and its cloud computing services
UC Berkeley edX (MOOC)
UC Berkeley AMPLab AMP Camp
Stanford
Cloudera, Hortonworks, DataStax training courses
IBM Big Data University
Databricks Partners
Training Hadoop and Spark
Cloudera visits to interview Jongwook Woo
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Big Data Forum at UKC 2016
Forum Chair: jwoo5@calstatela.edu
Amazon AWS, Hortonworks, Couchbase, Qlik…
Questions?
Dependency Types
“Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
– groupByKey
– join with inputs not co-partitioned
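The rule that narrow dependencies are pipelined while wide dependencies mark stage boundaries can be sketched for the simplest case, a linear chain of operations. This is a plain-Python illustration and not Spark's DAGScheduler, which works on a full dependency graph; the labels `join_copartitioned` and `join_not_copartitioned` are hypothetical names for the two join cases listed above.

```python
NARROW = {"map", "filter", "union", "join_copartitioned"}
WIDE = {"groupByKey", "join_not_copartitioned"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting at every wide dependency."""
    stages, current = [], []
    for op in ops:
        if op in WIDE:
            # A shuffle closes the pipeline built so far; the wide op starts a new stage.
            if current:
                stages.append(current)
            current = [op]
        elif op in NARROW:
            # Narrow ops are pipelined into the current stage.
            current.append(op)
        else:
            raise ValueError(f"unknown operation: {op}")
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(
    ["map", "filter", "groupByKey", "map", "join_not_copartitioned", "filter"]
)
```

Each inner list is one stage whose operations can run back to back on the same node without moving data across the network.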
Scheduler Optimizations
Pipelines within Stage 2: map, union
Stage 3: join algorithms based on partitioning (minimize shuffles)
[Diagram: RDDs A to G connected by groupBy, map, union, and join, grouped into Stage 1, Stage 2, and Stage 3; the legend marks previously computed partitions and tasks.]
Conceptually
Stage 1: 3 tasks
Stage 2: 4 tasks
Stage 3: 3 tasks
Total: 3 stages, 10 tasks

 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
เทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษาเทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษา
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
 

Big Data Platform adopting Spark and Use Cases with Open Data

  • 1. Big Data Platform adopting Spark and Use Cases with Open Data. Symposium on the High-Performance Big Data Analysis Platform 2016, Seoul, Korea, April 28, 2016. Jongwook Woo, PhD, jwoo5@calstatela.edu, High-Performance Information Computing Center (HiPIC), California State University Los Angeles (CSULA).
  • 2. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training.
  • 3. Myself. Name: 우종욱, Jongwook Woo. Experience: Professor at California State University Los Angeles since 2002 (PhD in Computer Science and Engineering at USC, 2001). R&D consulting in Hollywood since 1998: Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others; information search and integration with FAST, Lucene/Solr, and Sphinx; implemented eBusiness applications using J2EE and middleware. Exposed to Big Data at CitySearch.com since 2007. Big Data academic partnerships for research and training, 2012 to present: Amazon AWS, Microsoft Azure, IBM Bluemix, Databricks, and Hadoop vendors.
  • 4. Myself: Experience (cont'd). Bringing Big Data R&D and training to Korea since 2009. Summer 2013, Igloo Security: collected, searched, and analyzed security log files of 30 GB to 100 GB per day using Hadoop, Solr, Java, and Cloudera. Sept 2013: Samsung Advanced Technology Training Institute. Since 2008: introduced Hadoop Big Data and education to universities and research centers. Korea: Yonsei, Gachon. US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB. Europe: Univ of Luxembourg.
  • 5. Experience in Big Data. Collaboration: Council Member of the IBM Spark Technology Center; City of Los Angeles for OpenHub and Open Data; startup companies in Los Angeles; external collaborator and advisor in Big Data for IMSC of USC and Pennsylvania State University. Grants: research and education grants from IBM Bluemix, Microsoft Windows Azure, and Amazon AWS. Partnership: academic education partnerships with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, and Teradata.
  • 6. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training.
  • 7. Data Issues. Large-scale data: terabytes (10^12 bytes) and petabytes (10^15 bytes), driven by the web, sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games… The legacy approach cannot handle it: the data is too big, non-/semi-structured, and too expensive to process. New, inexpensive systems are needed.
  • 8. Two Cores in Big Data: how to store Big Data and how to compute Big Data. Google's answers: for storage, GFS, a distributed system on inexpensive commodity computers; for computation, MapReduce, parallel computing with inexpensive computers. Google built its own supercomputers this way and published the papers in 2003 and 2004.
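The MapReduce idea named on this slide can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's API; the function names and the input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


# Hypothetical input; in a real cluster each line could live on a different node.
lines = ["big data on hadoop", "spark and hadoop", "big data with spark"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network; the sketch only shows the data flow.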
  • 9. What is Hadoop? Hadoop founder: Doug Cutting, Apache committer (Lucene, Nutch, …).
  • 10. Definition: Big Data. Inexpensive frameworks that can store large-scale data and process it faster in parallel. Hadoop is an inexpensive supercomputer, more accessible to the public than traditional supercomputers: you can store data and run your applications in university labs, small companies, and research centers.
  • 11. Hadoop Cluster: Logical Diagram. [Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); each cluster node runs a monitoring agent alongside Hadoop and HDFS; services shown include Hive, ZooKeeper, and Impala.]
  • 12. The Emergence of New Tools
  • 13. The Emergence of New Tools: the Battle of Nagashino
  • 14. The Battle of Nagashino
  • 15. The Battle of Nagashino: three-rank volley fire
  • 16. Definition: Big Data, once again. Is Big Data about predicting future value from data? No! That is just one application of Big Data, and something we have always done with existing computers, data warehouses, and databases. Big Data is a new approach that uses the supercomputer called Hadoop.
  • 17. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training.
  • 18. Alternative to Hadoop MapReduce. Limitations of MapReduce: hard to program in Java; batch processing, not interactive; disk storage for intermediate data, which causes performance issues. Spark, from the UC Berkeley AMPLab: in-memory storage for intermediate data; 20 to 100 times faster than MapReduce, which relies on network and disk.
  • 19. Spark: in-memory data computing. Faster than Hadoop MapReduce. Can integrate with Hadoop and its ecosystem: HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase… New programming model with faster data sharing. Good for complex multi-stage applications such as iterative graph algorithms and machine learning. Supports interactive queries.
  • 20. Spark stack. [Diagram: Spark Core (RDDs, transformations, and actions) at the base; on top of it, Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib (machine learning; RDD-based matrices), GraphX (graph; RDD-based matrices), and SparkR (RDD-based matrices).]
  • 21. Spark Drivers and Workers. Driver: the client, which holds the SparkContext and communicates with the Spark workers. Workers: run Spark executors, either on cluster nodes (production) or in local threads (development and test).
  • 22. RDD: Resilient Distributed Dataset. Distributed collections of objects that can be cached in memory. Immutable: RDD, DStream, SchemaRDD, PairRDD. Lineage: the history of the objects, used to automatically and efficiently recompute lost data.
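The lineage idea on this slide can be illustrated with a toy class in plain Python. This is not Spark's RDD implementation; `LineageDataset` and its methods are hypothetical names, and the sketch assumes a single process where "losing data" simply means dropping a cached result.

```python
class LineageDataset:
    """Toy stand-in for an RDD: immutable, and it remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # raw data, only set on the root dataset
        self._parent = parent    # dataset this one was derived from
        self._fn = fn            # function applied to each parent element
        self._cache = None       # materialized result, may be lost

    def map(self, fn):
        # A new dataset is created; the existing one is never modified.
        return LineageDataset(parent=self, fn=fn)

    def compute(self):
        # Rebuild from the parent (the lineage) whenever the cache is missing.
        if self._cache is None:
            if self._parent is None:
                self._cache = tuple(self._source)
            else:
                self._cache = tuple(self._fn(x) for x in self._parent.compute())
        return self._cache

    def lose_cache(self):
        # Simulate losing the in-memory copy, e.g. after a node failure.
        self._cache = None


base = LineageDataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = plus_one.compute()
plus_one.lose_cache()
doubled.lose_cache()
recomputed = plus_one.compute()   # rebuilt by replaying the recorded steps
print(first, recomputed)
```

Because each dataset only stores a recipe (parent plus function), a lost result can be rebuilt without replicating the data itself, which is the property the slide attributes to lineage.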
  • 23. RDD Operations. Transformations define new RDDs from the current one and are lazy (not computed immediately): map(), filter(), join(). Actions return values: count(), collect(), take(), save().
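The difference between lazy transformations and eager actions can be shown with a small plain-Python analogy. This is a sketch built on generators, not the Spark API; `LazyDataset` and `traced_square` are hypothetical names that only borrow the method names listed on the slide.

```python
import itertools


class LazyDataset:
    """Toy dataset whose transformations are lazy and whose actions trigger work."""

    def __init__(self, make_iter):
        self._make_iter = make_iter   # zero-argument function returning a fresh iterator

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._make_iter()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    # Actions: run the whole chain and return a value.
    def count(self):
        return sum(1 for _ in self._make_iter())

    def collect(self):
        return list(self._make_iter())

    def take(self, n):
        return list(itertools.islice(self._make_iter(), n))


calls = []


def traced_square(x):
    calls.append(x)   # record every time the function really runs
    return x * x


squares = LazyDataset.from_list([1, 2, 3, 4, 5]).map(traced_square)
even_squares = squares.filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # transformations alone have not run anything

result = even_squares.collect()    # the action triggers the computation
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

Deferring work until an action is requested lets an engine see the whole chain before running it, which is what makes pipelining and scheduling decisions possible.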
  • 24. Programming in Spark. Scala: functional programming, where the function is the fundamental unit (inputs and outputs are functions) and there are no side effects and no state. Python: legacy code, large libraries. Java.
  • 25. Spark libraries. Spark SQL: turn an RDD into a relation and query it using SQL. Spark Streaming: DStream, an RDD in streaming; windows, to select DStreams from streaming data. MLlib: sparse vector support, decision trees, linear/logistic regression, SVD and PCA.
  • 26. Scheduling Process. Example: rdd1.join(rdd2).groupBy(…).filter(…). RDD objects / optimizer: build the operator DAG. DAGScheduler: splits the graph into stages of tasks and submits each stage as it becomes ready; it is agnostic to operators. TaskScheduler: receives each TaskSet, launches tasks via the cluster manager, and retries failed or straggling tasks; it doesn't know about stages and reports a failed stage back. Worker: threads execute tasks; the block manager stores and serves blocks.
  • 27. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training.
  • 28. Spark. Spark's own stack: file system, Tachyon; resource manager, Mesos. But Hadoop has been dominating the market, so Spark is being integrated into Hadoop clusters. Cloud computing: Amazon AWS, Azure HDInsight, IBM Bluemix (object storage, S3). Hadoop vendors: HDP, CDH. Databricks: Spark on AWS, with no Hadoop ecosystem.
  • 29. Spark Components. Your program: sc = new SparkContext; f = sc.textFile(“…”); f.filter(…).count(); ... Spark driver/client (app master): RDD graph, scheduler, block tracker, shuffle tracker. Cluster manager. Spark worker(s): task threads and block manager. Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
  • 30. Spark with Hadoop YARN. ResourceManager (RM), one per cluster: creates the Spark AM and allocates containers for it. NodeManager (NM), one per node: Spark workers. ApplicationMaster (AM), one per application: containers for Spark executors. [Diagram: Spark client; master node with the Resource Manager; slave nodes with Node Managers hosting the Spark AM and Spark executor containers.]
  • 31. Databricks cluster at CalStateLA
  • 32. Contents: Myself; Introduction to Big Data; Introduction to Spark; Spark and Hadoop; Open Data and Use Cases (Use Cases); Hadoop Spark Training.
  • 33. Open Data. US government: federal, state, and city governments expose data to the public. US businesses: Twitter, Yelp, … expose data to the public through APIs, with some restrictions on downloads. City governments: New York (taxi, Uber, …); Los Angeles (Open Data, Open Hub with geo info).
  • 34. Data from Industry: Twitter Data. System: Azure HDInsight Spark, 8 nodes, 40 cores (2.4 GHz Intel Xeon), 28 GB memory per node. Data source: keyword ‘alphago’ from Twitter via Apache NiFi. Data size: 63,193 tweets. Real-time data collection period: 03/12 – 03/17/2016, with no data collected on 03/13.
  • 35. Top 10 Countries that Tweet “Alphago”
  • 36. Top 10 Countries: number of tweets per country. USA: > 11,000; Japan: > 9,000; Korea: > 1,900; Russia, UK: > 1,600; Thailand, France: > 1,000; Netherlands, Spain, Ukraine: > 600
  • 37. Top 10 Countries Sentiment (chart; legend: Positive, Negative)
  • 38. Top 10 Countries: Most Tweeted Countries
    – All countries show more positive than negative tweets, for example Korea, Japan, USA
    Country | Positive | Negative
    USA     | 5070     | 3567
    Japan   | 8118     | 217
    …
    Korea   | 1053     | 407
    …
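The positive and negative counts on slide 38 can be turned into a positive share per country. The following is a minimal sketch in plain Python, not the Spark job used for the talk; it uses only the three rows that the slide lists, and the function name `positive_share` is a hypothetical helper.

```python
# Counts copied from slide 38; only the three listed countries are included.
sentiment_counts = {
    "USA": {"positive": 5070, "negative": 3567},
    "Japan": {"positive": 8118, "negative": 217},
    "Korea": {"positive": 1053, "negative": 407},
}


def positive_share(counts):
    """Return the fraction of positive tweets among positive + negative ones."""
    total = counts["positive"] + counts["negative"]
    return counts["positive"] / total


shares = {country: positive_share(c) for country, c in sentiment_counts.items()}

for country, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {share:.1%} positive")
```

A share above 0.5 means the country posted more positive than negative tweets, which is the claim the slide makes for the listed countries.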
  • 39. Daily Tweets, 03/12 – 03/17/2016 (bar chart of tweets per day; y-axis 0 to 50,000). Alphago vs Lee Sedol: Game 3 on Mar 12; Game 4 on Mar 13, Lee Se-Dol wins; Game 5 on Mar 15
  • 40. N-gram words: the 3 words in a row right after the Go champion’s name, “sedol” or “se-dol”
    3-gram            | Frequency
    Again-to-win      | 1,187
    Is-something-I’ll | 369
    Is-something-i    | 199
    In-go-tournament  | 168
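The 3-gram counting on slide 40 can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Spark code behind the slide: the tokenizer regex, the function name `trigrams_after`, and the sample tweets are all made up for the example.

```python
import re
from collections import Counter


def trigrams_after(tweets, keywords=("sedol", "se-dol")):
    """Count the three words that directly follow a keyword in each tweet."""
    counts = Counter()
    for text in tweets:
        # Lowercase and keep letters, digits, apostrophes, and hyphens,
        # so that "Se-dol" stays a single token.
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        for i, token in enumerate(tokens):
            following = tokens[i + 1:i + 4]
            if token in keywords and len(following) == 3:
                counts["-".join(following)] += 1
    return counts


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Lee Sedol again to win against AlphaGo",
    "Lee Se-dol again to win game four",
    "AlphaGo beats Sedol in go tournament today",
]
trigram_counts = trigrams_after(sample_tweets)
print(trigram_counts.most_common())
```

In Spark the same idea would typically be a flatMap over tweets followed by a count per 3-gram; the per-tweet logic stays the same.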
  • 41. Sentiment Map of Alphago (map; legend: Positive, Negative)
  • 42. Sentiment Map of Lee Se-Dol vs Alphago
    – YouTube video: “alphago sentiment” by Google
    – The sentiment of the world by location and time: https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
  • 43. Federal Government: Airline Data Set
    – Government open data: airline data set for 2012 – 2014 from the US Dept of Transportation
    – Cluster by Nillohit at HiPIC, CSULA: Microsoft Azure using Hive and Spark SQL
    – Number of data nodes: 4; CPU: 4 cores; memory: 7 GB; Windows Server 2012 R2 Datacenter
  • 44. Airline Data Set
  • 45. Airline Data Set
  • 46. Airline Data Set
  • 47. City Government: Crime Data Set
    – Open data from the City of Los Angeles: crime data set for 2014
    – By Ram Dharan and Sridhar Reddy at HiPIC, CSULA: Microsoft Azure using Hive and Spark SQL
    – Number of data nodes: 4; CPU: 4 cores; memory: 14 GB; Windows Server 2012 R2 Datacenter
    – Extending to the last 10 years of the data set
  • 48. Crime Data Los Angeles 2014: total occurrences of each crime (pie chart). Categories: CRIMINAL, VANDALISM, OTHERS, BURGLARY, ASSAULT, TRAFFIC, THEFT. Slice shares in ascending order: 2%, 8%, 9%, 12%, 17%, 19%, 33%
  • 49. Total No. of Crimes in 2014: number of crimes per month, months 1 to 12 (bar chart): 19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355
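The monthly figures on slide 49 can be summarized with a few lines of plain Python. This sketch assumes the twelve values appear in month order (1 to 12), as the chart's x-axis suggests; the variable names are hypothetical.

```python
# Monthly crime counts from slide 49, assumed to be in month order 1..12.
crimes_per_month = [
    19169, 17384, 19730, 19413, 20645, 20494,
    21480, 21280, 21287, 21669, 19844, 21355,
]

total_crimes = sum(crimes_per_month)

# Months are 1-based, so add 1 to the list index.
peak_month = crimes_per_month.index(max(crimes_per_month)) + 1
lowest_month = crimes_per_month.index(min(crimes_per_month)) + 1

print(f"total={total_crimes}, peak month={peak_month}, lowest month={lowest_month}")
```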
  • 50. Raw Data Projection on Map
  • 51. Mapping of Crimes that Occurred within 5 Miles of CSULA
  • 52. Mapping of Crimes that Occurred within 5 Miles of UCLA
  • 53. Mapping of Crimes that Occurred within 5 Miles of USC
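The slides do not say how the 5-mile radius was computed. One common way to select points within a radius is the haversine great-circle distance; the sketch below shows that approach in plain Python as an assumption, not as the method used for the maps. The campus coordinate is approximate, and the two sample records and the helper names are made up.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def crimes_within(crimes, center, radius_miles=5.0):
    """Keep only the records whose location lies within the radius of center."""
    lat0, lon0 = center
    return [c for c in crimes
            if haversine_miles(lat0, lon0, c["lat"], c["lon"]) <= radius_miles]


CSULA = (34.0668, -118.1684)  # approximate campus location (assumption)

# Made-up records: one close to CSULA, one on the west side of Los Angeles.
sample_crimes = [
    {"id": 1, "lat": 34.0700, "lon": -118.1700},
    {"id": 2, "lat": 34.0689, "lon": -118.4452},
]
nearby = crimes_within(sample_crimes, CSULA)
print([c["id"] for c in nearby])
```

With Hive or Spark SQL the same formula can be written as a filter expression over the latitude and longitude columns.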
  • 54. No. of crimes within 5 miles of CSULA, UCLA, and USC by crime type (bar chart; y-axis 0 to 30,000; series: csula, ucla, usc)
  • 55. Hydrogen Gas Power Plant Prediction Model: The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
  • 56. Hydrogen Gas Power Plant Prediction Model: The station produces hydrogen for hydrogen vehicles (Hyundai, Toyota). The Cal State L.A. Hydrogen Research and Fueling Facility is the first station in the nation to sell hydrogen fuel to the public.
  • 57. Hydrogen Gas Power Plant Prediction Model: Workflow
  • 58. Hydrogen Gas Power Plant Prediction Model: Model by Manvi Chandra
  • 59. Hydrogen Gas Power Plant Prediction Model: Results and observations
  • 60. Hydrogen Gas Power Plant Prediction Model: Results and observations
    – Can predict vehicle pressure: the pressure of hydrogen gas within the vehicle’s hydrogen storage system
    – Using our model in Azure Visual Studio ML
    – Building a Spark ML decision forest regression
    – A decision forest constructs a multitude of decision trees at training time and outputs
      • the mode of the classes (classification), or
      • the mean prediction (regression) of the individual trees
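The aggregation step described on slide 60 (mode for classification, mean for regression) can be illustrated without any ML library. This is a minimal plain-Python sketch of only that final step, not the Azure or Spark ML model from the talk; the function names and the per-tree outputs are made up.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Classification: return the most common class among the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: return the mean of the individual trees' predictions."""
    return mean(tree_predictions)


# Made-up outputs of three trees, for illustration only.
votes = ["high", "low", "high"]
pressures = [340.0, 350.0, 360.0]

print(forest_classify(votes))    # majority class
print(forest_regress(pressures)) # average prediction
```

Predicting vehicle pressure is a regression task, so it corresponds to the mean-based branch.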
  • 61. Collaboration with City of Los Angeles: Wellness and Safety Analysis. How to improve the wellness and safety of the city:
    – Expand information sharing and performance metrics
    – Promote and improve City employee health and wellness
    – Develop and carry out the City’s safety training and injury prevention strategy
  • 62. Collaboration with City of Los Angeles (Cont’d): Procurement Analysis. How to improve the procurement of the city:
    – Pricing trends
    – Supplier diversity
    – Cost optimization
    – Invoicing/billing/payment trends
    – Material optimization
    – Resource/process efficiencies
  • 63. Contents: Myself; Introduction To Big Data; Introduction To Spark; Spark and Hadoop; Open Data and Use Cases; Hadoop Spark Training
  • 64. King Gwanghaegun and the Qing
  • 65. Battle of Sarhu. Illustration of the Battle of Sarhu from the Manchu Veritable Records (만주실록): a battle scene between the Later Jin and Ming forces
  • 66. Gang Hong-rip and the Battle of Bucha (Fuca). Manchu Veritable Records (만주실록): Manchu cavalry attacking the vanguard of Ming general Liu Ting’s army, part of the Joseon-Ming allied force
  • 67. Organization of the Joseon army. Joseon-side source: illustration of Joseon troops from the 정사4간본 edition of the Chungnyeollok (충렬록, 1770-1790), showing archers with bows and gunners with matchlock muskets
  • 68. Gang Hong-rip and the Battle of Bucha (Fuca)
  • 69. New technology development and education: Hadoop and Spark, a new supercomputer for R&D and value creation
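  • 70. Why is Hadoop and Spark training needed?
    – It is needed for creating new value and for R&D
    – Led by the USA, engineering, science, and business recognize the importance of Hadoop and Spark Big Data training: not only in data mining and analytics but in every field that has large-scale data
    – Even small and mid-sized companies can own a Hadoop cluster: an inexpensive supercomputer
    – However, nobody teaches Hadoop and Spark
    – Who should you get trained by?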
  • 71. How to start Hadoop training?
    – Limits of engineers’ self-study: it takes more than a year to become an expert; they don’t know the details; they miss many important topics
    – In 2014 we live in an age of experts and international competition: this is not a 1980s university lecture room
    – Saving on training costs? It reduces company productivity
    – Think USA! Training, Training, Training…
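  • 72. How to start Hadoop training? (Cont’d)
    – The limits of teaching yourself in the IT field need to be recognized
    – Industry, such as Silicon Valley, leads IT technology
    – Saving on training costs means falling behind in the Big Data industry
    – Industry training programs
    – Korea’s particular situation, government-led:
      • Training provided through government, experts, and companies is urgently needed
      • Public institutions and schools: government-led
      • Industry: self-sustaining
    – Hands-on equipment and code examples are needed to train practitioners, not “Theory Guys”
    – Encourage high-priced, high-quality training rather than low-cost training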
  • 73. Spark Training
    – California State University Los Angeles (Prof Jongwook Woo): supported by Databricks and its cloud computing services
    – UC Berkeley: edX (MOOC), AMPLab camp
    – Stanford
    – Cloudera, Hortonworks, DataStax: training courses
    – IBM Big University
  • 74. Databricks Partners
  • 75. Training Hadoop and Spark: Cloudera visits to interview Jongwook Woo
  • 76. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  • 77. Big Data Forum at UKC 2016. Forum Chair: jwoo5@calstatela.edu. Amazon AWS, Hortonworks, Couchbase, Qlik…
  • 78. Question?
  • 79. References
    – Hadoop, http://hadoop.apache.org
    – Apache Spark, Word Count Example, http://spark.apache.org
    – Databricks, http://www.databricks.com
    – “Market Basket Analysis using Spark”, Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp 207-209, ISSN 2225-7217, ARPN
    – https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
  • 80. References
    – Introduction to Big Data with Apache Spark, Databricks
    – Stanford Spark Class, http://stanford.edu/~rezab
    – Cornell University, CS5304
    – DS320: DataStax Enterprise Analytics with Spark
    – Cloudera, http://www.cloudera.com
    – Hortonworks, http://www.hortonworks.com
    – Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
  • 81. Dependency Types
    – “Wide” (shuffle) deps, the boundary of stages: groupByKey; join with inputs not co-partitioned
    – “Narrow” deps, a stage pipeline to be run on the same node: map, filter; union; join with inputs co-partitioned
  • 82. Dependency Types
    – “Narrow” deps, a stage pipeline to be run on the same node: map, filter; union; join with inputs co-partitioned
    – “Wide” (shuffle) deps, the boundary of stages: groupByKey; join with inputs not co-partitioned
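The rule that wide dependencies mark stage boundaries can be sketched for a simple linear chain of operations. This is a toy plain-Python model, not Spark's scheduler: real lineages are graphs, and the operation labels `join_copartitioned` and `join_not_copartitioned` and the function name `split_into_stages` are hypothetical. As a simplification, the wide operation is placed at the start of the new stage.

```python
# Operation names follow the slide; the two join labels are made-up identifiers.
NARROW_OPS = {"map", "filter", "union", "join_copartitioned"}
WIDE_OPS = {"groupByKey", "join_not_copartitioned"}


def split_into_stages(ops):
    """Split a linear chain of operations into stages at each wide dependency."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS:
            # A shuffle is needed, so a new stage begins here
            # (unless the current stage is still empty).
            if stages[-1]:
                stages.append([])
        elif op not in NARROW_OPS:
            raise ValueError(f"unknown operation: {op}")
        stages[-1].append(op)
    return stages


chain = ["map", "filter", "groupByKey", "map", "join_not_copartitioned", "filter"]
stages = split_into_stages(chain)
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {' -> '.join(stage)}")
```

Consecutive narrow operations end up in the same stage, which is what allows them to be pipelined on the same node.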
  • 83. Scheduler Optimizations
    – Pipelines within a stage: Stage 2 pipelines map and union
    – Stage 3: join algorithms based on partitioning (minimize shuffles)
    – Diagram: RDDs A to G across Stage 1, Stage 2, and Stage 3, with operations groupBy, map, union, and join; legend: previously computed partition, task
  • 84. Scheduler Optimizations
    – Conceptually: Stage 1: 3 tasks; Stage 2: 4 tasks; Stage 3: 3 tasks
    – Total: 3 stages, 10 tasks
    – Diagram: RDDs A to G across Stage 1, Stage 2, and Stage 3, with operations groupBy, map, union, and join; legend: previously computed partition, task
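The task total on slide 84 follows from Spark running one task per partition in each stage. The arithmetic can be written out as a short plain-Python sketch; the dictionary and function names are hypothetical, and the per-stage counts are the ones given on the slide.

```python
# Tasks per stage as stated on slide 84 (one task per partition in the stage).
tasks_per_stage = {"Stage 1": 3, "Stage 2": 4, "Stage 3": 3}


def count_tasks(stage_tasks):
    """Total number of tasks across all stages of a job."""
    return sum(stage_tasks.values())


total_tasks = count_tasks(tasks_per_stage)
print(f"{len(tasks_per_stage)} stages, {total_tasks} tasks")
```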