Jongwook Woo
HiPIC
CalStateLA
Seoul Elasticsearch Community Meetup
Gangnam, Korea
Aug 10 2017
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Data Collection and Visualization using Big Data:
Presidential Election 2017 in Korea
High Performance Information Computing Center
Jongwook Woo
CalStateLA
Contents
➢ Myself
➢ Introduction To Big Data
➢ Architecture
➢ Demo
Myself
Experience:
➢ Since 2002: Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
➢ Since Jan 2016: Co-Founder of The Big Link LLC and Wiken
➢ Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
➢ Since 2007: Exposed to Big Data at CitySearch.com
➢ 2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Collaborating with the City of Los Angeles since 2016
– Collect, search, and analyze city data
• Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Korea: Yonsei, Gachon, DongEui
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
Experience in Big Data
➢ Collaboration
➢ Council Member of IBM Spark Technology Center
➢ City of Los Angeles for OpenHub and Open Data
➢ Startup companies in Los Angeles
➢ External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
– The Big Link, Softzen, Wiken in Korea
➢ Grants and Awards
➢ Faculty Scholarship Winner of Teradata University Network 2017
➢ Research and education grants from IBM Bluemix, Microsoft Windows Azure, Amazon AWS
➢ Partnership
➢ Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
Contents
➢ Myself
➢ Introduction To Big Data
➢ Architecture
➢ Demo
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Built its own supercomputers
Published the GFS and MapReduce papers in 2003 and 2004
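The MapReduce idea named above can be illustrated with a single-process word count. This is only a sketch of the programming model (map, shuffle, reduce); it is not distributed, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names chosen for this example, not part of Hadoop or of the original slides.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the pairs."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data", "Big Data storage", "data compute"])
```

In a real cluster each `map_phase` call would run on the node that stores the input block, and the shuffle would move data across the network; here everything runs in one process to keep the three steps visible.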
Definition: Big Data
Inexpensive frameworks built as distributed parallel systems
that can store large-scale data and process it in parallel [1, 2]
Hadoop and Spark
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store data and run your applications
– In your university labs, small companies, research centers
Others
– Cloud Computing Big Data services
• Amazon AWS, IBM Bluemix, Microsoft Azure
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch
Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
HDFS
➢ Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming model with faster data sharing
Good for
– Iterative graph algorithms, machine learning
– Interactive queries
ElasticSearch
Full Text Search and Visualization Server
Getting more popular than Solr
ElasticSearch, Kibana, ES-Hadoop, Logstash,…
Based on Apache Lucene library
Horizontally Scalable
Elastic Stack
100% open source
No enterprise edition
All products released together as version 5.0
ElasticSearch
ES-Hadoop
Elasticsearch for Hadoop
• Exchange data between Hadoop HDFS and ElasticSearch
ElasticSearch
Contents
➢ Myself
➢ Introduction To Big Data
➢ Architecture
➢ Demo
Big Data Analysis Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau,…)
Data Visualization
– Qlik, Datameer, Excel PowerView
Data Engineering
Data Source
Twitter streaming API
– using the keywords
• "문재인", "moonriver365", "안철수", "cheolsoo0919", "유승민", "yooseongmin2017", "홍준표", "HongSkyangel808", "심상정", "sangjungsim"
– Roughly: April 28 2017 – May 11 2017
Data Collection
Apache NiFi for streaming data
– supports powerful and scalable directed graphs
• data routing, transformation, and system mediation logic
Data Storage
ElasticSearch
Hadoop HDFS at Azure
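To make the keyword-based collection above concrete, the following sketch checks which of the tracked keywords appear in a tweet's text. The keyword list is copied from the slide; the function name `matched_keywords` and the case-insensitive substring rule are assumptions made for this example and do not reproduce the exact matching rules of the Twitter streaming API.

```python
# Keywords from the slide: candidate names and their Twitter handles.
KEYWORDS = [
    "문재인", "moonriver365",
    "안철수", "cheolsoo0919",
    "유승민", "yooseongmin2017",
    "홍준표", "HongSkyangel808",
    "심상정", "sangjungsim",
]


def matched_keywords(text, keywords=KEYWORDS):
    """Return the tracked keywords found in the text, in list order.

    Assumption: a simple case-insensitive substring test is enough
    for illustration purposes.
    """
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


hits = matched_keywords("RT @moonriver365: 문재인 후보")
```

A filter like this could also be applied after collection, for example to tag each stored tweet with the keywords it matched before visualizing counts per candidate.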
Data Engineering (Cont’d)
Data Analysis and Prediction: future work
Spark ML, Spark SQL, Hadoop Hive
Data Visualization
Kibana in ElasticSearch
Apache NiFi
• NiFi 1.1.2 processors: GetTwitter, PutElasticsearch5, PutHDFS
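As a rough illustration of what an indexing step hands to Elasticsearch, the sketch below builds a request body in the newline-delimited format of the Elasticsearch bulk API (one action line followed by one document line per tweet). It does not show how the NiFi processor works internally. The function name `build_bulk_body`, the index name `elect17`, the type name `default`, and the use of the tweet's `id_str` as document id are all assumptions for this example.

```python
import json


def build_bulk_body(tweets, index="elect17", doc_type="default"):
    """Build an Elasticsearch bulk body: an action line, then the document.

    Assumptions: each tweet is a dict with an "id_str" field, and the
    index and type names are placeholders chosen for this sketch.
    """
    lines = []
    for tweet in tweets:
        action = {
            "index": {"_index": index, "_type": doc_type, "_id": tweet["id_str"]}
        }
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps Korean text readable in the payload.
        lines.append(json.dumps(tweet, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = build_bulk_body([{"id_str": "1", "text": "문재인"}])
```

The resulting string could be posted to the `_bulk` endpoint of a cluster; the same tweets would be written unchanged to HDFS by a separate step.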
Hadoop Spark Cluster: HDInsight in Azure
vCores | Memory (GB) | Local SSD (GB)
4 | 28 | 200
ElasticSearch in HDInsight
Did not launch the ElasticSearch service in Azure
Instead, installed ES5 on the Linux head node of the HDInsight cluster
– ElasticSearch
• 5.3.1
– Kibana
• 5.3.2
Mapping to ES
Temporal-Spatial Analysis
➢ The index template maps the Twitter date format to an ES date field and the tweet coordinates to a geo_point
curl -XPUT localhost:9200/_template/elect17 -d '
{
  "template" : "elect17*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "default" : {
      "properties" : {
        "created_at" : {
          "type" : "date",
          "format" : "EEE MMM dd HH:mm:ss Z YYYY"
        },
        "coordinates" : {
          "properties" : {
            "coordinates" : {
              "type" : "geo_point"
            },
            "type" : {
              "type" : "string"
            }
          }
        },
        "user" : {
          "properties" : {
            "screen_name" : {
              "type" : "string",
              "index" : "not_analyzed"
            },
            "lang" : {
              "type" : "string",
              "index" : "not_analyzed"
            }
          }
        }
      }
    }
  }
}'
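The `created_at` format in the template above corresponds to timestamps such as `Sun Apr 23 22:59:59 +0000 2017`, which appear in the collected tweets. As a small cross-check, the sketch below parses the same layout in Python. The `strptime` pattern is the Python counterpart of the pattern in the mapping, and the names `TWITTER_DATE_FORMAT` and `parse_created_at` are hypothetical names introduced for this example.

```python
from datetime import datetime, timezone

# Python equivalent of the "EEE MMM dd HH:mm:ss Z YYYY" pattern in the mapping.
# %a and %b expect English day and month names (the default C locale).
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a Twitter created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


parsed = parse_created_at("Sun Apr 23 22:59:59 +0000 2017")
```

Without an explicit format in the mapping, a string in this layout would not be recognized as a date, which is presumably why the template is installed before indexing starts.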
K-Election 2017
(April 29 – May 9)
ES-Hadoop
➢ Install ES-Hadoop
$ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.1.zip
$ unzip /tmp/elasticsearch-hadoop-5.3.1.zip -d /tmp
$ cp /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp/elasticsearch-hadoop-5.3.1.jar
$ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp
$ sudo cp elasticsearch-spark-20_2.11-5.3.1.jar /usr/hdp/current/spark2-client/
ES-Hadoop (Cont’d)
➢ Add the ES-Hadoop library to Hive with one of the following:
$ hive
hive> add jar hdfs:///tmp/elasticsearch-hadoop-5.3.1.jar;
hive> add jar /tmp/elasticsearch-hadoop-5.3.1.jar;
hive> add jar file:///tmp/elasticsearch-hadoop-5.3.1.jar;
hive> list jars;
file:///tmp/elasticsearch-hadoop-5.3.1.jar
ES-Hadoop (Cont’d)
hive> select * from elect17_test LIMIT 10;
OK
856281525070909440 NULL NULL NULL NULL RT @sydbris:
이 정도는 우리 문재인 후보님이 절대 말씀하시지 않겠지.
"넌 내가 유신 반대투쟁하고 민주화운동할 때 친구들이랑 고대 앞
하숙방에 모여서 xx 모의했냐?" Sun Apr 23 22:59:59 +0000 2017
856281524995407872 NULL NULL NULL NULL RT
@choomiae: 존경하는 시흥시민 여러분!
…
Contents
➢ Myself
➢ Introduction To Big Data
➢ Architecture
➢ Demo
Demo
Azure Portal
Ubuntu VM
ElasticSearch
NiFi
Kibana: April 29 – May 10
Hive with ES-Hadoop
Test with the data on April 23 – April 24
Spark Big Data Training and R&D
HiPIC
California State University Los Angeles
Supported by
– Databricks and its cloud computing services
– Amazon AWS, IBM Bluemix, MS Azure
– Hortonworks, Cloudera
– Teradata
– ElasticSearch
– Qlik, Tableau
Databricks Partners
Training Hadoop and Spark
Cloudera visited to interview Jongwook Woo
Training Hadoop on IBM Bluemix at
California State Univ. Los Angeles
Conclusion
K-Election 2017 in ES5 and HDInsight
ES5
Easy to collect and visualize data
HDInsight
Data analysis and predictive analysis are possible
Questions?
References
1. Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas, July 18-21, 2011
2. Jongwook Woo, “Market Basket Analysis Algorithms with MapReduce”, DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. Jongwook Woo, “Business Data Analysis LA at Databricks”, HiPIC of CalStateLA, https://docs.databricks.com/spark/latest/training/cal-state-la-biz-data-la.html
5. HiPIC of California State University Los Angeles, https://github.com/hipic/spark_mba
6. Hadoop, http://hadoop.apache.org
7. Databricks, http://www.databricks.com
8. DS320: DataStax Enterprise Analytics with Spark
9. Cloudera, http://www.cloudera.com
10. Hortonworks, http://www.hortonworks.com

Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 

Último (20)

DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
knowledge representation in artificial intelligence
knowledge representation in artificial intelligenceknowledge representation in artificial intelligence
knowledge representation in artificial intelligence
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 

President Election of Korea in 2017

  • 1. Jongwook Woo HiPIC CalStateLA Seoul Elasticsearch Community Meetup Gangnam, Korea Aug 10 2017 Jongwook Woo, PhD, jwoo5@calstatela.edu High-Performance Information Computing Center (HiPIC) California State University Los Angeles Data Collection and Visualization using Big Data: President Election 2017 in Korea
  • 2. High Performance Information Computing Center Jongwook Woo CalStateLA Contents: Myself, Introduction To Big Data, Architecture, Demo
  • 3. Myself. Experience: Since 2002: Professor at California State University Los Angeles – PhD in 2001: Computer Science and Engineering at USC. Since Jan 2016: Co-Founder of The Big Link LLC and Wiken. Since 1998: R&D consulting in Hollywood – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc. – Information Search and Integration with FAST, Lucene/Solr, Sphinx – implemented eBusiness applications using J2EE and middleware. Since 2007: Exposed to Big Data at CitySearch.com. 2012 - Present: Big Data Academic Partnerships – For Big Data research and training • Amazon AWS, Microsoft Azure, IBM Bluemix • Databricks, Hadoop vendors
  • 4. Myself. Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009. Collaborating with the City of Los Angeles since 2016 – Collect, Search, and Analyze City Data • Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera. Sept 2013: Samsung Advanced Technology Training Institute. Since 2008 – Introducing Hadoop Big Data and education to universities and research centers • Yonsei, Gachon, DongEui • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB • Europe: Univ of Luxembourg
  • 5. Experience in Big Data. Collaboration: Council Member of IBM Spark Technology Center; City of Los Angeles for OpenHub and Open Data; Startup Companies in Los Angeles; External Collaborator and Advisor in Big Data – IMSC of USC – Pennsylvania State University – The Big Link, Softzen, Wiken in Korea. Grants and Awards: Faculty Scholarship Winner of Teradata University Network 2017; IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant. Partnership: Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  • 6. Contents: Myself, Introduction To Big Data, Architecture, Demo
  • 7. Two Cores in Big Data: how to store Big Data and how to compute Big Data. Google: How to store Big Data – GFS – distributed systems on inexpensive commodity computers. How to compute Big Data – MapReduce – parallel computing with inexpensive computers. Own supercomputers. Published papers in 2003, 2004
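The MapReduce idea named on this slide can be illustrated with a minimal single-process sketch in Python. It is not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into one count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big cluster", "data store"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be distributed over inexpensive machines, which is the point the slide makes.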
  • 8. Definition: Big Data. Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]. Hadoop and Spark – Inexpensive supercomputer – More accessible to the public than traditional supercomputers • You can store and process your applications – In your university labs, small companies, research centers. Others – Cloud Computing Big Data services • Amazon AWS, IBM Bluemix, Microsoft Azure – NoSQL DB (Cassandra, MongoDB, Redis, HBase) – ElasticSearch
  • 9. Spark: In-Memory Data Computing. Faster than Hadoop MapReduce. Can integrate with Hadoop and its ecosystem: HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase… New programming model with faster data sharing. Good for iterative graph algorithms, Machine Learning, interactive query
  • 10. ElasticSearch: Full Text Search and Visualization Server. Getting more popular than Solr. ElasticSearch, Kibana, ES-Hadoop, Logstash,… Based on Apache Lucene library. Horizontally Scalable
  • 11. ElasticSearch: Elastic Stack. 100% open source. No enterprise edition. All new versions with 5.0
  • 12. ElasticSearch: ES-Hadoop (Elasticsearch for Hadoop) • Exchange data between Hadoop HDFS and ElasticSearch
  • 13. Contents: Myself, Introduction To Big Data, Architecture, Demo
  • 14. Big Data Analysis Flow. Data Collection: Batch API (Yelp, Google); Streaming (Twitter, Apache NiFi, Kafka, Storm); Open Data (Government). Data Storage: HDFS, S3, Object Storage, NoSQL DB (Couchbase)… Data Filtering: Hive, Pig. Data Analysis and Science: Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau,…). Data Visualization: Qlik, Datameer, Excel PowerView
  • 15. Data Engineering. Data Source: Twitter streaming API – using the keywords • "문재인", "moonriver365", "안철수", "cheolsoo0919", "유승민", "yooseongmin2017", "홍준표", "HongSkyangel808", "심상정", "sangjungsim" – Roughly: April 28 2017 – May 11 2017. Data Collection: Apache NiFi for streaming data – supports powerful and scalable directed graphs of • data routing, transformation, and system mediation logic. Data Storage: ElasticSearch, Hadoop HDFS at Azure
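To make the keyword list concrete, here is a small Python sketch that tags a collected tweet with the candidates it mentions. It assumes each Korean name on the slide is paired with the account handle that follows it; the romanized labels, the dictionary `CANDIDATE_KEYWORDS`, and the function `candidates_mentioned` are illustrative names, not part of the original pipeline. It uses plain case-insensitive substring matching, which is simpler than the matching rules of the Twitter streaming API.

```python
# Keywords from the slide, grouped by candidate. The grouping and the
# romanized labels are assumptions made for this illustration only.
CANDIDATE_KEYWORDS = {
    "Moon Jae-in": ["문재인", "moonriver365"],
    "Ahn Cheol-soo": ["안철수", "cheolsoo0919"],
    "Yoo Seong-min": ["유승민", "yooseongmin2017"],
    "Hong Joon-pyo": ["홍준표", "HongSkyangel808"],
    "Sim Sang-jung": ["심상정", "sangjungsim"],
}


def candidates_mentioned(text):
    """Return the candidates whose name or handle appears in the tweet text."""
    lowered = text.lower()
    return [
        name
        for name, keywords in CANDIDATE_KEYWORDS.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    ]


sample = "RT @moonriver365: 안철수 후보 토론"
print(candidates_mentioned(sample))
```

A tag like this could be stored as an extra field on each document so that a dashboard can split tweet volume per candidate.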
  • 16. Data Engineering (Cont’d). Data Analysis and Prediction (in the future): Spark ML, Spark SQL, Hadoop Hive. Data Visualization: Kibana in ElasticSearch
  • 17. Apache NiFi • NiFi-1.1.2: getTwitter, putElasticSearch5, putHDFS
  • 18. Hadoop Spark Cluster: HDInsight in Azure. vCores: 4; Memory (GB): 28; Local SSD (GB): 200
  • 19. ElasticSearch in HDInsight: Did not launch the ElasticSearch service in Azure. Instead, installed ES5 on the Linux head node of the HDInsight cluster – ElasticSearch 5.3.1 – Kibana 5.3.2
  • 20-22. Mapping to ES for Temporal-Spatial Analysis: an index template for matching the Twitter date format to ES
      curl -XPUT localhost:9200/_template/elect17 -d '
      {
        "template" : "elect17*",
        "settings" : { "number_of_shards" : 1 },
        "mappings" : {
          "default" : {
            "properties" : {
              "created_at" : { "type" : "date", "format" : "EEE MMM dd HH:mm:ss Z YYYY" },
              "coordinates" : {
                "properties" : {
                  "coordinates" : { "type" : "geo_point" },
                  "type" : { "type" : "string" }
                }
              },
              "user" : {
                "properties" : {
                  "screen_name" : { "type" : "string", "index" : "not_analyzed" },
                  "lang" : { "type" : "string", "index" : "not_analyzed" }
                }
              }
            }
          }
        }
      }'
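The date pattern in the mapping corresponds to Twitter's `created_at` strings such as `Sun Apr 23 22:59:59 +0000 2017`. As a quick sanity check outside Elasticsearch, the same layout can be parsed with Python's standard library. This is a sketch under assumptions: the `strptime` pattern below is my translation of the pattern in the template, the helper names are hypothetical, and the weekday and month abbreviations assume an English (C) locale. The second helper relies on the fact that Twitter's `coordinates` field is GeoJSON, which orders a point as longitude, then latitude.

```python
from datetime import datetime, timezone

# strptime equivalent of the "EEE MMM dd HH:mm:ss Z YYYY" pattern in the mapping
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a Twitter created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


def to_lat_lon(geojson_point):
    """Convert a GeoJSON [lon, lat] pair into an explicit lat/lon dict."""
    lon, lat = geojson_point
    return {"lat": lat, "lon": lon}


ts = parse_created_at("Sun Apr 23 22:59:59 +0000 2017")
print(ts.isoformat())
print(to_lat_lon([126.978, 37.5665]))
```

If a sample tweet fails to parse here, it is likely to be rejected by the date mapping as well, so a check like this can help while tuning the template.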
  • 23. K-Election 2017 (April 29 – May 9)
  • 24. K-Election 2017 (April 29 – May 9)
  • 25. ES-Hadoop: Install ES-Hadoop
      $ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.1.zip
      $ unzip /tmp/elasticsearch-hadoop-5.3.1.zip -d /tmp
      $ cp /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp/elasticsearch-hadoop-5.3.1.jar
      $ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp
      $ sudo cp elasticsearch-spark-20_2.11-5.3.1.jar /usr/hdp/current/spark2-client/
  • 26. ES-Hadoop (Cont’d): Add the ES-Hadoop library to Hive with one of the following add jar commands, then list the registered jars to confirm:
      $ hive
      hive> add jar hdfs:///tmp/elasticsearch-hadoop-5.3.1.jar;
      hive> add jar /tmp/elasticsearch-hadoop-5.3.1.jar;
      hive> add jar file:///tmp/elasticsearch-hadoop-5.3.1.jar;
      hive> list jar;
      file:///tmp/elasticsearch-hadoop-5.3.1.jar
  • 27. ES-Hadoop (Cont’d)
      hive> select * from elect17_test LIMIT 10;
      OK
      856281525070909440 NULL NULL NULL NULL RT @sydbris: 이 정도는 우리 문재인 후보님이 절대 말씀하시지 않겠지. "넌 내가 유신 반대투쟁하고 민주화운동할 때 친구들이랑 고대 앞 하숙방에 모여서 xx 모의했냐?" Sun Apr 23 22:59:59 +0000 2017
      856281524995407872 NULL NULL NULL NULL RT @choomiae: 존경하는 시흥시민 여러분! …
  • 28. Contents: Myself, Introduction To Big Data, Architecture, Demo
  • 29. Demo: Azure Portal; Ubuntu VM; ElasticSearch; NiFi; Kibana: April 29 – May 10; Hive with ES-Hadoop: test with the data on April 23 – April 24
  • 30. Spark Big Data Training and R&D: HiPIC, California State University Los Angeles. Supported by – Databricks and its cloud computing services – Amazon AWS, IBM Bluemix, MS Azure – Hortonworks, Cloudera – Teradata – ElasticSearch – Qlik, Tableau
  • 31. Databricks Partners
  • 32. Training Hadoop and Spark: Cloudera visits to interview Jongwook Woo
  • 33. Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
  • 34. Conclusion: K-Elect 2017 in ES5 and HDInsight. ES5: easy to collect and visualize data. HDInsight: data and predictive analysis possible
  • 35. Questions?
  • 36. References 1. “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011) 2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795 3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
  • 37. References (Cont’d) 4. Business Data Analysis LA at Databricks, HiPIC of CalStateLA, Jongwook Woo, https://docs.databricks.com/spark/latest/training/cal-state-la-biz-data-la.html 5. https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles 6. Hadoop, http://hadoop.apache.org 7. Databricks, http://www.databricks.com 8. DS320: DataStax Enterprise Analytics with Spark 9. Cloudera, http://www.cloudera.com 10. Hortonworks, http://www.hortonworks.com