Big Data on Networks with Hadoop and its ecosystem (Giraph, Flume, ...), presented at the Korea Institute of Science and Technology Information (KISTI). Illustrates some possible approaches to Big Data on networks.
1. Big Data and Data Intensive Computing on Networks
KISTI
Daejeon, Korea
Sept 23rd 2013
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
2. High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Introduction
Emerging Big Data Technology
Big Data Use Cases on Networks
Training in Big Data
Big Data Supporters
Hadoop 2.0
3. Me
Name: Jongwook Woo
Occupation:
Professor (rank: Associate Professor), California State University, Los Angeles
– Capital City of Entertainment
Career:
Professor since 2002: Computer Information Systems Dept, College of Business and Economics
– www.calstatela.edu/faculty/jwoo5
Consulting for many companies in and around Hollywood since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and information integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
Interested in Hadoop and Big Data since around 2009
4. Me
Career (continued):
Advising IglooSecurity as of summer 2013:
– Training on Hadoop and its ecosystem
– R&D on a system for fast search over security log files generated at 30GB – 100GB per day
• Using Hadoop, Solr, Java, Cloudera
Mid-September 2013: Samsung Advanced Institute of Technology
– 3-day training on Hadoop and its ecosystem scheduled
– Introducing Cloudera material to Samsung, Korea
5. Experience in Big Data
Grants
Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Received Academic Education Partnership with Cloudera since
June 2012
In contact with Hortonworks since May 2013
– Hortonworks is positive about providing a partnership
6. Experience in Big Data
Certificate
Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
Blog and Github for Hadoop and its ecosystems
http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
https://github.com/dalgual
7. Experience in Big Data
Several publications regarding Hadoop and NoSQL
“Scalable, Incremental Learning with MapReduce
Parallelization for Cell Detection in High-Resolution 3D
Microscopy Data”. Chul Sung, Jongwook Woo, Matthew
Goodman, Todd Huffman, and Yoonsuck Choe. in Proceedings
of the International Joint Conference on Neural Networks, 2013
“Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA
2012, Las Vegas (July 16-19, 2012)
“Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon
Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
“Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011,
Las Vegas (July 18-21, 2011)
Collaboration with Universities and companies
USC, Texas A&M, Yonsei, Sookmyung, KAIST, Korean Polytech Univ
Cloudera, Hortonworks, VanillaBreeze, IglooSecurity
8. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
9. Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
10. Emerging Big Data Technology
Giraph
Flume
Use Cases experienced
11. New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
12. Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Sensor Data, Bioinformatics, Social Computing,
smart phone, online game…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
13. Two Cores in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel computing with multiple inexpensive computers
– Like owning a supercomputer
14. Big Data Market
Big Data Market in the world
$16.9 Billion in 2015 by IDC
$53.4 Billion in 2017 by Wikibon
Big Data Market in Korea
Korea Information Society Development Institute
– $263 Million in 2015
– $853 Million in 2020
Big Data's share of Information and Communication Technology
– 0.6% in 2013
– 2.3% in 2020
15. Hadoop 1.0
Hadoop
MapReduce
HDFS
Restricted Parallel Programming
– Not suited to iterative algorithms
– Not suited to graph processing
Illustrate it with Ch3
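To make the restriction concrete, here is a minimal in-memory sketch of the MapReduce model in Python. It is not Hadoop code: `run_mapreduce`, `word_mapper`, and `sum_reducer` are hypothetical names chosen for illustration, and the sample lines are made up.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one MapReduce pass in memory: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # shuffle: collect values per key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def word_mapper(line):
    """Map step: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    """Reduce step: add up all counts emitted for one word."""
    return sum(values)

lines = ["big data on networks", "big data with hadoop"]
counts = run_mapreduce(lines, word_mapper, sum_reducer)
print(counts)
```

An iterative or graph algorithm would have to call `run_mapreduce` once per iteration. On Hadoop 1.0 each such call is a separate job that reads its input from HDFS and writes its output back, which is why the model is a poor fit for those workloads.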
16. Network Topology for Hadoop 1.0
Big Data Network Design Considerations by Cisco
(http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html)
17. Giraph
BSP (Bulk Synchronous Parallel)
Facebook
http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph
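The BSP model behind Giraph can be sketched in a few lines of plain Python. This is not the Giraph API: `bsp_shortest_paths` and the sample graph are hypothetical, and the sketch only shows the superstep pattern (receive messages, update vertex state, send messages, wait at a barrier) for single-source shortest paths.

```python
def bsp_shortest_paths(edges, source):
    """Vertex-centric BSP sketch: edges maps vertex -> [(neighbor, weight), ...]."""
    infinity = float("inf")
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(neighbor for neighbor, _ in targets)
    distance = {vertex: infinity for vertex in vertices}

    inbox = {source: [0]}            # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:                     # halt when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                # Only a vertex whose value improved sends messages to neighbors.
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox               # barrier: messages become visible next superstep
        supersteps += 1
    return distance, supersteps

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": [("d", 1)]}
dist, steps = bsp_shortest_paths(graph, "a")
print(dist, steps)
```

Unlike chained MapReduce jobs, the graph state stays in memory between supersteps, which is the main reason the BSP style suits iterative graph algorithms.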
18. Flume
Real-time data migration to Hadoop
Cloudera material
19. Security Issues in Big Data
Data can be collected from social networks
Each piece of data means little on its own
Data becomes meaningful once it is collected and related
– Hackers can use Big Data to analyze it
Big Data analysis can be a shield too
Even though hackers can also use it
21. APT
APT (Advanced Persistent Threat)
Select one target
– Government, bank
– By an expert group: terrorists, hackers
Collect and analyze data from the site
Use the latest hacking technology
22. BYOD
BYOD (Bring Your Own Device)
Personal device for business
– Efficient
– Connects to internal data and the internal network
But not secure
– The device can be lost
– Exposed to open networks outside the office
– A hacked personal device can be used to break into the network
23. Possible Solutions
BYOD
Hypervisors
– Two OSs on one device
• Private and business
Containerization
– Two sets of data for one application
• Private and business
24. Possible Solutions
Security Intelligence (SI)
Analyze IPS/IDS and Security events
3 Steps
– Data Collection
• Log Data, Event Data
– Data Analysis
• Pattern analysis, relationships among data
– Finding solutions or fixing the problems
• Build regulations
Using Big Data for SI
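The three steps above can be sketched on a toy log in Python. Everything here is an assumption made for illustration: the log format (`<timestamp> <source_ip> <event>`), the `LOGIN_FAILED` event name, the threshold of 3, and the `block` rule syntax are hypothetical and do not come from any real IPS/IDS product.

```python
from collections import Counter

# Hypothetical log format: "<timestamp> <source_ip> <event>"
SAMPLE_LOG = """\
2013-09-23T10:00:01 10.0.0.5 LOGIN_FAILED
2013-09-23T10:00:02 10.0.0.5 LOGIN_FAILED
2013-09-23T10:00:03 10.0.0.7 LOGIN_FAILED
2013-09-23T10:00:04 10.0.0.5 LOGIN_FAILED
2013-09-23T10:00:09 10.0.0.5 LOGIN_OK
"""

def collect(raw_text):
    """Step 1, data collection: parse raw log lines into event records."""
    events = []
    for line in raw_text.splitlines():
        parts = line.split()
        if len(parts) == 3:                     # skip malformed lines
            events.append({"time": parts[0], "source": parts[1], "event": parts[2]})
    return events

def analyze(events, event_type="LOGIN_FAILED", threshold=3):
    """Step 2, data analysis: find sources that repeat an event too often."""
    counts = Counter(e["source"] for e in events if e["event"] == event_type)
    return sorted(source for source, n in counts.items() if n >= threshold)

def build_rules(suspicious_sources):
    """Step 3, fixing the problem: turn findings into (hypothetical) rules."""
    return [f"block {source}" for source in suspicious_sources]

events = collect(SAMPLE_LOG)
suspicious = analyze(events)
rules = build_rules(suspicious)
print(rules)
```

At Big Data scale the counting in `analyze` is the part that would move to a MapReduce job, with the source address as the key.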
25. Use Cases experienced
Log Analysis at IglooSecurity Inc
Log files from IPS and IDS
– 1.5GB per day for each system
Extracting unusual cases using Hadoop,
Solr, Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm
Machine Learning for Image
Processing with Texas A&M
Hadoop Streaming API
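For the Market Basket Analysis use case listed above, one common formulation counts how often each pair of items appears in the same transaction. The Python sketch below follows that formulation in a map/reduce style; it is an illustrative assumption, not the code from the cited papers, and the function names and transactions are made up.

```python
from collections import Counter
from itertools import combinations

def map_transaction(items):
    """Map step: emit ((item_a, item_b), 1) for every item pair in one basket."""
    # Sorting makes (bread, milk) and (milk, bread) the same key.
    for pair in combinations(sorted(set(items)), 2):
        yield pair, 1

def reduce_counts(mapped):
    """Reduce step: sum the counts emitted for each pair."""
    totals = Counter()
    for pair, count in mapped:
        totals[pair] += count
    return totals

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "milk"],
    ["butter", "bread"],
]
mapped = (kv for basket in transactions for kv in map_transaction(basket))
pair_counts = reduce_counts(mapped)
print(pair_counts)
```

Because each basket is mapped independently and pairs are summed by key, the same logic can be split across many machines without coordination.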
26. Use Cases in Korea
SK Telecom
Seoul
Credit Cards
Hyundai Motors
27. SK Telecom
T Map
Collect GPS traffic data from taxis, buses, and rental cars
– Traffic data from 50,000 cars every 5 minutes
Give the quickest route to the destination
28. Seoul
Night Bus
Collect GPS traffic data from taxis
Find the most frequently traveled routes
– Build night bus lines along them
29. Credit Cards
Apps to find popular restaurants
Collect customer behavior from card usage at restaurants
Logic: frequency of visits to the same restaurant within 3 months
Show the popular restaurants
Credit cards for gas station discounts
Find customers using a card at gas stations that do not offer discounts
Sell them a new card that gives a discount at any station
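The restaurant logic can be read in more than one way. The Python sketch below assumes one reading: a restaurant is popular when many cardholders returned to it at least twice within the last 90 days. The function name, the 90-day window, the repeat threshold, and all sample data are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def popular_restaurants(payments, today, window_days=90, top_n=3):
    """Rank restaurants by how many cards visited them 2+ times in the window.

    payments: iterable of (card_id, restaurant, paid_on) tuples.
    """
    cutoff = today - timedelta(days=window_days)
    visits = Counter()
    for card_id, restaurant, paid_on in payments:
        if cutoff <= paid_on <= today:           # keep only recent payments
            visits[(card_id, restaurant)] += 1

    repeat_customers = Counter()
    for (card_id, restaurant), n in visits.items():
        if n >= 2:                               # the same card came back
            repeat_customers[restaurant] += 1
    return [name for name, _ in repeat_customers.most_common(top_n)]

payments = [
    ("c1", "Noodle House", date(2013, 9, 1)),
    ("c1", "Noodle House", date(2013, 8, 15)),
    ("c2", "Noodle House", date(2013, 7, 20)),
    ("c2", "Noodle House", date(2013, 9, 10)),
    ("c2", "Grill 21", date(2013, 9, 11)),
    ("c2", "Grill 21", date(2013, 8, 2)),
    ("c3", "Grill 21", date(2013, 3, 1)),        # outside the 90-day window
    ("c3", "Grill 21", date(2013, 9, 5)),
]
ranking = popular_restaurants(payments, today=date(2013, 9, 23))
print(ranking)
```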
30. Hyundai Motors
Improve the present and future models
Collect drivers’ behavior and the status of the cars
Collect any errors in the car
31. Use Cases
Presidential Election
Amazon AWS
HuffPost | AOL
Netflix
32. Presidential Election
People Behavior Analysis
Collect people’s data: credit card usage, car models, newspapers they read, Facebook, Twitter
For example, a pro-environmental campaign for
– Moms
• who send their kids to public school,
• who tweet about organic foods
33. HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
–Evaluate All New HuffPost User Comments
Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% Comments Every
Day
Article Classification
–Tag Articles for Advertising
• E.g.: scary, salacious, …
34. HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is already available.
– Mallet, libsvm, Weka, … support the necessary algorithms.
Bad news:
– Mahout doesn’t support the necessary algorithms yet.
– The other tools do not run natively on Hadoop.
Build a flexible ML platform running on Hadoop
Pig for the Hadoop implementation.
35. Netflix
Biggest video streaming company
Dominates the movie video industry
Using Amazon AWS
Customer Behavior Analysis
Recommendation Systems
Event to find the fastest MR algorithm for customer recommendation
36. Others
amazon.com
Recommends books to customers
Google
Detects influenza much earlier
– by analyzing the areas affected by influenza
Translator
– by analyzing data from many people
Siri of Apple
Natural language processing from the data of many people
37. Training Hadoop and Ecosystems
Self-study
Are you sure you know the details?
– Sqoop, Hive, Pig, Combiner, Partitioner, setting the number of Reducers, …
Training program
Cloudera, Hortonworks
– $2,500, hands-on exercises
– About Hadoop, HBase, Hive/Pig, data analysis, data mining, etc.
Educational Partnership with Cloudera
– Training people at Samsung using Cloudera’s material
Educational Partnership with Hortonworks
– Invited to train people at the Big Data center of Gyeonggi province using Hortonworks’ material
38. Hadoop 2.0: YARN
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other
commercial platforms
Tez – Generic framework to run a complex DAG
MPI: OpenMPI, MPICH2
Master-Worker
Machine Learning: Spark
Graph processing: Giraph
Enabled by allowing the use of paradigm-specific application masters
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
39. Big Data Supporters
Amazon AWS
Facebook
Twitter
Craigslist
40. Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB
41. Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching
grants from AWS
42. Facebook [7]
Using Apache HBase
For Titan and Puma
– Message Services
– ETL
HBase for FB
– Provide excellent write performance and good reads
– Nice features
• Scalable
• Fault Tolerance
• MapReduce
43. Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase
44. Puma: Facebook
ETL
Extract, Transform, Load
– Integrating data from many data sources into a data warehouse
Data analytics
– Domain owners’ web analytics for Ad and apps
• clicks, likes, shares, comments etc
ETL before Puma
8 – 24 hours
– Procedures: Scribe, HDFS, Hive, MySQL
ETL after Puma
Puma
– Real time MapReduce framework
2 – 30 secs
– Procedures: Scribe, HDFS, Puma, HBase
45. Twitter [8]
Three Challenges
Collecting Data
– Scribe, as at Facebook
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time
46. Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly
47. Hadoop Streaming
Hadoop MapReduce for non-Java code: Python, Ruby
Requirement
A running Hadoop installation
Needs the Hadoop Streaming API
– hadoop-streaming.jar
Needs Mapper and Reducer code
– Simple conversion from sequential code
STDIN > mapper > reducer > STDOUT
48. Hadoop Streaming
MapReduce Python execution
http://wiki.apache.org/hadoop/HadoopStreaming
Syntax
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
    -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
    -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
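The slides do not show the contents of mapper.py and reducer.py. A minimal word-count sketch of what the two scripts could contain is given below, under the assumption that the job counts words in the Shakespeare input. The `simulate` helper is hypothetical: it only mimics the sort between the map and reduce phases so the logic can be checked locally without a cluster.

```python
import io
from itertools import groupby

def mapper(stdin, stdout):
    """Possible mapper.py body: emit 'word<TAB>1' for every word read."""
    for line in stdin:
        for word in line.strip().lower().split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    """Possible reducer.py body: input arrives sorted by key; sum per word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")

def simulate(text):
    """Local stand-in for the pipeline: mapper, then sort, then reducer."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    reducer(shuffled, reduced)
    return reduced.getvalue()

result = simulate("to be or not to be\n")
print(result, end="")
```

In the real scripts each function would be called with `sys.stdin` and `sys.stdout`. The reducer can use `groupby` because Hadoop Streaming sorts the map output by key before it reaches the reducer.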
49. Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions, but mainly Hadoop
Storage: NoSQL DB
Computation: Hadoop MapReduce
Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns, …
Emerging Technology
Hadoop 2.0
Training is important