8. Motivating Example
20 billion web pages * 20KB per web page
◦ About 400TB
A computer reads 30-35MB/sec from disk
◦ It takes about 4 months to read 400TB
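A quick back-of-envelope check of these numbers, assuming a single disk reading sequentially at roughly 33MB/sec:

pages = 20e9                     # 20 billion web pages
page_size = 20 * 1024            # 20KB per page, in bytes
total_bytes = pages * page_size
read_rate = 33 * 1024 * 1024     # ~33MB/sec sequential disk read
seconds = total_bytes / read_rate
print(total_bytes / 1e12)                 # ~410 TB
print(seconds / (60 * 60 * 24 * 30))      # ~4.6 months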
10. Challenges
How do we distribute computation?
How can we make it easy to write
parallel programs?
How can we handle machine failure?
11. Approach
Distributed File Systems
◦ Google File System, Hadoop Distributed File System
Parallel Programming Framework
◦ MapReduce in Hadoop
12. Distributed File System
Problem
◦ If nodes fail, how do we store data persistently?
Solution
◦ Hadoop Distributed File System (HDFS), providing a global file namespace
Properties of Data for HDFS
◦ Huge files (~ xx TB)
◦ Data is rarely updated in place (i.e., files are immutable)
◦ Read/append operations are common
13. Distributed File System
Name Node
◦ Stores metadata
◦ Active/standby
Data Nodes
◦ A file is split into contiguous chunks
◦ Each chunk is ~64MB
◦ Each chunk is replicated ~3x
◦ Replicas are kept in different racks
Client (to access files)
◦ Contacts the name node to find the data nodes
◦ Connects directly to the data nodes to access data
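A minimal sketch of the read path described above; lookup_chunks and read_chunk are hypothetical stand-ins for the real HDFS client calls:

def read_file(name_node, path):
    # 1. Ask the name node which data nodes hold each chunk of the file
    chunk_locations = name_node.lookup_chunks(path)   # hypothetical: [(chunk_id, [data_nodes]), ...]
    data = b""
    for chunk_id, data_nodes in chunk_locations:
        # 2. Read the chunk directly from one of its ~3 replicas,
        #    falling back to another replica if a data node has failed
        for node in data_nodes:
            try:
                data += node.read_chunk(chunk_id)      # hypothetical data-node call
                break
            except ConnectionError:
                continue
    return data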
15. Overview
Sequentially read big data
Map
◦ Extract something you care about
Group by key
◦ Sort and shuffle
Reduce
◦ Aggregate, summarize, filter, transform
Write the result
18. Algorithm
Input
◦ A set of key/value pairs
A programmer specifies two methods
◦ Map(k,v) => <k’,v’>*
  Takes a key/value pair and outputs a set of key/value pairs
  Ex) key is the filename & value is a single line in the file
◦ Reduce(k’,<v’>*) => <k’,v’’>*
  All values v’ with the same key k’ are reduced together and processed in v’ order
20. Word Counting using MapReduce
Map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

Reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
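The same word count as a small, self-contained Python sketch that mimics the map, group-by-key, and reduce phases in memory (illustrative only; a real job runs these phases in parallel on Hadoop):

from collections import defaultdict

def map_phase(documents):
    # (document name, text) -> a stream of (word, 1) pairs
    for name, text in documents.items():
        for word in text.split():
            yield word, 1

def group_by_key(pairs):
    # The sort/shuffle step: collect all values for each key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # (word, [1, 1, ...]) -> (word, count)
    return {word: sum(counts) for word, counts in groups.items()}

docs = {"doc1": "big data big cluster", "doc2": "big data"}
print(reduce_phase(group_by_key(map_phase(docs))))
# {'big': 3, 'data': 2, 'cluster': 1}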
21. MapReduce Task
Partition the input data
Schedule program execution across nodes
Handle machine failures
Manage inter-node communication
23. Name Node
Task status
◦ Idle, in-progress, completed
Idle tasks get scheduled as workers become available
When a map task finishes, it sends the name node the locations and sizes of its intermediate files, one for each reduce worker
The name node pushes this information to the reducers
The name node regularly pings workers to detect failures
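A minimal sketch of this bookkeeping, using a hypothetical in-memory task table (the real master also tracks worker identities, intermediate file locations, etc.):

tasks = {t: "idle" for t in ["map-0", "map-1", "reduce-0"]}   # task -> status

def schedule_next_task():
    # Idle tasks get scheduled when a worker becomes available
    for task, status in tasks.items():
        if status == "idle":
            tasks[task] = "in-progress"
            return task
    return None

def on_map_finished(task, intermediate_files):
    # The finished map task reports the locations/sizes of its
    # intermediate files (one per reduce worker); the master records
    # completion and pushes the locations to the reducers.
    tasks[task] = "completed"
    return intermediate_files   # in a real master: notify the reduce workers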
24. Failover
Map worker failure
◦ Map tasks completed or in-progress at the worker are reset to idle
◦ Reduce workers are notified when a task is rescheduled on another worker
Reduce worker failure
◦ Only in-progress tasks are reset to idle
Master failure
◦ The MapReduce job is aborted and the client is notified
25. Set-up
Map tasks: M
Reduce tasks: R
M and R are much larger than the number of nodes in the cluster
◦ One chunk of data per map task is common
◦ Improves dynamic load balancing
◦ Speeds up recovery from worker failure
R is often smaller than M
◦ Note that the output is spread across R files
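An illustrative calculation of M (the 1TB input size is an assumption, chosen only to show the scale of M relative to a cluster):

input_size = 1 * 1024**4        # assume a 1TB input
chunk_size = 64 * 1024**2       # 64MB chunks, one chunk per map task
M = input_size // chunk_size
print(M)                        # 16384 map tasks, far more than the nodes in a typical cluster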
26. Combiners
A map task will produce many pairs (k,v1), (k,v2), … for the same key k
◦ e.g., popular words in word count
Pre-aggregate values at the mapper
◦ Combine(k, list(v1)) -> v2
◦ The combiner is often the same as the reduce function (as in word count)
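Continuing the word count sketch, a combiner pre-aggregates counts inside each map task so that only one (word, partial_count) pair per word is shuffled, instead of one pair per occurrence (illustrative Python, not the Hadoop combiner API):

from collections import Counter

def map_with_combiner(name, text):
    # Local pre-aggregation: emit (word, partial_count) per map task
    local_counts = Counter(text.split())
    for word, partial_count in local_counts.items():
        yield word, partial_count

print(list(map_with_combiner("doc1", "big data big cluster")))
# [('big', 2), ('data', 1), ('cluster', 1)]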
31. Mixed Entities in Web
The search result includes a mixture of web pages about different Tom Mitchells
Separate the web pages into different groups (called clusters)
Byung-Won On, Ingyu Lee and Dongwon Lee, Scalable clustering methods for the name disambiguation problem, Knowledge and Information Systems 31(1):129-151 (2012)
32. Clustering
Web pages of two different persons with the same name spelling are all mixed in the pool
[Figure: web pages a1-a3 (one person) and b1-b2 (another person) mixed together in one pool]
33. k-means
1. Randomly select the cluster centroids
2. Measure the distance between each centroid and each object
3. Assign each object to the nearest centroid
4. Choose a new centroid in each cluster based on the mean of that cluster
5. Repeat steps 2 to 4 until the convergence criterion is met
[Figure: points w1-w5 being reassigned to centroids over successive iterations]
34. k-means using MapReduce
Do
◦ Map
  Input is a data point; the k centers are broadcast
  Find the closest of the k centers for the input point
◦ Reduce
  Input is one of the k centers and all data points having that center as their closest center
  Calculate the new center from these data points
Until none of the new centers change
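A single iteration of this scheme as a minimal in-memory Python sketch (1-D points for brevity; a real job would broadcast the centers and run the map and reduce steps in parallel on Hadoop):

from collections import defaultdict

def closest(point, centers):
    # Map side: find the nearest of the k broadcast centers
    return min(centers, key=lambda c: abs(point - c))

def kmeans_iteration(points, centers):
    # Shuffle: group each point under its closest center
    groups = defaultdict(list)
    for p in points:
        groups[closest(p, centers)].append(p)
    # Reduce side: the new center is the mean of its assigned points
    return [sum(ps) / len(ps) for ps in groups.values()]

points = [1.0, 2.0, 9.0, 10.0]
centers = [2.0, 9.0]
while True:
    new_centers = kmeans_iteration(points, centers)
    if sorted(new_centers) == sorted(centers):   # convergence: centers unchanged
        break
    centers = new_centers
print(centers)   # [1.5, 9.5]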
38. Relational DB
Manage data in GB or TB
Store important data (transactions, personnel, …)
Guarantee both consistency and availability
Oracle, DB2, MS SQL Server, MySQL
39. Not Only SQL (NoSQL)
Manage unstructured data such as text in TB or PB
Guarantee partition tolerance
Guarantee either consistency or availability
Flexible schema
No SQL or join operations
Bigtable (Google), Dynamo (Amazon), HBase (Yahoo), Cassandra (Facebook), MongoDB, Neptune (NHN)
◦ Bigtable ≈ HBase ≈ Neptune
40. Neptune: For Managing Big Data
Analyzing log data from Internet portals or online game services
Calculating PageRank or similarities between web pages
Search personalization
Social network analysis, recommender systems, blog clustering, etc.
42. Component Nodes
Master Node
◦ Assigns tablets to TabletServers, considering the number and size of tablets
TabletServers
◦ Handle insert/delete requests from clients
◦ Store a few thousand tablets (100~200MB per tablet) => a few hundred GB per server
◦ In-memory & disk DB
◦ Merge tablets when the number of files grows (improves search performance)
◦ Split tablets when a file grows too large (improves performance)
Changelog Servers
◦ Store the transaction log
43. Data Format
Logical data unit
◦ Table
Row
◦ Rowkey created automatically by the system
◦ Sorted in ascending order
Column
◦ Column key, timestamp
◦ Sorted in lexical order
◦ A get operation returns the most recent data
◦ Column-oriented indexing
A table is divided into a set of tablets by rowkey
Tablets are stored across the cluster
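A rough sketch of this data model, assuming a Bigtable-style layout as described above (not Neptune's actual storage format): each cell is addressed by (rowkey, column key), keeps timestamped versions, and a get returns the most recent one.

import time

# table[rowkey][column_key] -> list of (timestamp, value) versions
table = {}

def put(rowkey, column_key, value):
    cells = table.setdefault(rowkey, {}).setdefault(column_key, [])
    cells.append((time.time(), value))

def get(rowkey, column_key):
    # A get returns the most recent version of the cell
    return max(table[rowkey][column_key])[1]   # latest timestamp wins

put("user#0001", "name", "Kim")      # "user#0001" is a made-up rowkey for illustration
put("user#0001", "name", "Lee")
print(get("user#0001", "name"))      # Lee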
45. Real Time Processing
Reasonable performance
◦ A few ms response time
In-memory DB
Minor compaction
◦ When a memory table is full
Major compaction
◦ Combine multiple tables for faster search operations
Garbage collection
48. Failover
* The active master sets the NEPTUNE_MASTER lock in Pleiades
* Pleiades releases the NEPTUNE_MASTER lock if the active master fails, and the slave master then acquires the lock
49. Concluding Remarks
Big Data
◦ Volume, Velocity, Variety
Store/Manage Big Data
◦ Hadoop, NoSQL in cluster
Parallel Programming
◦ MapReduce
Analytics (Mining & Visualization)
◦ Mahout, R
50. Future Plan: Infra
A pilot system for Big Data (Dec. 2012)
◦ 1 Management Server
◦ 1 Name Node
  2 CPUs * 6 cores, 24GB RAM, 1TB HDD & SSD
◦ 5 Data Nodes
  2 CPUs * 6 cores, 24GB RAM, 1TB HDD & SSD
◦ Rack mount
◦ Gigabit switch hub
◦ Hadoop & CDH
51. Future Plan: Research
Developing machine learning, modeling, and optimization algorithms for mining/visualizing public data at TB scale
Re-designing existing data mining algorithms using MapReduce
◦ Data mining algorithms: clustering, classification, probabilistic modeling, association rule mining, graph analysis, etc.
◦ Serial algorithms => parallel algorithms
52. Reference
K. Shim, MapReduce Algorithms for Big Data Analysis, VLDB 2012 Tutorial
J. Schindler, I/O Characteristics of NoSQL Databases, VLDB 2012 Tutorial
J. Leskovec, Mining Massive Datasets, available: http://cs246.stanford.edu
김형준, Neptune: A Large-Scale Distributed Data Management System, NHN Tech. Report, 2008
T. White, Hadoop: The Definitive Guide, O’Reilly, 2012