8. Motivating Example
20 billion web pages * 20KB per web page
◦ About 400TB
A computer reads 30-35MB/sec from disk
◦ It takes about 4 months to read 400TB
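A quick back-of-envelope check of these numbers, assuming a single disk reading sequentially at roughly 33MB/sec:

pages = 20e9                     # 20 billion web pages
page_size = 20 * 1024            # 20KB per page, in bytes
total_bytes = pages * page_size
read_rate = 33 * 1024 * 1024     # ~33MB/sec sequential disk read
seconds = total_bytes / read_rate
print(total_bytes / 1e12)                 # ~410 TB
print(seconds / (60 * 60 * 24 * 30))      # ~4.6 months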
10. Challenges
How do we distribute computation?
How can we make it easy to write
parallel programs?
How can we handle machine failure?
11. Approach
Distributed File Systems
◦ Google File System, Hadoop Distributed File System
Parallel Programming Framework
◦ MapReduce in Hadoop
12. Distributed File System
Problem
◦ If nodes fail, how do we store data persistently?
Solution
◦ Hadoop Distributed File System (HDFS), providing a global file namespace
Properties of Data for HDFS
◦ Huge files (~ xx TB)
◦ Data is rarely updated in place (i.e., files are immutable)
◦ Read/append operations are common
13. Distributed File System
Name Node
◦ Stores metadata
◦ Active/standby
Data Nodes
◦ A file is split into contiguous chunks
◦ Each chunk is ~64MB
◦ Each chunk is replicated ~3x
◦ Replicas are kept in different racks
Client (to access files)
◦ Contacts the name node to find the data nodes
◦ Connects directly to the data nodes to access data
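A minimal sketch of the read path described above; lookup_chunks and read_chunk are hypothetical stand-ins for the real HDFS client calls:

def read_file(name_node, path):
    # 1. Ask the name node which data nodes hold each chunk of the file
    chunk_locations = name_node.lookup_chunks(path)   # hypothetical: [(chunk_id, [data_nodes]), ...]
    data = b""
    for chunk_id, data_nodes in chunk_locations:
        # 2. Read the chunk directly from one of its ~3 replicas,
        #    falling back to another replica if a data node has failed
        for node in data_nodes:
            try:
                data += node.read_chunk(chunk_id)      # hypothetical data-node call
                break
            except ConnectionError:
                continue
    return data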
15. Overview
Sequentially read big data
Map
◦ Extract something you care about
Group by key
◦ Sort and shuffle
Reduce
◦ Aggregate, summarize, filter, transform
Write the result
18. Algorithm
Input
◦ A set of key/value pairs
A programmer specifies two methods
◦ Map(k,v) => <k’,v’>*
  Takes a key/value pair and outputs a set of key/value pairs
  Ex) key is the filename & value is a single line in the file
◦ Reduce(k’,<v’>*) => <k’,v’’>*
  All values v’ with the same key k’ are reduced together and processed in v’ order
20. Word Counting using MapReduce
Map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

Reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
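The same word count as a small, self-contained Python sketch that mimics the map, group-by-key, and reduce phases in memory (illustrative only; a real job runs these phases in parallel on Hadoop):

from collections import defaultdict

def map_phase(documents):
    # (document name, text) -> a stream of (word, 1) pairs
    for name, text in documents.items():
        for word in text.split():
            yield word, 1

def group_by_key(pairs):
    # The sort/shuffle step: collect all values for each key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # (word, [1, 1, ...]) -> (word, count)
    return {word: sum(counts) for word, counts in groups.items()}

docs = {"doc1": "big data big cluster", "doc2": "big data"}
print(reduce_phase(group_by_key(map_phase(docs))))
# {'big': 3, 'data': 2, 'cluster': 1}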
21. MapReduce Task
Partition the input data
Schedule program execution across nodes
Handle machine failures
Manage inter-node communication
23. Name Node
Task status
◦ Idle, in-progress, completed
Idle tasks get scheduled as workers become available
When a map task finishes, it sends the name node the locations and sizes of its intermediate files, one for each reduce worker
The name node pushes this information to the reducers
The name node regularly pings workers to detect failures
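A minimal sketch of this bookkeeping, using a hypothetical in-memory task table (the real master also tracks worker identities, intermediate file locations, etc.):

tasks = {t: "idle" for t in ["map-0", "map-1", "reduce-0"]}   # task -> status

def schedule_next_task():
    # Idle tasks get scheduled when a worker becomes available
    for task, status in tasks.items():
        if status == "idle":
            tasks[task] = "in-progress"
            return task
    return None

def on_map_finished(task, intermediate_files):
    # The finished map task reports the locations/sizes of its
    # intermediate files (one per reduce worker); the master records
    # completion and pushes the locations to the reducers.
    tasks[task] = "completed"
    return intermediate_files   # in a real master: notify the reduce workers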
24. Failover
Map worker failure
◦ Map tasks completed or in-progress at the worker are reset to idle
◦ Reduce workers are notified when a task is rescheduled on another worker
Reduce worker failure
◦ Only in-progress tasks are reset to idle
Master failure
◦ The MapReduce job is aborted and the client is notified
25. Set-up
Map tasks: M
Reduce tasks: R
M and R are much larger than the number of nodes in the cluster
◦ One chunk of data per map task is common
◦ Improves dynamic load balancing
◦ Speeds up recovery from worker failure
R is often smaller than M
◦ Note that the output is spread across R files
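An illustrative calculation of M (the 1TB input size is an assumption, chosen only to show the scale of M relative to a cluster):

input_size = 1 * 1024**4        # assume a 1TB input
chunk_size = 64 * 1024**2       # 64MB chunks, one chunk per map task
M = input_size // chunk_size
print(M)                        # 16384 map tasks, far more than the nodes in a typical cluster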
26. Combiners
A map task will produce many pairs (k,v1), (k,v2), … for the same key k
◦ e.g., popular words in word count
Pre-aggregate values at the mapper
◦ Combine(k, list(v1)) -> v2
◦ The combiner is often the same as the reduce function (as in word count)
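Continuing the word count sketch, a combiner pre-aggregates counts inside each map task so that only one (word, partial_count) pair per word is shuffled, instead of one pair per occurrence (illustrative Python, not the Hadoop combiner API):

from collections import Counter

def map_with_combiner(name, text):
    # Local pre-aggregation: emit (word, partial_count) per map task
    local_counts = Counter(text.split())
    for word, partial_count in local_counts.items():
        yield word, partial_count

print(list(map_with_combiner("doc1", "big data big cluster")))
# [('big', 2), ('data', 1), ('cluster', 1)]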
31. Mixed Entities in Web
The search result includes a mixture of web pages about different Tom Mitchells
Separate the web pages into different groups (called clusters)
Byung-Won On, Ingyu Lee and Dongwon Lee, Scalable clustering methods for the name disambiguation problem, Knowledge and Information Systems 31(1):129-151 (2012)
32. Clustering
Web pages of two different persons with the same name spelling are all mixed in the pool
[Figure: web pages a1-a3 (one person) and b1-b2 (another person) mixed together in one pool]
33. k-means
1. Randomly select the cluster centroids
2. Measure the distance between each centroid and each object
3. Assign each object to the nearest centroid
4. Choose a new centroid in each cluster based on the mean of that cluster
5. Repeat steps 2 to 4 until the convergence criterion is met
[Figure: points w1-w5 being reassigned to centroids over successive iterations]
34. k-means using MapReduce
Do
◦ Map
  Input is a data point; the k centers are broadcast
  Find the closest of the k centers for the input point
◦ Reduce
  Input is one of the k centers and all data points having that center as their closest center
  Calculate the new center from these data points
Until none of the new centers change
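A single iteration of this scheme as a minimal in-memory Python sketch (1-D points for brevity; a real job would broadcast the centers and run the map and reduce steps in parallel on Hadoop):

from collections import defaultdict

def closest(point, centers):
    # Map side: find the nearest of the k broadcast centers
    return min(centers, key=lambda c: abs(point - c))

def kmeans_iteration(points, centers):
    # Shuffle: group each point under its closest center
    groups = defaultdict(list)
    for p in points:
        groups[closest(p, centers)].append(p)
    # Reduce side: the new center is the mean of its assigned points
    return [sum(ps) / len(ps) for ps in groups.values()]

points = [1.0, 2.0, 9.0, 10.0]
centers = [2.0, 9.0]
while True:
    new_centers = kmeans_iteration(points, centers)
    if sorted(new_centers) == sorted(centers):   # convergence: centers unchanged
        break
    centers = new_centers
print(centers)   # [1.5, 9.5]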
38. Relational DB
Manage data in GB or TB
Store important data (transactions, personnel, …)
Guarantee both consistency and availability
Oracle, DB2, MS SQL Server, MySQL
39. Not Only SQL (NoSQL)
Manage unstructured data such as text in TB or PB
Guarantee partition tolerance
Guarantee either consistency or availability
Flexible schema
No SQL or join operations
Bigtable (Google), Dynamo (Amazon), HBase (Yahoo), Cassandra (Facebook), MongoDB, Neptune (NHN)
◦ Bigtable ≈ HBase ≈ Neptune
40. Neptune: For Managing Big Data
Analyzing log data from Internet portals or online game services
Calculating PageRank or similarities between web pages
Search personalization
Social network analysis, recommender systems, blog clustering, etc.
42. Component Nodes
Master Node
◦ Assigns tablets to TabletServers, considering the number and size of tablets
TabletServers
◦ Handle insert/delete requests from clients
◦ Store a few thousand tablets (100~200MB per tablet) => a few hundred GB per server
◦ In-memory & disk DB
◦ Merge tablets when the number of files grows (improves search performance)
◦ Split tablets when a file grows too large (improves performance)
Changelog Servers
◦ Store the transaction log
43. Data Format
Logical data unit
◦ Table
Row
◦ Rowkey created automatically by the system
◦ Sorted in ascending order
Column
◦ Column key, timestamp
◦ Sorted in lexical order
◦ A get operation returns the most recent data
◦ Column-oriented indexing
A table is divided into a set of tablets by rowkey
Tablets are stored across the cluster
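A rough sketch of this data model, assuming a Bigtable-style layout as described above (not Neptune's actual storage format): each cell is addressed by (rowkey, column key), keeps timestamped versions, and a get returns the most recent one.

import time

# table[rowkey][column_key] -> list of (timestamp, value) versions
table = {}

def put(rowkey, column_key, value):
    cells = table.setdefault(rowkey, {}).setdefault(column_key, [])
    cells.append((time.time(), value))

def get(rowkey, column_key):
    # A get returns the most recent version of the cell
    return max(table[rowkey][column_key])[1]   # latest timestamp wins

put("user#0001", "name", "Kim")      # "user#0001" is a made-up rowkey for illustration
put("user#0001", "name", "Lee")
print(get("user#0001", "name"))      # Lee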
45. Real Time Processing
Reasonable performance
◦ A few ms response time
In-memory DB
Minor compaction
◦ When a memory table is full
Major compaction
◦ Combine multiple tables for faster search operations
Garbage collection
48. Failover
* The active master sets the NEPTUNE_MASTER lock in Pleiades
* Pleiades releases the NEPTUNE_MASTER lock if the active master fails, and the slave master then acquires the lock
49. Concluding Remarks
Big Data
◦ Volume, Velocity, Variety
Store/Manage Big Data
◦ Hadoop, NoSQL in cluster
Parallel Programming
◦ MapReduce
Analytics (Mining & Visualization)
◦ Mahout, R
50. Future Plan: Infra
A pilot system for Big Data (Dec. 2012)
◦ 1 Management Server
◦ 1 Name Node
  2 CPUs * 6 cores, 24GB RAM, 1TB HDD & SSD
◦ 5 Data Nodes
  2 CPUs * 6 cores, 24GB RAM, 1TB HDD & SSD
◦ Rack mount
◦ Gigabit switch hub
◦ Hadoop & CDH
51. Future Plan: Research
Developing machine learning, modeling, and optimization algorithms for mining/visualizing public data at TB scale
Re-designing existing data mining algorithms using MapReduce
◦ Data mining algorithms: clustering, classification, probabilistic modeling, association rule mining, graph analysis, etc.
◦ Serial algorithms => parallel algorithms
52. Reference
K. Shim, MapReduce Algorithms for Big Data Analysis, VLDB 2012 Tutorial
J. Schindler, I/O Characteristics of NoSQL Databases, VLDB 2012 Tutorial
J. Leskovec, Mining Massive Datasets, available: http://cs246.stanford.edu
김형준, Neptune: A Large-Scale Distributed Data Management System, NHN Tech. Report, 2008
T. White, Hadoop: The Definitive Guide, O’Reilly, 2012