Sector is an open source cloud platform designed for data intensive computing. It provides several advantages over Hadoop: it is up to 2x faster, supports user defined functions, and exploits data locality and network topology. Sector uses a layered architecture comprising user defined functions, a distributed file system, and a UDP-based transport protocol. Experimental results show that Sector outperforms Hadoop on benchmarks and incurs less than a 5% performance penalty, relative to a local cluster, when run on distributed wide area clusters connected by 10 Gb/s networks.
1. Sector: An Open Source Cloud
for Data Intensive Computing
Robert Grossman
University of Illinois at Chicago
Open Data Group
Yunhong Gu
University of Illinois at Chicago
April 20, 2009
3. What is a Cloud?
Clouds provide on-demand resources or
services over a network with the scale and
reliability of a data center.
No standard definition.
Cloud architectures are not new.
What is new:
– Scale
– Ease of use
– Pricing model.
4. Categories of Clouds
On-demand resources & services over the
Internet at the scale of a data center
On-demand computing instances
– IaaS: Amazon EC2, S3, etc.; Eucalyptus
– supports many Web 2.0 users
On-demand computing capacity
– Data intensive computing
– (say 100 TB, 500 TB, 1PB, 5PB)
– GFS/MapReduce/Bigtable, Hadoop, Sector, …
5. Requirements for Clouds Designed for
Data Intensive Computing
              Scale to       Scale Across   Support Large   Security
              Data Centers   Data Centers   Data Flows
Business          X                                             X
E-science         X               X              X
Health-care       X                                             X
Sector/Sphere is a cloud designed for data intensive
computing supporting all four requirements.
6. Sector Overview
Sector is fast
– Over 2x faster than Hadoop using MalStone Benchmark
– Sector exploits data locality and network topology to improve
performance
Sector is easy to program
– Supports MapReduce style over (key, value) pairs
– Supports User-defined Functions over records
– Easy to process binary data (images, specialized formats, etc.)
Sector clouds can be wide area
10. Sector’s Layered Cloud Services
Sector's stack, from top to bottom:

  Applications
  Compute Services             – Sphere's UDFs
  Data Services
  Storage Services             – Sector's Distributed File System (SDFS)
  Routing & Transport Services – UDP-based Data Transport Protocol (UDT)
11. Computing an Inverted Index Using Hadoop's MapReduce

[Figure: MapReduce dataflow for building an inverted index.
Stage 1 (Map + Shuffle): process each HTML file (e.g., page_1 containing
word_x, word_y, word_z) and hash each (word, file_id) pair to a bucket by
the word's first character (Bucket-A … Bucket-Z).
Stage 2 (Sort + Reduce): sort each bucket on its local node and merge
entries for the same word, e.g. word_z → 1, 5, 10.]
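The two stages can be sketched in a few lines of Python (a single-process sketch; in Hadoop the buckets would live on different nodes and the stages would run in parallel):

```python
from collections import defaultdict

def map_page(page_id, words):
    """Stage 1: emit (bucket, (word, page_id)) pairs, bucketed by first char."""
    for word in words:
        yield word[0], (word, page_id)

def build_inverted_index(pages):
    # Shuffle: hash each (word, page_id) pair to a bucket by first character.
    buckets = defaultdict(list)
    for page_id, words in pages.items():
        for bucket, pair in map_page(page_id, words):
            buckets[bucket].append(pair)
    # Stage 2 (per bucket, on its local node): sort, then merge the same word.
    index = {}
    for pairs in buckets.values():
        for word, page_id in sorted(pairs):
            index.setdefault(word, []).append(page_id)
    return index

pages = {1: ["word_x", "word_y", "word_z"], 5: ["word_z"], 10: ["word_z"]}
print(build_inverted_index(pages))
# {'word_x': [1], 'word_y': [1], 'word_z': [1, 5, 10]}
```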
12. Idea 1 – Support UDFs Over Files
Think of MapReduce as
– Map acting on (text) records
– With fixed Shuffle and Sort
– Followed by Reducing acting on (text) records
We generalize this framework as follows:
– Support a sequence of User Defined Functions
(UDF) acting on segments (=chunks) of files.
– In both cases, framework takes care of assigning
nodes to process data, restarting failed processes,
etc.
13. Computing an Inverted Index Using
Sphere's User Defined Functions (UDFs)

[Figure: the same inverted-index dataflow as the Hadoop version, with each
stage expressed as a UDF. UDF1 (Map) processes each HTML file and emits
(word, file_id) pairs; UDF2 (Shuffle) hashes each pair to a bucket by the
word's first character; UDF3 (Sort) sorts each bucket on its local node;
UDF4 (Reduce) merges entries for the same word, e.g. word_z → 1, 5, 10.]
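The four-UDF dataflow above can be sketched as a chain of plain functions (the names and signatures here are illustrative, not Sphere's actual C++ API):

```python
from collections import defaultdict

# Hypothetical UDF chain mirroring the slide: UDF1 map, UDF2 shuffle,
# UDF3 sort, UDF4 reduce. In Sphere each stage is just another UDF
# applied to data segments.

def udf1_map(segment):                 # segment = (page_id, words)
    page_id, words = segment
    return [(word, page_id) for word in words]

def udf2_shuffle(pairs, buckets):      # hash on first char, as in the figure
    for word, page_id in pairs:
        buckets[word[0]].append((word, page_id))

def udf3_sort(bucket):
    return sorted(bucket)

def udf4_reduce(sorted_pairs):         # merge postings for the same word
    index = {}
    for word, page_id in sorted_pairs:
        index.setdefault(word, []).append(page_id)
    return index

def run_pipeline(segments):
    buckets = defaultdict(list)
    for seg in segments:
        udf2_shuffle(udf1_map(seg), buckets)
    index = {}
    for bucket in buckets.values():    # each bucket sorted/merged independently
        index.update(udf4_reduce(udf3_sort(bucket)))
    return index

segs = [(1, ["word_x", "word_y", "word_z"]), (5, ["word_z"]), (10, ["word_z"])]
print(run_pipeline(segs))
# {'word_x': [1], 'word_y': [1], 'word_z': [1, 5, 10]}
```

Because every stage is an ordinary UDF, the fixed Shuffle and Sort of MapReduce can be replaced or reordered, which is the generalization Idea 1 describes.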
16. Sector Programming Model
Sector dataset consists of one or more physical files
Sphere applies User Defined Functions over streams of
data consisting of data segments
Data segments can be data records, collections of data
records, or files
Examples of UDFs: Map function, Reduce function, Split
function for CART, etc.
Outputs of UDFs can be returned to originating node,
written to local node, or shuffled to another node.
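A minimal sketch of this model (all names below are hypothetical, not Sector's API): a driver applies a UDF to each segment of a stream and routes each output to the originating node, the local node, or a shuffle bucket.

```python
# Sketch of Sphere's processing model: a dataset is a stream of segments
# (records, record collections, or files); a UDF maps each segment to an
# output; a routing rule sends each output back to the client, to the
# local node, or to a named bucket on another node.

def process_stream(segments, udf, route):
    results = {"client": [], "local": [], "shuffled": {}}
    for seg in segments:
        out = udf(seg)
        dest = route(seg, out)
        if dest == "client":
            results["client"].append(out)
        elif dest == "local":
            results["local"].append(out)
        else:  # dest names a shuffle bucket on another node
            results["shuffled"].setdefault(dest, []).append(out)
    return results

# Example: a word-length UDF whose outputs are shuffled by first letter.
res = process_stream(["alpha", "beta", "apple"],
                     udf=len,
                     route=lambda seg, out: seg[0])
print(res["shuffled"])   # {'a': [5, 5], 'b': [4]}
```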
17. Idea 2: Add Security From the Start
[Figure: the Client and the Master each connect to the Security Server over
SSL; the Security Server holds the AAA data; the Master coordinates the
Slaves.]

The security server maintains information about users and slaves.
User access control: password and client IP address.
File level access control.
Messages are encrypted over SSL; a certificate is used for authentication.
Sector is HIPAA capable.
18. Idea 3: Extend the Stack
[Figure: two stacks side by side. Google (GFS/MapReduce/Bigtable) and Hadoop
provide Compute Services, Data Services, and Storage Services; Sector
provides the same three layers plus an explicit Routing & Transport
Services layer.]
19. Sector is Built on Top of UDT
• UDT is a specialized network transport protocol.
• UDT can take advantage of wide area, high performance 10 Gb/s networks.
• Sector is a wide area distributed file system built over UDT.
• Sector is layered over the native file system (vs. being a block-based
file system).
20. UDT Has Been Downloaded 25,000+ Times
udt.sourceforge.net
Users shown include Globus, Sterling Commerce, Power Folder, Movie2Me, and
Nifty TV.
21. Alternatives to TCP – AIMD Protocols

[Figure: the additive increase α(x) of the packet sending rate x, together
with the multiplicative decrease factor, for UDT, Scalable TCP, HighSpeed
TCP, and AIMD (TCP NewReno). UDT's increase α(x) decreases as the sending
rate x grows.]
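The contrast can be illustrated numerically. The α(x) = c/x form below is a stand-in for UDT's actual, more involved increase function; only the shape (increase shrinking as rate grows) is taken from the slide.

```python
# Standard AIMD (TCP NewReno): constant additive increase per RTT.
# UDT-style "decreasing increase": the increment alpha(x) shrinks as the
# sending rate x grows, so high-rate flows probe the network more gently.

def aimd_step(x, alpha=1.0):
    return x + alpha                 # constant additive increase

def daimd_step(x, c=100.0):
    return x + c / x                 # illustrative decreasing increase

x_aimd = x_daimd = 100.0
for _ in range(10):                  # 10 loss-free update intervals
    x_aimd = aimd_step(x_aimd)
    x_daimd = daimd_step(x_daimd)

print(x_aimd, round(x_daimd, 2))
# On packet loss, both families apply a multiplicative decrease to x.
```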
22. Using UDT Enables Wide Area Clouds
[Figure: distributed clusters linked across the wide area at 10 Gb/s per
application.]

Using UDT, Sector can take advantage of wide area, high performance
networks (10+ Gb/s).
24. Comparing Sector and Hadoop
                    Hadoop                    Sector
Storage cloud       Block-based file system   File-based
Programming model   MapReduce                 UDF & MapReduce
Protocol            TCP                       UDP-based protocol (UDT)
Replication         At time of writing        Periodically
Security            Not yet                   HIPAA capable
Language            Java                      C++
25. Open Cloud Testbed – Phase 1 (2008)
[Figure: testbed topology. Four racks at four sites, connected at 10+ Gb/s
over C-Wave, CENIC, Dragon, and MREN, running Hadoop, Sector/Sphere,
Thrift, and Eucalyptus.]

Phase 1: 4 racks, 120 nodes, 480 cores.
Each node in the testbed is a Dell 1435 computer with 12 GB memory, 1 TB
disk, a 2.0 GHz dual dual-core AMD Opteron 2212, and 1 Gb/s network
interface cards.
26. MalStone Benchmark
Benchmark developed by Open Cloud
Consortium for clouds supporting data
intensive computing.
Code to generate the required synthetic data is
available from code.google.com/p/malgen
Stylized analytic computation that is easy to
implement in MapReduce and its
generalizations.
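A stylized computation of this flavor can be sketched as follows; the record layout and the per-site ratio computed below are assumptions based on the public MalStone description, not taken from this talk.

```python
from collections import defaultdict

# Assumed record layout: (site, entity, week, marked), meaning an entity
# visited a site in a given week and was (or was not) later marked.
# For each site and week, compute the fraction of entities that visited
# the site up to that week and were marked there.

def malstone_b(records):
    visits = defaultdict(set)      # (site, week) -> entities seen that week
    marked = defaultdict(set)      # site -> entities marked at that site
    for site, entity, week, flag in records:
        visits[(site, week)].add(entity)
        if flag:
            marked[site].add(entity)
    ratios = {}
    for site in {s for s, _ in visits}:
        seen = set()
        for week in sorted(w for s, w in visits if s == site):
            seen |= visits[(site, week)]
            ratios[(site, week)] = len(seen & marked[site]) / len(seen)
    return ratios

recs = [("s1", "e1", 1, False), ("s1", "e2", 1, True), ("s1", "e3", 2, False)]
print(malstone_b(recs))   # e.g. ('s1', 1) -> 0.5
```

The per-record hash-to-site step maps naturally onto a Map phase and the per-site ratio onto a Reduce phase, which is why the benchmark is easy to express in MapReduce and its generalizations.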
27. MalStone B
[Figure: MalStone B schematic – entities visiting sites over time, across
windows d_{k-2}, d_{k-1}, d_k.]
28. MalStone B Benchmark
                            MalStone B
Hadoop v0.18.3              799 min
Hadoop Streaming v0.18.3    142 min
Sector v1.19                 44 min

# Nodes            20 nodes
# Records          10 billion
Size of dataset    1 TB
These are preliminary results; we expect them to change as we improve the
MalStone B implementations.
29. Terasort – Sector vs. Hadoop Performance

                 LAN     MAN       WAN 1             WAN 2
Number of cores  58      116       178               236
Hadoop (secs)    2252    2617      3069              3702
Sector (secs)    1265    1301      1430              1526
Locations        UIC     UIC, SL   UIC, SL, Calit2   UIC, SL, Calit2, JHU

All times in seconds.
30. With Sector, “Wide Area Penalty” < 5%
Used the Open Cloud Testbed and wide area 10 Gb/s networks.
Ran a data intensive computing benchmark on 4
clusters distributed across the U.S. vs one cluster
in Chicago.
Difference in performance less than 5% for
Terasort.
One expects quite different results, depending
upon the particular computation.
31. Penalty for Wide Area Cloud
Computing on Uncongested 10 Gb/s

                     28 Local   4 x 7 Distributed   Wide Area
                     Nodes      Nodes               "Penalty"
Hadoop, 3 replicas   8650       11600               34%
Hadoop, 1 replica    7300       9600                31%
Sector               4200       4400                4.7%

All times in seconds, using the MalStone A benchmark on the Open Cloud
Testbed.
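The penalty column can be recomputed directly from the reported times:

```python
# Wide area penalty = (distributed_time - local_time) / local_time,
# using the MalStone A times (in seconds) from the table above.

def penalty(local, distributed):
    return (distributed - local) / local

rows = {
    "Hadoop, 3 replicas": (8650, 11600),
    "Hadoop, 1 replica":  (7300, 9600),
    "Sector":             (4200, 4400),
}
for name, (loc, dist) in rows.items():
    print(f"{name}: {penalty(loc, dist):.1%}")
# Sector: (4400 - 4200) / 4200 ≈ 4.8%, matching the ~4.7% reported.
```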
32. For More Information & To Obtain Sector
To obtain Sector or learn more about it:
sector.sourceforge.net
To learn more about the Open Cloud Consortium
www.opencloudconsortium.org
For related work by Robert Grossman
blog.rgrossman.com, www.rgrossman.com
For related work by Yunhong Gu
www.lac.uic.edu/~yunhong