Sector is an open source cloud platform designed for data intensive computing. It provides several advantages over Hadoop: it is up to 2x faster, supports user defined functions, and exploits data locality and network topology. Sector uses a layered architecture comprising user defined functions, a distributed file system, and a UDP-based transport protocol. Experimental results show that Sector outperforms Hadoop on benchmarks and incurs less than a 5% performance penalty, relative to a local cluster, when run on distributed wide area clusters connected by 10 Gb/s networks.
1. Sector: An Open Source Cloud
for Data Intensive Computing
Robert Grossman
University of Illinois at Chicago
Open Data Group
Yunhong Gu
University of Illinois at Chicago
April 20, 2009
3. What is a Cloud?
Clouds provide on-demand resources or
services over a network with the scale and
reliability of a data center.
No standard definition.
Cloud architectures are not new.
What is new:
– Scale
– Ease of use
– Pricing model.
4. Categories of Clouds
On-demand resources & services over the
Internet at the scale of a data center
On-demand computing instances
– IaaS: Amazon EC2, S3, etc.; Eucalyptus
– supports many Web 2.0 users
On-demand computing capacity
– Data intensive computing
– (say 100 TB, 500 TB, 1PB, 5PB)
– GFS/MapReduce/Bigtable, Hadoop, Sector, …
5. Requirements for Clouds Designed for
Data Intensive Computing
              Scale to       Scale Across   Support Large   Security
              Data Centers   Data Centers   Data Flows
Business          X                                             X
E-science         X               X              X
Health-care       X                                             X
Sector/Sphere is a cloud designed for data intensive
computing supporting all four requirements.
6. Sector Overview
Sector is fast
– Over 2x faster than Hadoop using MalStone Benchmark
– Sector exploits data locality and network topology to improve
performance
Sector is easy to program
– Supports MapReduce style over (key, value) pairs
– Supports User-defined Functions over records
– Easy to process binary data (images, specialized formats, etc.)
Sector clouds can be wide area
10. Sector’s Layered Cloud Services
Sector's stack, from top to bottom:

  Applications
  Compute Services             – Sphere's UDFs
  Data Services
  Storage Services             – Sector's Distributed File System (SDFS)
  Routing & Transport Services – UDP-based Data Transport Protocol (UDT)
11. Computing an Inverted Index Using Hadoop's MapReduce

[Figure: MapReduce dataflow for building an inverted index.
Stage 1 (Map + Shuffle): process each HTML file (e.g., page_1 containing
word_x, word_y, word_z) and hash each (word, file_id) pair to a bucket by
the word's first character (Bucket-A … Bucket-Z).
Stage 2 (Sort + Reduce): sort each bucket on its local node and merge
entries for the same word, e.g. word_z → 1, 5, 10.]
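The two stages can be sketched in a few lines of Python (a single-process sketch; in Hadoop the buckets would live on different nodes and the stages would run in parallel):

```python
from collections import defaultdict

def map_page(page_id, words):
    """Stage 1: emit (bucket, (word, page_id)) pairs, bucketed by first char."""
    for word in words:
        yield word[0], (word, page_id)

def build_inverted_index(pages):
    # Shuffle: hash each (word, page_id) pair to a bucket by first character.
    buckets = defaultdict(list)
    for page_id, words in pages.items():
        for bucket, pair in map_page(page_id, words):
            buckets[bucket].append(pair)
    # Stage 2 (per bucket, on its local node): sort, then merge the same word.
    index = {}
    for pairs in buckets.values():
        for word, page_id in sorted(pairs):
            index.setdefault(word, []).append(page_id)
    return index

pages = {1: ["word_x", "word_y", "word_z"], 5: ["word_z"], 10: ["word_z"]}
print(build_inverted_index(pages))
# {'word_x': [1], 'word_y': [1], 'word_z': [1, 5, 10]}
```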
12. Idea 1 – Support UDFs Over Files
Think of MapReduce as
– Map acting on (text) records
– With fixed Shuffle and Sort
– Followed by Reducing acting on (text) records
We generalize this framework as follows:
– Support a sequence of User Defined Functions
(UDF) acting on segments (=chunks) of files.
– In both cases, framework takes care of assigning
nodes to process data, restarting failed processes,
etc.
13. Computing an Inverted Index Using
Sphere's User Defined Functions (UDFs)

[Figure: the same inverted-index dataflow as the Hadoop version, with each
stage expressed as a UDF. UDF1 (Map) processes each HTML file and emits
(word, file_id) pairs; UDF2 (Shuffle) hashes each pair to a bucket by the
word's first character; UDF3 (Sort) sorts each bucket on its local node;
UDF4 (Reduce) merges entries for the same word, e.g. word_z → 1, 5, 10.]
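The four-UDF dataflow above can be sketched as a chain of plain functions (the names and signatures here are illustrative, not Sphere's actual C++ API):

```python
from collections import defaultdict

# Hypothetical UDF chain mirroring the slide: UDF1 map, UDF2 shuffle,
# UDF3 sort, UDF4 reduce. In Sphere each stage is just another UDF
# applied to data segments.

def udf1_map(segment):                 # segment = (page_id, words)
    page_id, words = segment
    return [(word, page_id) for word in words]

def udf2_shuffle(pairs, buckets):      # hash on first char, as in the figure
    for word, page_id in pairs:
        buckets[word[0]].append((word, page_id))

def udf3_sort(bucket):
    return sorted(bucket)

def udf4_reduce(sorted_pairs):         # merge postings for the same word
    index = {}
    for word, page_id in sorted_pairs:
        index.setdefault(word, []).append(page_id)
    return index

def run_pipeline(segments):
    buckets = defaultdict(list)
    for seg in segments:
        udf2_shuffle(udf1_map(seg), buckets)
    index = {}
    for bucket in buckets.values():    # each bucket sorted/merged independently
        index.update(udf4_reduce(udf3_sort(bucket)))
    return index

segs = [(1, ["word_x", "word_y", "word_z"]), (5, ["word_z"]), (10, ["word_z"])]
print(run_pipeline(segs))
# {'word_x': [1], 'word_y': [1], 'word_z': [1, 5, 10]}
```

Because every stage is an ordinary UDF, the fixed Shuffle and Sort of MapReduce can be replaced or reordered, which is the generalization Idea 1 describes.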
16. Sector Programming Model
Sector dataset consists of one or more physical files
Sphere applies User Defined Functions over streams of
data consisting of data segments
Data segments can be data records, collections of data
records, or files
Examples of UDFs: Map function, Reduce function, Split
function for CART, etc.
Outputs of UDFs can be returned to originating node,
written to local node, or shuffled to another node.
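A minimal sketch of this model (all names below are hypothetical, not Sector's API): a driver applies a UDF to each segment of a stream and routes each output to the originating node, the local node, or a shuffle bucket.

```python
# Sketch of Sphere's processing model: a dataset is a stream of segments
# (records, record collections, or files); a UDF maps each segment to an
# output; a routing rule sends each output back to the client, to the
# local node, or to a named bucket on another node.

def process_stream(segments, udf, route):
    results = {"client": [], "local": [], "shuffled": {}}
    for seg in segments:
        out = udf(seg)
        dest = route(seg, out)
        if dest == "client":
            results["client"].append(out)
        elif dest == "local":
            results["local"].append(out)
        else:  # dest names a shuffle bucket on another node
            results["shuffled"].setdefault(dest, []).append(out)
    return results

# Example: a word-length UDF whose outputs are shuffled by first letter.
res = process_stream(["alpha", "beta", "apple"],
                     udf=len,
                     route=lambda seg, out: seg[0])
print(res["shuffled"])   # {'a': [5, 5], 'b': [4]}
```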
17. Idea 2: Add Security From the Start
[Figure: the Client and the Master each connect to the Security Server over
SSL; the Security Server holds the AAA data; the Master coordinates the
Slaves.]

The security server maintains information about users and slaves.
User access control: password and client IP address.
File level access control.
Messages are encrypted over SSL; a certificate is used for authentication.
Sector is HIPAA capable.
18. Idea 3: Extend the Stack
[Figure: two stacks side by side. Google (GFS/MapReduce/Bigtable) and Hadoop
provide Compute Services, Data Services, and Storage Services; Sector
provides the same three layers plus an explicit Routing & Transport
Services layer.]
19. Sector is Built on Top of UDT
• UDT is a specialized network transport protocol.
• UDT can take advantage of wide area, high performance 10 Gb/s networks.
• Sector is a wide area distributed file system built over UDT.
• Sector is layered over the native file system (vs. being a block-based
file system).
20. UDT Has Been Downloaded 25,000+ Times
udt.sourceforge.net
Users shown include Globus, Sterling Commerce, Power Folder, Movie2Me, and
Nifty TV.
21. Alternatives to TCP – AIMD Protocols

[Figure: the additive increase α(x) of the packet sending rate x, together
with the multiplicative decrease factor, for UDT, Scalable TCP, HighSpeed
TCP, and AIMD (TCP NewReno). UDT's increase α(x) decreases as the sending
rate x grows.]
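The contrast can be illustrated numerically. The α(x) = c/x form below is a stand-in for UDT's actual, more involved increase function; only the shape (increase shrinking as rate grows) is taken from the slide.

```python
# Standard AIMD (TCP NewReno): constant additive increase per RTT.
# UDT-style "decreasing increase": the increment alpha(x) shrinks as the
# sending rate x grows, so high-rate flows probe the network more gently.

def aimd_step(x, alpha=1.0):
    return x + alpha                 # constant additive increase

def daimd_step(x, c=100.0):
    return x + c / x                 # illustrative decreasing increase

x_aimd = x_daimd = 100.0
for _ in range(10):                  # 10 loss-free update intervals
    x_aimd = aimd_step(x_aimd)
    x_daimd = daimd_step(x_daimd)

print(x_aimd, round(x_daimd, 2))
# On packet loss, both families apply a multiplicative decrease to x.
```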
22. Using UDT Enables Wide Area Clouds
[Figure: distributed clusters linked across the wide area at 10 Gb/s per
application.]

Using UDT, Sector can take advantage of wide area, high performance
networks (10+ Gb/s).
24. Comparing Sector and Hadoop
                    Hadoop                    Sector
Storage cloud       Block-based file system   File-based
Programming model   MapReduce                 UDF & MapReduce
Protocol            TCP                       UDP-based protocol (UDT)
Replication         At time of writing        Periodically
Security            Not yet                   HIPAA capable
Language            Java                      C++
25. Open Cloud Testbed – Phase 1 (2008)
[Figure: testbed topology. Four racks at four sites, connected at 10+ Gb/s
over C-Wave, CENIC, Dragon, and MREN, running Hadoop, Sector/Sphere,
Thrift, and Eucalyptus.]

Phase 1: 4 racks, 120 nodes, 480 cores.
Each node in the testbed is a Dell 1435 computer with 12 GB memory, 1 TB
disk, a 2.0 GHz dual dual-core AMD Opteron 2212, and 1 Gb/s network
interface cards.
26. MalStone Benchmark
Benchmark developed by Open Cloud
Consortium for clouds supporting data
intensive computing.
Code to generate the required synthetic data is
available from code.google.com/p/malgen
Stylized analytic computation that is easy to
implement in MapReduce and its
generalizations.
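A stylized computation of this flavor can be sketched as follows; the record layout and the per-site ratio computed below are assumptions based on the public MalStone description, not taken from this talk.

```python
from collections import defaultdict

# Assumed record layout: (site, entity, week, marked), meaning an entity
# visited a site in a given week and was (or was not) later marked.
# For each site and week, compute the fraction of entities that visited
# the site up to that week and were marked there.

def malstone_b(records):
    visits = defaultdict(set)      # (site, week) -> entities seen that week
    marked = defaultdict(set)      # site -> entities marked at that site
    for site, entity, week, flag in records:
        visits[(site, week)].add(entity)
        if flag:
            marked[site].add(entity)
    ratios = {}
    for site in {s for s, _ in visits}:
        seen = set()
        for week in sorted(w for s, w in visits if s == site):
            seen |= visits[(site, week)]
            ratios[(site, week)] = len(seen & marked[site]) / len(seen)
    return ratios

recs = [("s1", "e1", 1, False), ("s1", "e2", 1, True), ("s1", "e3", 2, False)]
print(malstone_b(recs))   # e.g. ('s1', 1) -> 0.5
```

The per-record hash-to-site step maps naturally onto a Map phase and the per-site ratio onto a Reduce phase, which is why the benchmark is easy to express in MapReduce and its generalizations.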
27. MalStone B
[Figure: MalStone B schematic – entities visiting sites over time, across
windows d_{k-2}, d_{k-1}, d_k.]
28. MalStone B Benchmark
                            MalStone B
Hadoop v0.18.3              799 min
Hadoop Streaming v0.18.3    142 min
Sector v1.19                 44 min

# Nodes            20 nodes
# Records          10 billion
Size of dataset    1 TB
These are preliminary results; we expect them to change as we improve the
MalStone B implementations.
29. Terasort – Sector vs. Hadoop Performance

                 LAN     MAN       WAN 1             WAN 2
Number of cores  58      116       178               236
Hadoop (secs)    2252    2617      3069              3702
Sector (secs)    1265    1301      1430              1526
Locations        UIC     UIC, SL   UIC, SL, Calit2   UIC, SL, Calit2, JHU

All times in seconds.
30. With Sector, “Wide Area Penalty” < 5%
Used the Open Cloud Testbed and wide area 10 Gb/s networks.
Ran a data intensive computing benchmark on 4
clusters distributed across the U.S. vs one cluster
in Chicago.
Difference in performance less than 5% for
Terasort.
One expects quite different results, depending
upon the particular computation.
31. Penalty for Wide Area Cloud
Computing on Uncongested 10 Gb/s

                     28 Local   4 x 7 Distributed   Wide Area
                     Nodes      Nodes               "Penalty"
Hadoop, 3 replicas   8650       11600               34%
Hadoop, 1 replica    7300       9600                31%
Sector               4200       4400                4.7%

All times in seconds, using the MalStone A benchmark on the Open Cloud
Testbed.
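The penalty column can be recomputed directly from the reported times:

```python
# Wide area penalty = (distributed_time - local_time) / local_time,
# using the MalStone A times (in seconds) from the table above.

def penalty(local, distributed):
    return (distributed - local) / local

rows = {
    "Hadoop, 3 replicas": (8650, 11600),
    "Hadoop, 1 replica":  (7300, 9600),
    "Sector":             (4200, 4400),
}
for name, (loc, dist) in rows.items():
    print(f"{name}: {penalty(loc, dist):.1%}")
# Sector: (4400 - 4200) / 4200 ≈ 4.8%, matching the ~4.7% reported.
```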
32. For More Information & To Obtain Sector
To obtain Sector or learn more about it:
sector.sourceforge.net
To learn more about the Open Cloud Consortium
www.opencloudconsortium.org
For related work by Robert Grossman
blog.rgrossman.com, www.rgrossman.com
For related work by Yunhong Gu
www.lac.uic.edu/~yunhong