Sector: An Open Source Cloud
for Data Intensive Computing
        Robert Grossman
  University of Illinois at Chicago
        Open Data Group
            Yunhong Gu
  University of Illinois at Chicago



            April 20, 2009
Part 1. Varieties of Clouds




What is a Cloud?
 Clouds provide on-demand resources or
  services over a network with the scale and
  reliability of a data center.
 No standard definition.
 Cloud architectures are not new.
 What is new:
  – Scale
  – Ease of use
  – Pricing model.

Categories of Clouds

 On-demand resources & services over the
  Internet at the scale of a data center
 On-demand computing instances
  – IaaS: Amazon EC2, S3, etc.; Eucalyptus
  – supports many Web 2.0 users
 On-demand computing capacity
  – Data intensive computing
  – (say 100 TB, 500 TB, 1PB, 5PB)
  – GFS/MapReduce/Bigtable, Hadoop, Sector, …
Requirements for Clouds Designed for
 Data Intensive Computing
             Scale to Data   Scale Across   Support Large   Security
             Centers         Data Centers   Data Flows
 Business    X                                              X
 E-science   X               X              X
 Healthcare  X                                              X

Sector/Sphere is a cloud designed for data intensive
computing supporting all four requirements.
Sector Overview
 Sector is fast
   – Over 2x faster than Hadoop on the MalStone benchmark
   – Sector exploits data locality and network topology to improve
     performance
 Sector is easy to program
   – Supports MapReduce style over (key, value) pairs
   – Supports User-defined Functions over records
   – Easy to process binary data (images, specialized formats, etc.)
 Sector clouds can be wide area


Part 2. Sector Design




Google’s Layered Cloud Services

  Google’s Stack (top to bottom)
    Applications
    Compute Services:   Google’s MapReduce
    Data Services:      Google’s BigTable
    Storage Services:   Google File System (GFS)
Hadoop’s Layered Cloud Services

  Hadoop’s Stack (top to bottom)
    Applications
    Compute Services:   Hadoop’s MapReduce
    Data Services:      (no component shown)
    Storage Services:   Hadoop Distributed File System (HDFS)
Sector’s Layered Cloud Services

  Sector’s Stack (top to bottom)
    Applications
    Compute Services:               Sphere’s UDFs
    Data Services:                  (no component shown)
    Storage Services:               Sector’s Distributed File System (SDFS)
    Routing & Transport Services:   UDP-based Data Transport Protocol (UDT)
Computing an Inverted Index
Using Hadoop’s MapReduce

[Figure: two-stage inverted index computation.]
Stage 1: Process each HTML file and hash (word, file_id) pair to buckets.
  – Map: HTML page_1 (word_x word_y word_y word_z) -> (word_x, Page_1), (word_y, Page_1), (word_z, Page_1)
  – Shuffle: by 1st char into Bucket-A, Bucket-B, …, Bucket-Z
Stage 2: Sort each bucket on local node, merge the same word.
  – Sort, then Reduce: (word_z, Page_1), (word_z, Page_5), (word_z, Page_10) -> word_z: 1, 5, 10
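
The two stages above can be sketched on a single machine as follows. This is only an illustration of the data flow (map, shuffle by first character, sort, reduce); the function names and the in-memory buckets are assumptions made for the sketch and are not Hadoop or Sphere APIs.

```python
from collections import defaultdict


def map_page(page_id, text):
    """Stage 1 (map): emit one (word, page_id) pair per word occurrence."""
    return [(word, page_id) for word in text.split()]


def shuffle(pairs):
    """Stage 1 (shuffle): assign each pair to a bucket keyed by the word's first character."""
    buckets = defaultdict(list)
    for word, page_id in pairs:
        buckets[word[0]].append((word, page_id))
    return buckets


def sort_and_reduce(bucket):
    """Stage 2: sort one bucket, then merge the entries that share a word."""
    index = {}
    for word, page_id in sorted(bucket):
        pages = index.setdefault(word, [])
        if not pages or pages[-1] != page_id:  # skip repeated page ids
            pages.append(page_id)
    return index


def inverted_index(pages):
    """Run both stages over a dict of {page_id: text}."""
    pairs = [pair for page_id, text in pages.items() for pair in map_page(page_id, text)]
    index = {}
    for bucket in shuffle(pairs).values():
        index.update(sort_and_reduce(bucket))
    return index


# Sample input mirroring the slide: word_z appears on pages 1, 5, and 10.
pages = {1: "word_x word_y word_y word_z", 5: "word_z", 10: "word_z"}
index = inverted_index(pages)
print(index["word_z"])  # [1, 5, 10]
```

In a real cluster each bucket would live on a different node, so the sort and reduce of Stage 2 run locally and in parallel.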
Idea 1 – Support UDFs Over Files

 Think of MapReduce as
  – Map acting on (text) records
  – With fixed Shuffle and Sort
  – Followed by Reduce acting on (text) records
 We generalize this framework as follows:
  – Support a sequence of User Defined Functions
    (UDFs) acting on segments (= chunks) of files.
  – In both cases, the framework takes care of assigning
    nodes to process data, restarting failed processes,
    etc.
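
A minimal sketch of the generalization, assuming each UDF is a plain function from one segment to one segment. The helper names (`split_into_segments`, `apply_udfs`) and the two sample UDFs are hypothetical and only illustrate the idea of chaining UDFs over chunks of a file; they are not Sphere's interface.

```python
def split_into_segments(data, size):
    """Cut a file's bytes into fixed-size segments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def apply_udfs(segments, udfs):
    """Apply each UDF, in order, to every segment produced by the previous UDF."""
    for udf in udfs:
        segments = [udf(segment) for segment in segments]
    return segments


def to_upper(segment):
    """Sample UDF: transform the bytes of a segment."""
    return segment.upper()


def byte_count(segment):
    """Sample UDF: summarize a segment by its length."""
    return str(len(segment)).encode()


segments = split_into_segments(b"abcdefghij", 4)
result = apply_udfs(segments, [to_upper, byte_count])
print(result)  # [b'4', b'4', b'2']
```

Because the UDFs operate on raw segments rather than on (key, value) text records, the same pattern works for binary data.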
Computing an Inverted Index Using
Sphere’s User Defined Functions (UDF)

[Figure: the same two-stage inverted index computation, with each step expressed as a UDF.]
Stage 1: Process each HTML file and hash (word, file_id) pair to buckets.
  – UDF1 - Map: HTML page_1 (word_x word_y word_y word_z) -> (word_x, Page_1), (word_y, Page_1), (word_z, Page_1)
  – UDF2 - Shuffle: by 1st char into Bucket-A, Bucket-B, …, Bucket-Z
Stage 2: Sort each bucket on local node, merge the same word.
  – UDF3 - Sort
  – UDF4 - Reduce: (word_z, Page_1), (word_z, Page_5), (word_z, Page_10) -> word_z: 1, 5, 10
Applying UDF using Sector/Sphere

[Figure: an Application hands an input stream to the Sphere Client; several SPEs process it and produce an output stream.]
  1. Split data
  2. Locate & schedule SPE
  3. Collect results
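
The three steps (split, schedule, collect) can be imitated locally with a thread pool standing in for the SPEs. This is a sketch under that assumption only: the function names are hypothetical, and real scheduling in Sector/Sphere also has to locate the data, which is not modeled here.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_segments):
    """Step 1: split the input stream into roughly equal segments."""
    size = max(1, -(-len(records) // n_segments))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def word_count_udf(segment):
    """Sample UDF: count the words in all records of one segment."""
    return sum(len(record.split()) for record in segment)


def run(records, udf, n_workers=3):
    """Steps 2 and 3: hand each segment to a worker, then collect results in order."""
    segments = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(udf, segments))


records = ["a b", "c d e", "f", "g h"]
results = run(records, word_count_udf)
print(sum(results))  # 8
```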
Sphere’s UDF

  Input -> UDF -> Output

  Input -> UDF -> Intermediate -> UDF -> Output

  Input 1, Input 2 -> UDF -> Output
Sector Programming Model

 A Sector dataset consists of one or more physical files
 Sphere applies User Defined Functions over streams of
  data consisting of data segments
 Data segments can be data records, collections of data
  records, or files
 Examples of UDFs: Map function, Reduce function, Split
  function for CART, etc.
 Outputs of UDFs can be returned to the originating node,
  written to the local node, or shuffled to another node.

Idea 2: Add Security From the Start

 Security server maintains information about users
  and slaves.
 User access control: password and client IP address.
 File level access control.
 Messages are encrypted over SSL. Certificate is
  used for authentication.
 Sector is HIPAA capable.

[Diagram: Security Server, Master, Client, and Slaves; link labels: SSL (twice), AAA, data.]
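
As a rough illustration of the access checks listed above (password, client IP address, and file level access), the sketch below keeps a small in-memory user table. Everything in it is an assumption made for the example: the table layout, the user "alice", the salted PBKDF2 hashing, and the function names are not taken from Sector's security server.

```python
import hashlib
import hmac
import ipaddress


def hash_password(password, salt):
    """Derive a salted hash so the table never stores the plain password."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


# Hypothetical user table; a real security server would load this from storage.
USERS = {
    "alice": {
        "salt": b"demo-salt",
        "password_hash": hash_password("s3cret", b"demo-salt"),
        "allowed_networks": [ipaddress.ip_network("10.0.0.0/8")],
        "readable_files": {"/data/public.dat"},
    }
}


def authenticate(user, password, client_ip):
    """User access control: the password and the client IP address must both match."""
    record = USERS.get(user)
    if record is None:
        return False
    candidate = hash_password(password, record["salt"])
    if not hmac.compare_digest(candidate, record["password_hash"]):
        return False
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in record["allowed_networks"])


def may_read(user, path):
    """File level access control: the user must be granted this exact path."""
    record = USERS.get(user)
    return record is not None and path in record["readable_files"]


print(authenticate("alice", "s3cret", "10.1.2.3"))  # True
```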
Idea 3: Extend the Stack

  Google, Hadoop            Sector
    Compute Services          Compute Services
    Data Services             Data Services
    Storage Services          Storage Services
                              Routing & Transport Services
Sector is Built on Top of UDT
• UDT is a specialized network transport
  protocol.
• UDT can take advantage of wide area high
  performance 10 Gbps networks.
• Sector is a wide area distributed file system
  built over UDT.
• Sector is layered over the native file system (rather
  than being a block-based file system).

UDT Has Been Downloaded 25,000+ Times

  udt.sourceforge.net
  [Logos: Sterling Commerce, Movie2Me, Globus, Power Folder, Nifty TV]
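Alternatives to TCP –
Decreasing Increases AIMD Protocols

[Figure: increase of the packet sending rate as a function of the current rate x, for UDT, Scalable TCP, HighSpeed TCP, and AIMD (TCP NewReno). Legend: x is the packet sending rate; the decrease factor is listed separately.]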
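
For readers unfamiliar with AIMD, the sketch below shows the general shape of such a control loop: the sending rate grows by an increase term while there is no loss and is multiplied down by a decrease factor on loss. The `decreasing_increase` function is a made-up example of an increase that shrinks as the rate grows; it is not UDT's actual formula, and the capacity value is an assumption.

```python
def aimd_step(rate, loss, increase, decrease_factor):
    """One control step: additive increase without loss, multiplicative decrease on loss."""
    if loss:
        return rate * (1.0 - decrease_factor)
    return rate + increase(rate)


def constant_increase(rate):
    """Classic AIMD: the same increment at every sending rate."""
    return 1.0


def decreasing_increase(rate, capacity=1000.0):
    """Hypothetical decreasing increase: big steps far from capacity, small steps near it."""
    return max(1.0, (capacity - rate) / 10.0)


def steps_to_reach(target, increase, start=1.0):
    """Count loss-free steps needed for the rate to climb from start to target."""
    rate, steps = start, 0
    while rate < target:
        rate = aimd_step(rate, False, increase, 0.5)
        steps += 1
    return steps


slow = steps_to_reach(900.0, constant_increase)
fast = steps_to_reach(900.0, decreasing_increase)
print(slow, fast)
```

With these assumed parameters the decreasing-increase variant reaches a high rate in far fewer steps, which is the property that matters on long, high-bandwidth links.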
Using UDT Enables Wide Area Clouds

[Figure label: 10 Gbps per application]

 Using UDT, Sector can take advantage of wide
  area high performance networks (10+ Gbps)
Part 3. Experimental Studies




Comparing Sector and Hadoop
                    Hadoop                    Sector
Storage Cloud       Block-based file system   File-based
Programming Model   MapReduce                 UDF & MapReduce
Protocol            TCP                       UDP-based protocol (UDT)
Replication         At time of writing        Periodically
Security            Not yet                   HIPAA capable
Language            Java                      C++
Open Cloud Testbed – Phase 1 (2008)

[Figure labels: C-Wave, CENIC, Dragon, MREN; Hadoop, Sector/Sphere, Thrift, Eucalyptus]
Phase 1
 4 racks
 120 Nodes
 480 Cores
 10+ Gb/s
Each node in the testbed is a Dell 1435 computer with 12 GB memory, 1 TB
disk, a 2.0 GHz dual dual-core AMD Opteron 2212, and 1 Gb/s network interface
cards.
MalStone Benchmark

 Benchmark developed by the Open Cloud
  Consortium for clouds supporting data
  intensive computing.
 Code to generate the required synthetic data is
  available from code.google.com/p/malgen
 Stylized analytic computation that is easy to
  implement in MapReduce and its
  generalizations.

MalStone B

[Figure: entities and sites plotted over time, with time windows d(k-2), d(k-1), d(k).]
MalStone B Benchmark

                                        MalStone B
 Hadoop v0.18.3                         799 min
 Hadoop Streaming v0.18.3               142 min
 Sector v1.19                           44 min
 # Nodes                                20 nodes
 # Records                              10 Billion
 Size of Dataset                        1 TB

These are preliminary results and we expect these results to
change as we improve the implementations of MalStone B.
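
To make the comparison concrete, the ratios between the reported running times can be computed directly. The sketch uses only the three timings from the table above; the dictionary and function names are introduced here for illustration, and the results inherit the preliminary status of the measurements.

```python
# Reported MalStone B running times in minutes (20 nodes, 10 billion records, 1 TB).
malstone_b_minutes = {
    "Hadoop v0.18.3": 799,
    "Hadoop Streaming v0.18.3": 142,
    "Sector v1.19": 44,
}


def times_faster(system, baseline, timings):
    """How many times faster `system` ran than `baseline` (ratio of running times)."""
    return timings[baseline] / timings[system]


for baseline in ("Hadoop v0.18.3", "Hadoop Streaming v0.18.3"):
    ratio = times_faster("Sector v1.19", baseline, malstone_b_minutes)
    print(f"Sector vs {baseline}: {ratio:.1f}x")
```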

Terasort - Sector vs Hadoop Performance

                  LAN    MAN       WAN 1             WAN 2
Number of Cores   58     116       178               236
Hadoop (secs)     2252   2617      3069              3702
Sector (secs)     1265   1301      1430              1526
Locations         UIC    UIC, SL   UIC, SL, Calit2   UIC, SL, Calit2, JHU
 All times in seconds.
With Sector, “Wide Area Penalty” < 5%
 Used the Open Cloud Testbed and wide area
  10 Gb/s networks.
 Ran a data intensive computing benchmark on 4
  clusters distributed across the U.S. vs. one cluster
  in Chicago.
 Difference in performance was less than 5% for
  Terasort.
 One expects quite different results, depending
  upon the particular computation.
Penalty for Wide Area Cloud
Computing on Uncongested 10 Gb/s

                     28 Local Nodes   4 x 7 Distributed Nodes   Wide Area “Penalty”
Hadoop, 3 replicas   8650             11600                     34%
Hadoop, 1 replica    7300             9600                      31%
Sector               4200             4400                      4.7%

All times in seconds using the MalStone A benchmark on Open Cloud Testbed.
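
The penalty column appears to be the relative slowdown of the distributed run over the local run. A short sketch of that arithmetic, using only the timings from the table (the function and variable names are illustrative):

```python
def wide_area_penalty(local_secs, distributed_secs):
    """Relative slowdown of the distributed run compared with the local run."""
    return (distributed_secs - local_secs) / local_secs


# (28 local nodes, 4 x 7 distributed nodes), in seconds, from the table above.
timings = {
    "Hadoop, 3 replicas": (8650, 11600),
    "Hadoop, 1 replica": (7300, 9600),
    "Sector": (4200, 4400),
}

penalties = {name: wide_area_penalty(local, dist) for name, (local, dist) in timings.items()}
for name, penalty in penalties.items():
    print(f"{name}: {penalty:.1%}")
```

The computed values (about 34.1%, 31.5%, and 4.8%) agree with the slide's figures, which appear to be truncated rather than rounded.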

For More Information & To Obtain Sector
 To obtain Sector or learn more about it:
              sector.sourceforge.net
 To learn more about the Open Cloud Consortium
          www.opencloudconsortium.org
 For related work by Robert Grossman
   blog.rgrossman.com, www.rgrossman.com
 For related work by Yunhong Gu
            www.lac.uic.edu/~yunhong
Thank you!




             33

Mais conteúdo relacionado

Mais procurados

NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...
NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...
NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...Hanh Le Hieu
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Robert Grossman
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloHortonworks
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataMarco Garcia
 
The causes and consequences of too many bits
The causes and consequences of too many bitsThe causes and consequences of too many bits
The causes and consequences of too many bitsDipesh Lall
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
Cost model for RFID-based traceability information systems
Cost model for RFID-based traceability information systemsCost model for RFID-based traceability information systems
Cost model for RFID-based traceability information systemsMiguel Pardal
 
Query processing and optimization
Query processing and optimizationQuery processing and optimization
Query processing and optimizationArif A.
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveSharjeel Imtiaz
 
Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-researchsaintdevil163
 
HA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talkHA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talkSteve Loughran
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Cloudera, Inc.
 

Mais procurados (20)

NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...
NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed F...
NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F...
 
lec4_ref.pdf
lec4_ref.pdflec4_ref.pdf
lec4_ref.pdf
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache Accumulo
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Ceph
CephCeph
Ceph
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
 
The causes and consequences of too many bits
The causes and consequences of too many bitsThe causes and consequences of too many bits
The causes and consequences of too many bits
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Cost model for RFID-based traceability information systems
Cost model for RFID-based traceability information systemsCost model for RFID-based traceability information systems
Cost model for RFID-based traceability information systems
 
Hadoop
HadoopHadoop
Hadoop
 
Query processing and optimization
Query processing and optimizationQuery processing and optimization
Query processing and optimization
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
 
Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-research
 
HA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talkHA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talk
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 

Semelhante a An Open Source Cloud for Data Intensive Computing

Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionacogoluegnes
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfKUMARRISHAV37
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudLeons Petražickis
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Sector Sphere 2009
Sector Sphere 2009Sector Sphere 2009
Sector Sphere 2009lilyco
 

Semelhante a An Open Source Cloud for Data Intensive Computing (20)

Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and Assessment
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
NetCDF and HDF5
NetCDF and HDF5NetCDF and HDF5
NetCDF and HDF5
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Sector Sphere 2009
Sector Sphere 2009Sector Sphere 2009
Sector Sphere 2009
 

Mais de Robert Grossman

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanyRobert Grossman
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsRobert Grossman
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchRobert Grossman
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016Robert Grossman
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Robert Grossman
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...Robert Grossman
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Robert Grossman
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Robert Grossman
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsRobert Grossman
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Robert Grossman
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?Robert Grossman
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
 


An Open Source Cloud for Data Intensive Computing

  • 1. Sector: An Open Source Cloud for Data Intensive Computing
      Robert Grossman (University of Illinois at Chicago; Open Data Group) and Yunhong Gu (University of Illinois at Chicago). April 20, 2009.
  • 3. What is a Cloud?
      - Clouds provide on-demand resources or services over a network with the scale and reliability of a data center.
      - No standard definition.
      - Cloud architectures are not new.
      - What is new:
          - Scale
          - Ease of use
          - Pricing model
  • 4. Categories of Clouds
      - On-demand resources & services over the Internet at the scale of a data center
      - On-demand computing instances
          - IaaS: Amazon EC2, S3, etc.; Eucalyptus
          - Supports many Web 2.0 users
      - On-demand computing capacity
          - Data intensive computing
          - (say 100 TB, 500 TB, 1 PB, 5 PB)
          - GFS/MapReduce/Bigtable, Hadoop, Sector, …
  • 5. Requirements for Clouds Designed for Data Intensive Computing
                    Scale to Data   Scale Across   Support Large   Security
                    Centers         Data Centers   Data Flows
      Business      X                                              X
      E-science     X               X              X
      Healthcare    X                                              X
      Sector/Sphere is a cloud designed for data intensive computing supporting all four requirements.
  • 6. Sector Overview
      - Sector is fast
          - Over 2x faster than Hadoop using the MalStone Benchmark
          - Sector exploits data locality and network topology to improve performance
      - Sector is easy to program
          - Supports MapReduce style over (key, value) pairs
          - Supports User-defined Functions over records
          - Easy to process binary data (images, specialized formats, etc.)
      - Sector clouds can be wide area
  • 7. Part 2. Sector Design
  • 8. Google’s Layered Cloud Services (Google’s Stack)
      - Applications
      - Compute Services: Google’s MapReduce
      - Data Services: Google’s BigTable
      - Storage Services: Google File System (GFS)
  • 9. Hadoop’s Layered Cloud Services (Hadoop’s Stack)
      - Applications
      - Compute Services: Hadoop’s MapReduce
      - Data Services
      - Storage Services: Hadoop Distributed File System (HDFS)
  • 10. Sector’s Layered Cloud Services (Sector’s Stack)
      - Applications
      - Compute Services: Sphere’s UDFs
      - Data Services
      - Storage Services: Sector’s Distributed File System (SDFS)
      - Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
  • 11. Computing an Inverted Index Using Hadoop’s MapReduce
      [Figure: two-stage data flow for an example page, HTML page_1, containing word_x, word_y, word_y, word_z.]
      - Stage 1 (Map, Shuffle): process each HTML file and hash each (word, file_id) pair to buckets (Bucket-A, Bucket-B, ..., Bucket-Z) by the word's first character.
      - Stage 2 (Sort, Reduce): sort each bucket on the local node and merge the same word; for example, (word_z, Page_1), (word_z, Page_5), (word_z, Page_10) become word_z: 1, 5, 10.
  • 12. Idea 1 – Support UDFs Over Files
      - Think of MapReduce as
          - Map acting on (text) records
          - With fixed Shuffle and Sort
          - Followed by Reduce acting on (text) records
      - We generalize this framework as follows:
          - Support a sequence of User Defined Functions (UDFs) acting on segments (= chunks) of files.
          - In both cases, the framework takes care of assigning nodes to process data, restarting failed processes, etc.
  • 13. Computing an Inverted Index Using Sphere’s User Defined Functions (UDF)
      [Figure: the same two-stage data flow as the MapReduce version, with each step expressed as a UDF, for an example page, HTML page_1, containing word_x, word_y, word_y, word_z.]
      - Stage 1 (UDF1 - Map, UDF2 - Shuffle): process each HTML file and hash each (word, file_id) pair to buckets (Bucket-A, Bucket-B, ..., Bucket-Z) by the word's first character.
      - Stage 2 (UDF3 - Sort, UDF4 - Reduce): sort each bucket on the local node and merge the same word; for example, (word_z, Page_1), (word_z, Page_5), (word_z, Page_10) become word_z: 1, 5, 10.
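The four UDF steps on slide 13 can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Sphere's C++ API: the function names (`udf_map`, `udf_shuffle`, `udf_sort`, `udf_reduce`, `build_inverted_index`) and the whitespace tokenizer are assumptions made for the example.

```python
from collections import defaultdict

def udf_map(page_id, text):
    """UDF1 (Map): emit one (word, page_id) pair per word on the page."""
    return [(word, page_id) for word in text.split()]

def udf_shuffle(pairs):
    """UDF2 (Shuffle): hash each pair to a bucket keyed by the word's first character."""
    buckets = defaultdict(list)
    for word, page_id in pairs:
        buckets[word[0]].append((word, page_id))
    return buckets

def udf_sort(bucket):
    """UDF3 (Sort): sort one bucket locally by (word, page_id)."""
    return sorted(bucket)

def udf_reduce(sorted_bucket):
    """UDF4 (Reduce): merge entries for the same word into one posting list."""
    index = {}
    for word, page_id in sorted_bucket:
        postings = index.setdefault(word, [])
        if not postings or postings[-1] != page_id:  # skip repeats within a page
            postings.append(page_id)
    return index

def build_inverted_index(pages):
    """Chain the four UDFs over a dict of {page_id: text} (hypothetical driver)."""
    pairs = [pair for page_id, text in pages.items() for pair in udf_map(page_id, text)]
    index = {}
    for bucket in udf_shuffle(pairs).values():
        index.update(udf_reduce(udf_sort(bucket)))
    return index

# Example input modeled on the slide: word_z appears on pages 1, 5 and 10.
pages = {1: "word_x word_y word_y word_z", 5: "word_z", 10: "word_z"}
index = build_inverted_index(pages)
print(index["word_z"])
```

In a real deployment each bucket would be processed on the node that holds it; here the buckets are simply iterated in one process.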
  • 14. Applying UDF using Sector/Sphere
      [Figure: Application, Sphere Client, input stream, several SPEs, output stream.]
      1. Split data
      2. Locate & schedule SPE
      3. Collect results
  • 15. Sphere’s UDF
      [Figure: three UDF patterns.]
      - Input, UDF, Output
      - Input, UDF, Intermediate, UDF, Output
      - Input 1 and Input 2, UDF, Output
  • 16. Sector Programming Model
      - A Sector dataset consists of one or more physical files.
      - Sphere applies User Defined Functions over streams of data consisting of data segments.
      - Data segments can be data records, collections of data records, or files.
      - Examples of UDFs: Map function, Reduce function, Split function for CART, etc.
      - Outputs of UDFs can be returned to the originating node, written to the local node, or shuffled to another node.
  • 17. Idea 2: Add Security From the Start
      [Figure: Security Server, Master, Client and Slaves, connected over SSL; the Security Server holds AAA data.]
      - Security server maintains information about users and slaves.
      - User access control: password and client IP address.
      - File level access control.
      - Messages are encrypted over SSL. Certificate is used for authentication.
      - Sector is HIPAA capable.
  • 18. Idea 3: Extend the Stack
      - Google, Hadoop: Compute Services, Data Services, Storage Services
      - Sector: Compute Services, Data Services, Storage Services, Routing & Transport Services
  • 19. Sector is Built on Top of UDT
      - UDT is a specialized network transport protocol.
      - UDT can take advantage of wide area high performance 10 Gbps networks.
      - Sector is a wide area distributed file system built over UDT.
      - Sector is layered over the native file system (vs being a block-based file system).
  • 20. UDT Has Been Downloaded 25,000+ Times
      - udt.sourceforge.net
      - Users shown: Sterling Commerce, Movie2Me, Globus, Power Folder, Nifty TV
  • 21. Alternatives to TCP – Decreasing Increases AIMD Protocols
      [Figure: the increase of the packet sending rate as a function of the sending rate x, and the decrease factor, compared for UDT, Scalable TCP, HighSpeed TCP, and AIMD (TCP NewReno).]
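Slide 21 compares protocols by how the increase in sending rate depends on the current rate x. A minimal sketch of that idea follows, assuming the generic AIMD update: add an increment alpha(x) when no loss is seen, and multiply by (1 - beta) on loss. Both increment functions, the capacity value, and all constants are hypothetical and chosen only to show the contrast; they are not the actual control laws of UDT or any TCP variant.

```python
from typing import Callable

def aimd_step(rate: float, loss: bool,
              alpha: Callable[[float], float], beta: float) -> float:
    """One control step: additive increase by alpha(rate), or multiplicative decrease."""
    if loss:
        return (1.0 - beta) * rate
    return rate + alpha(rate)

def constant_alpha(rate: float) -> float:
    """Classic AIMD: the increment does not depend on the current rate."""
    return 1.0

def decreasing_alpha(rate: float, capacity: float = 1000.0) -> float:
    """Hypothetical 'decreasing increase': large steps when far below an
    assumed capacity, small steps when close to it."""
    return max(0.1 * (capacity - rate), 1.0)

def ramp_up(alpha: Callable[[float], float], steps: int = 20) -> float:
    """Sending rate after `steps` loss-free control steps, starting from rate 1."""
    rate = 1.0
    for _ in range(steps):
        rate = aimd_step(rate, loss=False, alpha=alpha, beta=0.5)
    return rate

constant_rate = ramp_up(constant_alpha)
decreasing_rate = ramp_up(decreasing_alpha)
print(f"constant increase:   {constant_rate:.1f}")
print(f"decreasing increase: {decreasing_rate:.1f}")
```

Under these assumptions the decreasing-increase rule approaches the assumed capacity within a few steps, while the constant rule grows by one unit per step, which is the behaviour that matters on high bandwidth, long delay links.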
  • 22. Using UDT Enables Wide Area Clouds
      [Figure label: 10 Gbps per application.]
      - Using UDT, Sector can take advantage of wide area high performance networks (10+ Gbps).
  • 23. Part 3. Experimental Studies
  • 24. Comparing Sector and Hadoop
                           Hadoop                    Sector
      Storage Cloud        Block-based file system   File-based
      Programming Model    MapReduce                 UDF & MapReduce
      Protocol             TCP                       UDP-based protocol (UDT)
      Replication          At time of writing        Periodically
      Security             Not yet                   HIPAA capable
      Language             Java                      C++
  • 25. Open Cloud Testbed – Phase 1 (2008)
      - Networks shown: C-Wave, CENIC, Dragon, MREN
      - Phase 1: 4 racks, 120 nodes, 480 cores, 10+ Gb/s
      - Software: Hadoop, Sector/Sphere, Thrift, Eucalyptus
      - Each node in the testbed is a Dell 1435 computer with 12 GB memory, 1 TB disk, 2.0 GHz dual dual-core AMD Opteron 2212, with 1 Gb/s network interface cards.
  • 26. MalStone Benchmark
      - Benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing.
      - Code to generate the required synthetic data is available from code.google.com/p/malgen
      - Stylized analytic computation that is easy to implement in MapReduce and its generalizations.
  • 27. MalStone B
      [Figure: entities and sites plotted against time, with time windows d_{k-2}, d_{k-1}, d_k.]
  • 28. MalStone B Benchmark
                                   MalStone B
      Hadoop v0.18.3               799 min
      Hadoop Streaming v0.18.3     142 min
      Sector v1.19                 44 min
      # Nodes                      20 nodes
      # Records                    10 billion
      Size of dataset              1 TB
      These are preliminary results and we expect these results to change as we improve the implementations of MalStone B.
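The MalStone B run times on slide 28 can be turned into speedup ratios with a short calculation. The sketch below uses only the three run times reported on the slide; taking Sector as the baseline is a choice made for this example.

```python
# MalStone B run times in minutes, as reported on slide 28 (20 nodes, 1 TB).
runtimes_min = {
    "Hadoop v0.18.3": 799,
    "Hadoop Streaming v0.18.3": 142,
    "Sector v1.19": 44,
}

# Speedup of Sector relative to each system: other run time / Sector run time.
baseline = runtimes_min["Sector v1.19"]
speedups = {name: minutes / baseline for name, minutes in runtimes_min.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x the Sector run time")
```

On these preliminary numbers, Hadoop takes roughly 18 times as long as Sector and Hadoop Streaming roughly 3 times as long.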
  • 29. Terasort - Sector vs Hadoop Performance
                        LAN     MAN       WAN 1             WAN 2
      Number of cores   58      116       178               236
      Hadoop (secs)     2252    2617      3069              3702
      Sector (secs)     1265    1301      1430              1526
      Locations         UIC     UIC, SL   UIC, SL, Calit2   UIC, SL, Calit2, JHU
      All times in seconds.
  • 30. With Sector, “Wide Area Penalty” < 5%
      - Used the Open Cloud Testbed and wide area 10 Gb/s networks.
      - Ran a data intensive computing benchmark on 4 clusters distributed across the U.S. vs one cluster in Chicago.
      - Difference in performance less than 5% for Terasort.
      - One expects quite different results, depending upon the particular computation.
  • 31. Penalty for Wide Area Cloud Computing on Uncongested 10 Gb/s
                           28 local nodes   4 x 7 distributed nodes   Wide area "penalty"
      Hadoop, 3 replicas   8650             11600                     34%
      Hadoop, 1 replica    7300             9600                      31%
      Sector               4200             4400                      4.7%
      All times in seconds using the MalStone A benchmark on the Open Cloud Testbed.
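The wide area "penalty" column on slide 31 can be reproduced from the two run-time columns. The sketch below assumes the penalty is the relative increase in run time when moving from 28 local nodes to 4 x 7 distributed nodes; the slide does not state the formula, but this definition matches its percentages up to rounding or truncation.

```python
# Run times in seconds from slide 31: (28 local nodes, 4 x 7 distributed nodes).
runtimes_sec = {
    "Hadoop, 3 replicas": (8650, 11600),
    "Hadoop, 1 replica": (7300, 9600),
    "Sector": (4200, 4400),
}

def wide_area_penalty(local_sec: float, distributed_sec: float) -> float:
    """Relative slowdown of the distributed run versus the local run, in percent."""
    return 100.0 * (distributed_sec - local_sec) / local_sec

penalties = {name: wide_area_penalty(*times) for name, times in runtimes_sec.items()}

for name, pct in penalties.items():
    print(f"{name}: {pct:.1f}%")
```

The computed values are about 34.1%, 31.5% and 4.8%, against the 34%, 31% and 4.7% shown on the slide.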
  • 32. For More Information & To Obtain Sector
      - To obtain Sector or learn more about it: sector.sourceforge.net
      - To learn more about the Open Cloud Consortium: www.opencloudconsortium.org
      - For related work by Robert Grossman: blog.rgrossman.com, www.rgrossman.com
      - For related work by Yunhong Gu: www.lac.uic.edu/~yunhong