vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Kejiang Ye, Xiaohong Jiang, Yanzhang He, Xiang Li, Haiming Yan, Peng Huang
             CCNT Lab, College of Computer Science
                   Zhejiang University, China



                        Cluster 2012 Workshop: PQoSCom’12
                              Sep. 28, 2012 Beijing, China
Outline
   - Motivations
   - vHadoop Platform
       - System Architecture & Flow
       - Platform Design & Implementation
   - Performance Analysis of vHadoop
       - Static Performance Analysis
       - Dynamic Performance Analysis
   - Parallel Machine Learning on vHadoop
       - MapReduce-based Clustering Algorithms
       - Clustering on “Synthetic Control Chart Time Series” Data Set
       - Visualizing Sample Clustering
   - Related Work
   - Conclusion & Future Work
Motivations
 - Big data processing is becoming increasingly important because of the continuous growth in the amount of data generated by fields such as particle physics, human genomics, and earth observation.
 - However, the efficiency of processing large-scale data on modern virtual infrastructure, especially virtualized cloud computing infrastructure, is not clear.
Motivations
   - As cloud computing matures, big data processing on virtual infrastructure will become more and more common:
       - Processing big data efficiently is a major challenge and requires parallel execution on distributed platforms.
       - In the cloud era, resource virtualization is a typical feature: most tasks will be executed on virtual infrastructure.
       - Virtualization brings further benefits such as rapid startup, dynamic configuration, and high scalability.
       - Moving data to computing resources is more expensive than moving computing resources (such as VMs) to the data, because transferring large amounts of data carries high overheads.
Contributions
 - Propose vHadoop, a scalable Hadoop virtual cluster platform for large-scale MapReduce-based parallel data processing with performance consideration.
 - Perform a series of experiments to investigate the static and dynamic performance of vHadoop.
 - Use the vHadoop platform to run several typical parallel clustering tasks, including Canopy, Dirichlet, Fuzzy k-Means, MeanShift, and MinHash, on two datasets.
vHadoop Platform
   - System Architecture & Flow
vHadoop Platform
   - Platform Design & Implementation
       - Virtualization Module
       - Hadoop Module
       - Machine Learning Algorithm Library
       - Nmon Monitor
       - MapReduce Tuner
Performance Analysis of vHadoop
   - Experimental Configuration
       - Hadoop Virtual Cluster Configuration
            - Dell T710 server with two quad-core 64-bit Xeon processors and 32 GB DRAM.
            - CentOS 5.6 with kernel version 2.6.18-238.12.1.el5xen in Domain 0, and Xen 3.3.1 as the hypervisor.
            - Each VM (guest OS) runs Ubuntu 8.10 with 1 VCPU and 1024 MB of virtual memory.
            - Hadoop version 0.20.2.
            - Mahout version 0.6.
            - All VM images are stored on a separate NFS storage server.
       - MapReduce-based Benchmarks
            - Wordcount
            - MRBench
            - TeraSort
            - TestDFSIO
       - Live Migration Benchmark
            - Virt-LM [Huang et al., ICPE’11]
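To make the Wordcount benchmark listed above concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is an illustration only, not the Hadoop 0.20.2 example job used in the experiments; the function names (`map_phase`, `shuffle`, `reduce_phase`, `wordcount`) are hypothetical and chosen for this sketch.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, mimicking what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def wordcount(lines):
    """Run map, shuffle, and reduce sequentially over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(wordcount(["big data on virtual clusters", "big clusters"]))
```

On a real cluster the map and reduce calls run on different nodes, so the shuffle step is where intermediate data crosses the network.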
Performance Analysis of vHadoop
   - Static Performance Analysis
       - [Figure panels: Wordcount, MRBench, TeraSort, DFSIO]
       - Network communication overheads become the main bottleneck under cross-domain distribution.
Performance Analysis of vHadoop
   - Dynamic Performance Analysis
       - Live migration of a Hadoop virtual cluster incurs some overheads, especially downtime.
Parallel Machine Learning on vHadoop
   - MapReduce-based Clustering Algorithms
       - Canopy Clustering is a very simple, fast, and accurate method for grouping objects into clusters. Each object is represented as a point in a multidimensional feature space. Canopy Clustering is often used as an initial step for more rigorous clustering techniques, such as k-Means Clustering.
       - k-Means Clustering is a simple but well-known algorithm for grouping objects. Each object must be represented as a set of numerical features, and the user has to specify the number of groups (referred to as k) to identify.
       - Fuzzy k-Means Clustering is an extension of k-Means, the popular simple clustering technique. While k-Means discovers hard clusters (a point belongs to only one cluster), Fuzzy k-Means is a more statistically formalized method and discovers soft clusters, where a particular point can belong to more than one cluster with a certain probability.
       - Mean Shift Clustering produces arbitrarily shaped clusters that depend on the topology of the data, without a priori knowledge of the number of clusters (as required in k-Means).
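As a rough illustration of why Canopy Clustering is cheap enough to serve as an initial step, here is a minimal sequential sketch of the commonly described two-threshold procedure. It is an assumption-laden sketch, not Mahout's MapReduce implementation: the thresholds `t1` (loose) and `t2` (tight), the Euclidean distance, and the function names are all chosen for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clustering(points, t1, t2, distance=euclidean):
    """Return a list of (center, members) canopies.

    t1 is the loose threshold and t2 the tight one, with t1 > t2.
    Points within t1 of a center join its canopy; points within t2
    are additionally removed from the pool of future candidates.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # take the next unclaimed point as a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)  # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.0)]
    for center, members in canopy_clustering(data, t1=2.0, t2=1.0):
        print(center, members)
```

The resulting canopy centers can then be handed to k-Means as starting centroids, which also gives a data-driven choice of k.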

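The difference between hard clusters (k-Means) and soft clusters (Fuzzy k-Means) can be sketched for a single point as follows. The membership formula is the standard fuzzy c-means one with fuzziness parameter `m`; treating it as what the Mahout 0.6 job computes internally is an assumption, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_assignment(point, centers):
    """k-Means style: index of the single nearest center."""
    return min(range(len(centers)), key=lambda j: euclidean(point, centers[j]))


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy k-Means style: one membership degree per center, summing to 1.

    Uses u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), where d_j is the
    distance from the point to center j and m > 1 controls the fuzziness.
    """
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    dists = [euclidean(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    zeros = [j for j, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]


if __name__ == "__main__":
    centers = [(0.0, 0.0), (4.0, 0.0)]
    point = (1.0, 0.0)
    print(hard_assignment(point, centers))
    print(fuzzy_memberships(point, centers))
```

With `m` close to 1 the memberships approach the hard assignment; larger values spread a point more evenly across clusters.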
Parallel Machine Learning on vHadoop
   - Clustering on “Synthetic Control Chart Time Series” Data Set
       - 1 namenode + 1 datanode
Parallel Machine Learning on vHadoop
   - Visualizing Sample Clustering
       - [Figure panels: Canopy, Dirichlet, Fuzzy k-Means, k-Means, MeanShift, MinHash]
Parallel Machine Learning on vHadoop
   - Visualizing Sample Clustering
       - [Figure panels: Sample Data, Canopy, Dirichlet, Fuzzy k-Means, k-Means, MeanShift]
Related Work
     - Virtualization technology
          - Performance characterization of virtualization, including performance evaluation [Cherkasova et al., USENIX’05; Ye et al., IJNAM’12], performance modeling [Tickoo et al., SIGMETRICS’10; Kundu et al., HPCA’10; Ye et al., HPCC’10], and performance optimization [Menon et al., USENIX’06; Ongaro et al., VEE’08].
          - Server consolidation [Apparao et al., VEE’08] and live migration [Voorsluys et al., CloudCom’09].
     - MapReduce technology
          - Performance of Hadoop [Kambatla et al., HotCloud’09]
          - MapReduce in VMs [Ibrahim et al., ICPP’11; Zaharia et al., USENIX’08]
However, these works did not address dynamic performance, i.e. live migration of a Hadoop virtual cluster. Nor did they address the problem of parallel machine learning on a Hadoop virtual cluster, which is becoming increasingly important for big data.
Conclusion
 - We proposed vHadoop, a scalable Hadoop virtual cluster platform for parallel machine learning with performance consideration.
 - We investigated both the static and dynamic performance of vHadoop.
 - We also verified the performance and efficiency of running MapReduce-based parallel machine learning applications on the vHadoop platform.
Conclusion
   - Experimental results show that:
       - Network I/O and NFS disk I/O are the two main bottlenecks of the vHadoop platform, owing to shared resource contention and interference. The poor I/O performance of virtualized systems, combined with the heavy network communication in Hadoop, makes the network the main performance bottleneck.
       - Performance degrades as the data size or cluster scale increases. The cross-domain distribution of a Hadoop virtual cluster also affects the communication performance of vHadoop.
       - vHadoop can live-migrate a Hadoop virtual cluster successfully. Although the service is unavailable during the downtime, the Hadoop fault tolerance mechanism will re-run the job or restore from other available backup data.
       - The vHadoop platform is efficient enough to run the MapReduce-based parallel machine learning algorithms on rea…
Future Work
   - Integrate the vHadoop platform into an open source cloud computing system to provide a scalable, on-demand computation service for processing data-intensive (big data) applications with parallel machine learning algorithms.
Q&A
Thank you!



