ClusterW 2012 vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Kejiang Ye, Xiaohong Jiang, Yanzhang He, Xiang Li, Haiming Yan, Peng Huang
CCNT Lab, College of Computer Science
Zhejiang University, China

Cluster 2012 Workshop: PQoSCom’12
Sep. 28, 2012 Beijing, China

Outline
 Motivations
 vHadoop Platform
 System Architecture & Flow
 Platform Design & Implementation
 Performance Analysis of vHadoop
 Static Performance Analysis
 Dynamic Performance Analysis
 Parallel Machine Learning on vHadoop
 MapReduce-based Clustering Algorithms
 Clustering on “Synthetic Control Chart Time Series” Data Set
 Visualizing Sample Clustering
 Related Work
 Conclusion & Future Work

Motivations
 Big data processing is becoming increasingly important due to the continuous growth of data generated in fields such as particle physics, human genomics, and earth observation.
 However, the efficiency of processing large-scale data on modern virtual infrastructure, especially virtualized cloud computing infrastructure, is not clear.

Motivations
 As cloud computing matures, big data processing on virtual infrastructure will become more and more common:
 Processing big data with high efficiency is a big challenge, and it needs to be executed in parallel on distributed platforms.
 In the cloud era, resource virtualization is a typical feature: most tasks will be executed on virtual infrastructure.
 Virtualization holds many other benefits, such as rapid startup, dynamic configuration, and high scalability.
 Moving data to computing resources is more expensive than moving computing resources (such as VMs) to data, due to the high overhead of transferring large amounts of data.

Contributions
 Propose vHadoop, a scalable Hadoop virtual cluster platform for large-scale MapReduce-based parallel data processing with performance consideration.
 Perform a series of experiments to investigate the static and dynamic performance of vHadoop.
 Use the vHadoop platform to process several typical parallel clustering tasks, including Canopy, Dirichlet, Fuzzy k-Means, MeanShift, and MinHash, on two datasets.

vHadoop Platform
 System Architecture & Flow


vHadoop Platform
 Platform Design & Implementation
 Virtualization Module
 Hadoop Module
 Machine Learning Algorithm Library
 Nmon Monitor
 MapReduce Tuner


Performance Analysis of vHadoop
 Experimental Configuration
 Hadoop Virtual Cluster Configuration
 Dell T710 Server, with 2 quad-core 64-bit Xeon processors and 32 GB DRAM.
 CentOS 5.6 with kernel version 2.6.18-238.12.1.el5xen in Domain 0, and Xen 3.3.1 as the hypervisor.
 VM (Guest OS) with Ubuntu 8.10, 1 VCPU & 1024 MB vMemory.
 Hadoop version is 0.20.2.
 Mahout version is 0.6.
 All the VM images are stored on a separate NFS storage server.
 MapReduce-based Benchmarks
 Wordcount
 MRBench
 TeraSort
 TestDFSIO
 Live Migration Benchmark
 Virt-LM [Huang et al., ICPE’11]
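
To make the benchmark list more concrete, the following is a minimal pure-Python sketch of the map, shuffle, and reduce steps that a Wordcount job performs. It is an illustration only: it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `wordcount`) are hypothetical names chosen for this sketch, not part of vHadoop or Hadoop.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def wordcount(lines):
    """Run the three phases in-process on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(wordcount(["big data on virtual clusters", "big clusters"]))
```

On a real cluster the mappers and reducers run on different nodes, so the shuffle step turns into network traffic between the virtual machines.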

Performance Analysis of vHadoop
 Static Performance Analysis
Figure panels: Wordcount, MRBench, TeraSort, DFSIO

Network communication overheads become the main bottleneck in the cross-domain case.

Performance Analysis of vHadoop
 Dynamic Performance Analysis

Live migration of a Hadoop virtual cluster incurs some overheads, especially the downtime.


Parallel Machine Learning on vHadoop
 MapReduce-based Clustering Algorithms
 Canopy Clustering is a very simple, fast and accurate method for grouping objects into clusters. All objects are represented as points in a multidimensional feature space. Canopy Clustering is often used as an initial step in more rigorous clustering techniques, such as k-Means Clustering.
 k-Means Clustering is a rather simple but well-known algorithm for grouping objects. All objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) they wish to identify.
 Fuzzy k-Means Clustering is an extension of k-Means, the popular simple clustering technique. While k-Means discovers hard clusters (a point belongs to only one cluster), Fuzzy k-Means is a more statistically formalized method and discovers soft clusters, where a particular point can belong to more than one cluster with a certain probability.
 Mean Shift Clustering produces arbitrarily shaped clusters depending upon the topology of the data, without a priori knowledge of the number of clusters (as required in k-Means).
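
As a rough illustration of two of the algorithms described above, the sketch below shows a sequential Canopy pass and the Fuzzy k-Means membership computation for a single point. It is a single-machine sketch under stated assumptions, not the Mahout MapReduce implementation used on vHadoop: the thresholds `t1` and `t2`, the fuzziness parameter `m`, the Euclidean distance, and all function names are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2):
    """Sequential canopy pass with a loose threshold t1 and a tight threshold t2.

    Returns a list of (center, members) pairs. Points within t1 of a center
    join its canopy; points within t2 are no longer candidates for new centers.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def fuzzy_memberships(point, centers, m=2.0):
    """Soft membership of one point in each cluster (assumes m > 1); values sum to 1."""
    dists = [euclidean(point, c) for c in centers]
    if 0.0 in dists:
        # The point coincides with at least one center: share full membership there.
        zeros = dists.count(0.0)
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.0)]
    for center, members in canopy(data, t1=3.0, t2=1.0):
        print(center, members)
    print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)]))
```

The canopy centers can be passed to k-Means as initial centroids, which matches the "initial step" role described above. The membership values show how a soft clustering assigns one point to several clusters at once.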


Parallel Machine Learning on vHadoop
 Clustering on “Synthetic Control Chart Time Series” Data Set

1 namenode + 1 datanode


Parallel Machine Learning on vHadoop
 Visualizing Sample Clustering

Figure panels: Canopy, Dirichlet, Fuzzy k-Means, k-Means, MeanShift, MinHash


Parallel Machine Learning on vHadoop
 Visualizing Sample Clustering

Figure panels: Sample Data, Canopy, Dirichlet, Fuzzy k-Means, MeanShift, k-Means

Related Work
 Virtualization technology
 Performance characterization of virtualization, including performance evaluation [Cherkasova et al., USENIX’05; Ye et al., IJNAM’12], performance modeling [Tickoo et al., SIGMETRICS’10; Kundu et al., HPCA’10; Ye et al., HPCC’10], and performance optimization [Menon et al., USENIX’06; Ongaro et al., VEE’08].
 Server consolidation [Apparao et al., VEE’08], Live Migration [Voorsluys et al., CloudCom’09]
 MapReduce technology
 Performance of Hadoop [Kambatla et al., HotCloud’09]
 MapReduce in VM [Ibrahim et al., ICPP’11; Zaharia et al., USENIX’08]
However, these works do not address dynamic performance, i.e., live migration of a Hadoop virtual cluster. Nor do they address the problem of parallel machine learning on a Hadoop virtual cluster, which is becoming increasingly important in the big data era.

Conclusion
 We proposed vHadoop, a scalable Hadoop virtual cluster platform for parallel machine learning with performance consideration.
 We investigated both the static and dynamic performance of vHadoop.
 We also verified the performance and efficiency of running MapReduce-based parallel machine learning applications on the vHadoop platform.

Conclusion
 Experimental results show that
 The network I/O and NFS disk I/O are the two main bottlenecks of the vHadoop platform, due to shared resource contention and interference. The poor I/O performance of virtualization systems and the heavy network communication in Hadoop make the network the main performance bottleneck.
 Performance degrades when the data size or cluster scale increases. The cross-domain distribution of the Hadoop virtual cluster also affects the communication performance of vHadoop.
 vHadoop can perform live migration of a Hadoop virtual cluster successfully. Although the service is unavailable during the downtime, the Hadoop fault tolerance mechanism will re-run the job or restore from other available backup data.
 The vHadoop platform is efficient enough to run the MapReduce-based parallel machine learning algorithms on rea…
Future Work
 Integrate the vHadoop platform into an open source cloud computing system to provide a scalable, on-demand computation service for processing data-intensive (or big data) applications with parallel machine learning algorithms.


Q&A
Thank you!

