1. vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration
Kejiang Ye, Xiaohong Jiang, Yanzhang He, Xiang Li, Haiming Yan, Peng Huang
CCNT Lab, College of Computer Science
Zhejiang University, China
Cluster 2012 Workshop: PQoSCom’12
Sep. 28, 2012 Beijing, China
2. Outline
Motivations
vHadoop Platform
System Architecture & Flow
Platform Design & Implementation
Performance Analysis of vHadoop
Static Performance Analysis
Dynamic Performance Analysis
Parallel Machine Learning on vHadoop
MapReduce-based Clustering Algorithms
Clustering on “Synthetic Control Chart Time Series” Data Set
Visualizing Sample Clustering
Related Work
Conclusion & Future Work
4. Motivations
Big data processing is becoming increasingly important due to the continuous growth in the amount of data generated by fields such as particle physics, human genomics, and earth observation.
However, the efficiency of processing large-scale data on modern virtual infrastructure, especially on virtualized cloud computing infrastructure, is not clear.
5. Motivations
As cloud computing matures, big data processing on virtual infrastructure will become increasingly common:
Processing big data efficiently is a major challenge; the work must be executed in parallel on distributed platforms.
In the cloud era, resource virtualization is a typical feature, so most tasks will be executed on virtual infrastructure.
Virtualization also offers other benefits such as rapid startup, dynamic configuration, and high scalability.
Moving data to computing resources is more expensive than moving computing resources (such as VMs) to the data, due to the high overhead of transferring large amounts of data.
6. Contributions
Propose vHadoop, a scalable Hadoop virtual cluster platform for large-scale MapReduce-based parallel data processing with performance consideration.
Perform a series of experiments to investigate the static and dynamic performance of vHadoop.
Use the vHadoop platform to process several typical parallel clustering tasks, including Canopy, Dirichlet, Fuzzy k-Means, MeanShift, and MinHash, on two datasets.
8. vHadoop Platform
System Architecture & Flow
11. Performance Analysis of vHadoop
Experimental Configuration
Hadoop Virtual Cluster Configuration
Dell T710 server with 2 quad-core 64-bit Xeon processors and 32 GB DRAM.
CentOS 5.6 with kernel version 2.6.18-238.12.1.el5xen in Domain 0, and Xen 3.3.1 as the hypervisor.
VM (guest OS) with Ubuntu 8.10, 1 VCPU and 1024 MB vMemory.
Hadoop version 0.20.2.
Mahout version 0.6.
All VM images are stored on a separate NFS storage server.
MapReduce-based Benchmarks
Wordcount
MRBench
TeraSort
TestDFSIO
Live Migration Benchmark
Virt-LM [Huang et al., ICPE’11]
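To make the Wordcount benchmark listed above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps it exercises. It is plain Python rather than the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, wordcount) are illustrative assumptions, not names from the slides or from Hadoop.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def wordcount(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data on virtual clusters", "virtual clusters run big jobs"]
    print(wordcount(sample))
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network, which is why a job of this shape is sensitive to network I/O.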
12. Performance Analysis of vHadoop
Static Performance Analysis
Figure panels: Wordcount, MRBench, TeraSort, DFSIO
Network communication overheads become the main bottleneck in the cross-domain setting.
13. Performance Analysis of vHadoop
Dynamic Performance Analysis
Live migration of a Hadoop virtual cluster incurs some overheads, especially downtime.
15. Parallel Machine Learning on vHadoop
MapReduce-based Clustering Algorithms
Canopy Clustering is a very simple, fast and accurate method for grouping objects into clusters. Each object is represented as a point in a multidimensional feature space. Canopy Clustering is often used as an initial step in more rigorous clustering techniques, such as k-Means Clustering.
k-Means Clustering is a rather simple but well-known algorithm for grouping objects. Each object must be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) to identify.
Fuzzy k-Means Clustering is an extension of k-Means, the popular simple clustering technique. While k-Means discovers hard clusters (a point belongs to only one cluster), Fuzzy k-Means is a more statistically formalized method and discovers soft clusters, where a particular point can belong to more than one cluster with a certain probability.
Mean Shift Clustering produces arbitrarily shaped clusters depending on the topology of the data, without a priori knowledge of the number of clusters (as required in k-Means).
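The use of Canopy Clustering as an initial step for k-Means can be sketched as follows. This is a small in-memory illustration, not the Mahout implementation used on the slides. The loose and tight distance thresholds t1 and t2 follow the usual description of canopy clustering and are assumptions here, as are the function names and the sample points.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def canopy_centers(points, t1, t2):
    """Build canopies with a loose threshold t1 and a tight threshold t2 (t1 > t2).

    Returns a list of (center, members) pairs. Points closer than t2 to a
    center are removed from the candidate list and cannot seed a new canopy.
    """
    assert t1 > t2, "the loose threshold must exceed the tight threshold"
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [p for p in remaining if dist(center, p) < t1]
        canopies.append((center, members))
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return canopies


def assign(points, centers):
    """Hard assignment: each point goes to exactly one nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centers, iterations=10):
    """Plain k-means; k is the number of initial centers supplied."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = assign(points, centers)
        centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, assign(points, centers)


if __name__ == "__main__":
    pts = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
    seeds = [center for center, _ in canopy_centers(pts, t1=3.0, t2=1.5)]
    centers, clusters = kmeans(pts, seeds)
    print(len(seeds), centers)
```

Seeding k-means from the canopy centers means k does not have to be guessed up front: the number of canopies found by the cheap first pass determines it.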
16. Parallel Machine Learning on vHadoop
Clustering on “Synthetic Control Chart Time Series” Data Set
1 namenode + 1 datanode
18. Parallel Machine Learning on vHadoop
Visualizing Sample Clustering
Figure panels: Sample Data, Canopy, Dirichlet, k-Means, Fuzzy k-Means, MeanShift
20. Related Work
Virtualization technology
Performance characterization of virtualization, including performance evaluation [Cherkasova et al., USENIX’05; Ye et al., IJNAM’12], performance modeling [Tickoo et al., SIGMETRICS’10; Kundu et al., HPCA’10; Ye et al., HPCC’10], and performance optimization [Menon et al., USENIX’06; Ongaro et al., VEE’08].
Server consolidation [Apparao et al., VEE’08], live migration [Voorsluys et al., CloudCom’09]
MapReduce technology
Performance of Hadoop [Kambatla et al., HotCloud’09]
MapReduce in VM [Ibrahim et al., ICPP’11; Zaharia et al., USENIX’08]
However, these works did not address dynamic performance, i.e. live migration of a Hadoop virtual cluster. Further, they did not address the problem of parallel machine learning on a Hadoop virtual cluster, which is becoming increasingly important for big data.
22. Conclusion
We proposed vHadoop, a scalable Hadoop virtual cluster platform for parallel machine learning with performance consideration.
We investigated both the static and dynamic performance of vHadoop.
We also verified the performance and efficiency of running MapReduce-based parallel machine learning applications on the vHadoop platform.
23. Conclusion
Experimental results show that:
Network I/O and NFS disk I/O are the two main bottlenecks of the vHadoop platform, due to shared resource contention and interference. The poor I/O performance of the virtualization system and the heavy network communication in Hadoop make the network the main performance bottleneck.
Performance degrades as the data size or cluster scale increases. The cross-domain distribution of the Hadoop virtual cluster also affects the communication performance of vHadoop.
vHadoop can perform live migration of a Hadoop virtual cluster successfully. Although the service is unavailable during the downtime, the Hadoop fault tolerance mechanism will re-run the job or restore from other available backup data.
The vHadoop platform is efficient enough to run the MapReduce-based parallel machine learning algorithms on rea…
24. Future Work
Integrate the vHadoop platform into an open source cloud computing system to provide a scalable on-demand computation service for processing data-intensive (or big data) applications with parallel machine learning algorithms.