Thoughts about how to use CloudStack to develop a Big Data Solution in your data center.
Cloud as a virtual machine infrastructure and Big Data are converging as two key technical evolution in the data center. Virtualization enables multi-tenancy, heterogenous Operating systems and added security via isolation. Clouds like AWS EC2, Rackspace, or google GCE are good examples. Big Data tackles the challenge of the increase scale (amount) and complexity (type) of data faced in the enterprise. While Compute cloud show a departure from traditional hardware provisioning and configuration management via virtualization, big data is a departure from traditional relational databases and file systems. These two technical evolutions have been triggered by the new workloads of the internet (search, streaming) and the scale needed to server millions of users and millions/billions of objects to store or serve.
In this talk we show how CloudStack and its support for bare-metal provisioning is compatible with a public cloud. CloudStack being a data center orchestrator that can tackle both traditional enterprise workloads and internet scale/type workloads. Multiple zones can be created for compute cloud or big data. Big data can used as backend store to the compute cloud or as zone type to enabled big data workload on the bare metal hardware.
This hybrid mode of operation is seen as the next evolution of clouds and positions a data center orchestrator has more than a VM management system and a solution to big data management as well.
5. What is Big Data ?
⢠Large scale datasets
â From scientific instruments
â From Web apps logs
â From Health recordsâŚ
⢠Complex datasets
â Not necessarily large.
â E.g Unstructured data
â E.g Natural Language
â E.g IBM Watson
6. A natural
evolution
⢠From traditional file systems and databases
⢠To large scale object store and nosql
movement designed to handle massive scale
and concurrency
7. BigData
and map-reduce
⢠While BigData is often associated with HDFS,
Map-Reduce is the algorithm used to
parallelize data processing.
⢠BigData â Map-Reduce â HDFS
⢠Map-reduce is a way to express
embarrassingly parallel work easily.
⢠You can do Map-Reduce without HDFS.
⢠E.g Basho map-reduce on riackCS
10. IaaS is really:
⢠A Data Center Orchestrator
â Data storage
â Data movement
â Data processing
⢠That can:
â Handle failures
â Support large scale
â Be programmed
11. What is CloudStack ?
⢠Open source Infrastructure as a Service (IaaS)
solution.
⢠âProgrammableâ Data Center orchestrator
⢠Hypervisor agnostic (with addition of bare
metal provisioning)
⢠Support scalable storage (Ceph, RIAK CSâŚ)
⢠Support complex enterprise networking (e.g
Firewall, load balancer, VPN, VPCâŚ)
⢠Multi-tenant
12. A bit of History
⢠Original company VMOPs (2008)
â Founded by Sheng Liang former lead dev on JVM
⢠Open source (GPLv3) as CloudStack
⢠Acquired by Citrix (July 2011)
⢠Relicensed under ASL v2 April 3, 2012
⢠Accepted as Apache Incubating Project April
16, 2012
⢠First Apache (ACS 4.0) released november
2012
⢠Top Level Project Since March 2013.
13. Why ASF ?
⢠Open Sourced CloudStack to:
â Build a community
â Facilitate the building of an ecosystem
â Faster time to market
⢠ASF highly recognized OSS foundation.
⢠ASF clear processes
⢠Individual contributions, companies have no
standing
16. Multiple
ContributorsSungard: Announced last
week that 6 developers
were joining the Apache
project
Schuberg Philis: Big
contribution in
building/packaging and
Nicira support
Go Daddy: Maven
building
Caringo: Support for own
object store
Basho: Support for
RiackCS
22. CloudStack and BigData
⢠Apache CloudStack is a data center
orchestrator
⢠BigData solutions as storage backends for
image catalogue and large scale instance
storage.
⢠BigData solutions as workloads to CloudStack
based clouds.
23. Storage
⢠Primary Storage:
â Anything that can be mounted on the node of a
cluster.
â Cluster LVM, iSCSI, NFS, Ceph
â Holds disk images of running VMs and user block
stores.
⢠Secondary Storage:
â Available across the zone
â Holds snapshots and templates (image repo)
â Can use multiple object stores (Gluster , Ceph,
riackCS, Swift, Caringo )
24. Big Data
and CloudStack
⢠âBig Dataâ solutions can be used as
secondary storage (OpenStack swift, Caringo,
CephFS, Gluster FS, RiackCSâŚ).
⢠Used to deploy a large scale storage backend
to manage user images, and user data
volumes.
⢠Primary intent is not to use it inside the VMs
for data processing.
25. CloudStack
and Baremetal
⢠CS supports baremetal provisioning.
⢠This opens the door to multiple scenarios for
Big-Data store, Clouds
â Provision Hadoop cluster on baremetal
â Operate âHybridâ cloud: part Hypervisor for VM
provisioning, part baremetal for data store.
â Reconfigure entire cloud on-demand
27. âBare Metalâ Hybrid
deployment
⢠Set of hypervisors, stand-alone secondary
storage, bare metal cluster with specialized
hardware or software.
⢠Access Big-Data store from VM guests
28. âBare metalâ cluster as secondary
storage
⢠Use bare-metal provisioning to manage
larges-scale secondary storage
29. âPureâ Big-
Data store⢠Use CS as a traditional data center
provisioning system and build a Big-Data store
on-demand
30. Combinations
⢠CloudStack offers the possibility to switch
between these modes on-demand
⢠An elastic reconfigurable cloud
⢠Just be careful not to override your data ď
31. Big Data as a Workload to the
Cloud
tools and demoâŚ
32. Apache
Whirr
⢠Big Data Provisioning
tool
⢠Deploys Hadoop, cdh,
Hbase, Yarn, etc in the
Cloud
⢠Use jclouds
⢠Works with multiple
cloud providers
including CloudStack
37. Others: Pallet
⢠Clojure based
provisioning tool
⢠Provisions Hadoop
clusters in the cloud.
⢠Equivalent to Whirr but
in clojure
38. CloStack
⢠Clojure client for
CloudStack
⢠Uses native CloudStack
API
⢠Developed by @pyr at
exoscale.ch , a
CloudStack based public
cloud providers
40. On-Going Big-Data
development
⢠Hadoop being an Apache project written in
Java, there is great potential synergy between
CloudStack and Hadoop:
e.g Develop Elastic Map-Reduce mechanisms to
provide map-reduce processing in CS backed by
HDFS. Implementation of AWS EMR API.
⢠Integration of Basho map-reduce (coming in
4.2 release)
41. GSoC
⢠ASF is a mentoring organization for GSoC
⢠CloudStack has several proposals under
consideration
â Improved CloudStack support in Apache Whirr and
Provisionr
â Integration of Apache Mesos with CloudStack
42. Info
⢠Apache Top Level project
⢠http://www.cloudstack.org
⢠#cloudstack on irc.freenode.net
⢠@cloudstack on Twitter
⢠http://www.slideshare.net/cloudstack
⢠http://cloudstack.apache.org/mailing-lists.html
Welcoming contributions and feedback, Join the fun !