Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object Storage
1. Cloudian®
S3 Cloud Storage Platform
Case Study:
Implementing Hadoop and Elastic Map
Reduce on Scale-out Object Storage
Paul Turner
Cloudian Inc.
June 11th 2014
2. About Cloudian
• Hybrid cloud storage startup in Silicon Valley
– Strong venture backing: Goldman Sachs, Intel Capital
– Solid management with storage, big data, enterprise software and telco
expertise
– 50 employees, offices in Foster City, Japan and China
• Production-hardened product
• Target market: mid-size to large enterprises & regional service providers
• GTM: traditional storage distribution/VARs
CLOUDIAN PARTNERS
3. The Challenge
• Business problem = Analysis of log data from our
customer systems to improve support (classic
‘Internet of Things’ content)
• Existing system required transformation of the data
into HDFS for analytics (slow and costly)
Goal: Reduce cost and provide faster results
6/16/2014 3
4. Use Case: Support Analytics
• Abnormal Operations Analysis: compare system statistics and usage patterns to previous normal results
• End User Analysis to root-cause issues: identify all operations for a particular user and review patterns and any faults
• Trend Analysis for Capacity Planning and Traffic Patterns: build capacity and traffic trend lines based on statistical analysis of all traffic
100 tps S3 server = 83 million lines of info log = 3.5 GB/day
10-server system = 35 GB/day ~ 1 TB/month
100 customer systems => 1.2 PB annually
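The sizing figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a 30-day month and decimal units (1 TB = 1,000 GB); the function names are illustrative and do not come from the slides.

```python
# Per-server figure taken from the slide; everything else is derived from it.
GB_PER_SERVER_PER_DAY = 3.5
DAYS_PER_MONTH = 30      # assumption: the slide does not state the month length
MONTHS_PER_YEAR = 12


def system_gb_per_day(servers: int) -> float:
    """Daily log volume in GB for one customer system."""
    return servers * GB_PER_SERVER_PER_DAY


def system_tb_per_month(servers: int) -> float:
    """Monthly log volume in TB for one customer system."""
    return system_gb_per_day(servers) * DAYS_PER_MONTH / 1000


def fleet_pb_per_year(systems: int, servers_per_system: int) -> float:
    """Annual log volume in PB across all customer systems."""
    monthly_tb = systems * system_tb_per_month(servers_per_system)
    return monthly_tb * MONTHS_PER_YEAR / 1000


if __name__ == "__main__":
    print(system_gb_per_day(10), "GB/day per 10-server system")
    print(system_tb_per_month(10), "TB/month per 10-server system")
    print(fleet_pb_per_year(100, 10), "PB/year for 100 systems")
```

Under these assumptions a 10-server system produces about 1.05 TB per month and 100 systems about 1.26 PB per year; the slide's 1.2 PB follows from rounding the monthly figure down to 1 TB first.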
5. Traditional Big Data Flow
Event Processing
Platform
Big Data Storage Platform
Analytics Platform
Content Storage
Consumer Activity
(Events, GPS, WiFi)
Social Media
Device Tracking and Logs
(Event, Configuration, Usage, Performance)
Real Time
Events
Big Data
Result of analysis
6. Traditional Big Data Flow
Event Processing
Platform
Analytics Platform (HDFS)
Content Storage (Object, NAS)
• Wasted storage: content and analytics each keep their own copy of the data
• Transforming data into HDFS can be costly
• High overhead of HDFS (3-copy replication) for content which may
be of poor quality
Logs, Config
7. S3 and Hadoop
• Apache Hadoop has supported S3 since Jan 2008
– http://wiki.apache.org/hadoop/AmazonS3
• Well-proven by Amazon with Elastic MapReduce
• State-of-the-art and advancing quickly to provide
much easier Hadoop over S3 – e.g. Netflix Genie
– https://github.com/Netflix/genie
8. Cloudian Approach
Event Processing
Platform
Analytics
Cloudian HyperStore Storage
• No redundant storage of data
• HyperStore scales out with your data – adding nodes for I/O
• Analyze more - allows for efficient bulk data analysis in place
• Take advantage of multi-core CPUs – makes sense for MapReduce
• Can feed smarter data for subsequent analytic systems
• Faster time to decision
9. Cloudian Hadoop Configuration
• Hadoop 2.2
• Configured for native S3 file system (etc/hadoop/core-site.xml)
– S3N native file system for reading and writing regular files on S3. The
advantage of this file system is that you can access files on S3 that were
written with other tools. Conversely, other tools can access files written using
Hadoop.
• Configure Hadoop to use Cloudian (etc/hadoop/jets3t.properties)
– s3service.s3-endpoint=CLOUDIAN_ENDPOINT
– s3service.s3-endpoint-http-port=CLOUDIAN_PORT
Note: you can also dedicate a bucket for Hadoop analytics and then
Hadoop will chunk the content into blocks for storage – like HDFS
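The two configuration files named above can be sketched as follows. This is a hedged illustration, not the presenter's actual configuration: the bucket name, credentials, endpoint and port are placeholders; making s3n the default file system via `fs.defaultFS` is an assumption (the slide only says Hadoop was configured for the native S3 file system); and `s3service.https-only=false` is an assumption added because the slide sets an HTTP port. The `fs.s3n.*` and `s3service.*` keys are the standard Hadoop and JetS3t property names.

```python
import xml.etree.ElementTree as ET


def core_site_xml(bucket: str, access_key: str, secret_key: str) -> str:
    """Build the text of etc/hadoop/core-site.xml for the s3n file system."""
    conf = ET.Element("configuration")
    props = {
        # Assumption: s3n is made the default file system.
        "fs.defaultFS": f"s3n://{bucket}",
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
    }
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")


def jets3t_properties(endpoint: str, port: int) -> str:
    """Build the text of etc/hadoop/jets3t.properties for a non-AWS endpoint."""
    lines = [
        f"s3service.s3-endpoint={endpoint}",
        f"s3service.s3-endpoint-http-port={port}",
        # Assumption: plain HTTP is used, matching the http-port setting.
        "s3service.https-only=false",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(core_site_xml("support-logs", "ACCESS_KEY", "SECRET_KEY"))
    print(jets3t_properties("CLOUDIAN_ENDPOINT", 80))
```

With such a configuration in place, paths of the form `s3n://bucket/key` would be readable by Hadoop jobs and by other S3 tools alike, which is the interoperability the slide describes.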
10. Cloudian HyperStore® Software
S3
NFS
Scalable peer-to-peer architecture
Multi-data center replication
Multi-Tenancy and Chargeback
Hybrid cloud-ready (any S3 cloud)
100s of supported applications
Optimized for any workload
Storage for OpenStack & CloudStack
11. Elastic, Distributed and Reliable
NOSQL database distributes
and replicates data
Logical Ring
Data is
automatically
replicated to
multiple nodes.
Location of data can be
designated, for instance, to
multiple datacenters and
per rack.
DC1
DC2
In theory, the number of nodes in
a logical ring can be up
to 2^127 (almost infinite).
Data load can be
rebalanced when a node is
added or removed.
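The logical ring described on this slide can be illustrated with a small consistent-hashing sketch. It is a simplified model and not Cloudian's implementation: it assumes one token per node, an MD5-based token in a space of 2^127 positions, and replicas placed on the next nodes clockwise. The class and node names are hypothetical, and datacenter or rack awareness is left out.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # size of the ring's token space


def token(key: str) -> int:
    """Map a node name or object key to a position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


class Ring:
    def __init__(self, replicas: int = 3):
        self.replicas = replicas
        self._tokens = []   # sorted ring positions
        self._owners = {}   # ring position -> node name

    def add_node(self, name: str) -> None:
        t = token(name)
        if t in self._owners:
            raise ValueError(f"token collision for {name}")
        bisect.insort(self._tokens, t)
        self._owners[t] = name

    def remove_node(self, name: str) -> None:
        t = token(name)
        self._tokens.remove(t)
        del self._owners[t]

    def nodes_for(self, key: str) -> list:
        """Return the nodes holding the replicas of `key`, walking clockwise."""
        if not self._tokens:
            return []
        start = bisect.bisect_left(self._tokens, token(key))
        count = min(self.replicas, len(self._tokens))
        size = len(self._tokens)
        return [self._owners[self._tokens[(start + i) % size]] for i in range(count)]


if __name__ == "__main__":
    ring = Ring(replicas=3)
    for node in ["node-a", "node-b", "node-c", "node-d"]:
        ring.add_node(node)
    print(ring.nodes_for("bucket/object-1"))
```

In this model, adding a node changes the replica set only for keys whose clockwise walk now passes through the new node's position, which is why rebalancing after a membership change moves only part of the data.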
12. Enhanced HyperStore® Technology
• Policies tailored for different
object types
• Optimized for all data
• Chunking for better
performance
• Erasure Coding for deep
archive efficiency
• Reliable storage across
multi-node failures
HyperStore
Patent Pending
Small Objects
Large Objects
Active Content
File System
NOSQL DB
Erasure Coding
Deep
Archives
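The storage-efficiency argument for erasure coding can be made concrete with a short calculation. This is a sketch under stated assumptions: the 4 data + 2 parity layout is a hypothetical example, since the slides do not say which scheme HyperStore uses, and the function names are illustrative.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte with full-copy replication."""
    return float(copies)


def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per logical byte with k data + m parity fragments."""
    return (data_fragments + parity_fragments) / data_fragments


def raw_tb_needed(logical_tb: float, overhead: float) -> float:
    """Raw capacity needed to hold `logical_tb` of user data."""
    return logical_tb * overhead


if __name__ == "__main__":
    logical_tb = 100.0
    # Three full copies, as in HDFS-style replication.
    print(raw_tb_needed(logical_tb, replication_overhead(3)), "TB raw with 3 copies")
    # Hypothetical 4+2 erasure-coded layout.
    print(raw_tb_needed(logical_tb, erasure_overhead(4, 2)), "TB raw with 4+2 coding")
```

Both layouts in this example survive the loss of any two copies or fragments, but the erasure-coded one needs half the raw capacity, which is the trade-off that makes it attractive for deep archives.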
13. Cloudian Complete S3 API
• Core REST API – Get, Put, Post, Head, Delete
• Multi-part uploads: Allows uploading large objects
in multiple parts
• Versioning: Multiple versions of same object
• Bucket Lifecycle: Auto-expiration using rules
• Server side encryption: Managed by Cloudian
• Location Constraint: Assign data to specific region
(e.g. for HIPAA compliance)
• Bucket Website: Create buckets as websites to
host web content
• Access control lists (ACLs) define access rights to
bucket and object
• And more...
Cloudian Complete S3 API
Products S3 API
Cloudian
AmpliData
Basho
Caringo
Cleversafe
EMC Atmos
NetApp Bycast
Scality
OpenStack Swift
14. Seamless tiering to Amazon S3, Glacier and other S3 Service Providers
• Cloudian deployed as On-Premises
S3 cloud behind the firewall
• Automatically migrates data to AWS
using Bucket Lifecycle Policies
– Optional migration to Glacier
– Metadata maintained for
search/list of objects
• Configurable to reduce
overhead
• Reads/writes to migrated objects: restore by default, with an
option to redirect to AWS or the S3 service provider
On-Premises S3
S3
Client/Application
Content migrated
or restored via
Bucket Lifecycle
Policies
Option to redirect
migrated content
Amazon S3
Firewall
Amazon Glacier
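A bucket lifecycle policy of the kind mentioned on this slide can be sketched as an XML document. The example below builds a rule in the format of the publicly documented Amazon S3 bucket lifecycle request body; whether Cloudian's tiering to a remote S3 provider uses exactly this body is not stated in the slides. The rule ID, prefix and day counts are hypothetical.

```python
import xml.etree.ElementTree as ET


def lifecycle_rule_xml(rule_id, prefix, transition_days,
                       storage_class="GLACIER", expire_days=None):
    """Build an S3-style LifecycleConfiguration document with a single rule."""
    config = ET.Element("LifecycleConfiguration")
    rule = ET.SubElement(config, "Rule")
    ET.SubElement(rule, "ID").text = rule_id
    ET.SubElement(rule, "Prefix").text = prefix
    ET.SubElement(rule, "Status").text = "Enabled"
    # Move matching objects to another storage class after N days.
    transition = ET.SubElement(rule, "Transition")
    ET.SubElement(transition, "Days").text = str(transition_days)
    ET.SubElement(transition, "StorageClass").text = storage_class
    # Optionally delete matching objects after M days (auto-expiration).
    if expire_days is not None:
        expiration = ET.SubElement(rule, "Expiration")
        ET.SubElement(expiration, "Days").text = str(expire_days)
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical policy: tier logs after 30 days, expire them after a year.
    print(lifecycle_rule_xml("tier-old-logs", "logs/", 30, expire_days=365))
```

A document like this would be sent with a PUT request to the bucket's `?lifecycle` sub-resource; after that, migration and expiration happen without any change to the client application.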
15. Big Data Storage Platform
Event Processing Platform Big Data Storage Platform
Input I/F Recommend
CEP Engine
Filter Judge Aggregate
Real Time Analysis
Big Data Analysis
Analyze Recommend
Data Analysis and Storage Platform
Content Storage
Consumer Activity
(Events, GPS, WiFi)
Social media
Business Tracking
(goods, inventory, campaign, sales)
Smarter
Business
16. Future Work
• Delivery of Cloudian Hadoop-ready
object storage (2H CY14)
• Integration with key Hadoop
distributions
• Locality awareness
• Potentially use new drive technology for
processing (e.g. HGST Ethernet drive)
• Find out more – Booth 139
17. Cloudian®
S3 Cloud Storage Platform
Thank You!
Questions?
www.cloudian.com
“The Leading Provider of Hybrid Cloud Storage”