2. Who am I?
http://www.mapr.com/company/events/speaking/devfest-dc-9-28-12
• Keys Botzum
• kbotzum@maprtech.com
• Senior Principal Technologist, MapR Technologies
• MapR Federal and Eastern Region
3. MapR’s Experience with Google Compute Engine
• Fast
– Virtualized public cloud
– Rivals on-premises physical hardware
• Easy
– Provision 1,000s of servers in minutes
• Cost effective
– Pay only for what you use
4. gcutil is your friend
• Command-line tool that runs on your client machines to manage your instances in your cloud
• Remarkably easy to use
– New server/instance: gcutil addinstance
– Connect to a server/instance: gcutil ssh
• Can create your own custom images using Google's tools
– Using custom images is as easy as addinstance --image <image name> (see the sketch below)
– MapR is creating custom images for MapR clusters
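A minimal sketch of those calls; the project, zone, image and instance names below are placeholders, and exact flag spellings vary by gcutil release:

    # Create an instance from a custom image
    gcutil --project=my-project addinstance mapr-node1 \
        --machine_type=n1-standard-4-d --zone=us-central1-a \
        --image=my-custom-mapr-image

    # Connect to it over SSH
    gcutil --project=my-project ssh mapr-node1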
5. MapReduce: A Paradigm Shift
• Distributed computing platform
• Large clusters
• Commodity hardware
• Pioneered at Google
• BigTable, MapReduce and Google File System
• Commercially available as Hadoop
6. MapR Technologies
• Open, enterprise-grade distribution for Hadoop
– Easy, dependable and fast
– Open source with standards-based extensions
• Hadoop
– Big data analytics
– Inspired by the MapReduce paper published by Google scientists Jeffrey Dean and Sanjay Ghemawat in 2004
• MapR is recognized as a technology leader
• MapR Hadoop Cloud Service now available on Google Compute Engine
8. MapR’s Complete Distribution for Apache Hadoop
• Integrated, tested, hardened and supported
• Integrated with Hive, Pig, Oozie, Sqoop, HBase, Whirr, Accumulo, Mahout, Cascading, Nagios, Ganglia, Flume and ZooKeeper
• Runs on commodity hardware
• Open source with standards-based extensions for:
– Security
– File-based access
– Most SQL-based access
– Easiest integration
– High availability
– Best performance
[Diagram: the MapR Control System (Heatmap™, LDAP/NIS integration, quotas, CLI, REST API, alerts, alarms) and the Hadoop ecosystem running on MapR’s Storage Services™ (Direct Access NFS, real-time streaming, volumes, snapshots, mirrors, data placement, no-NameNode architecture, high-performance direct shuffle, stateful failover and self-healing)]
9. Overview of Starting a Cluster
• Google’s gcutil is your friend
• Very easy tool for spinning up instances
• MapR is creating a tool and infrastructure to spin up a fully functional MapR cluster composed of many nodes
• ./mapr-start-cluster.sh --machine-type <…> --masters <#> --slaves <#>
• …wait a few minutes
• gcutil ssh <node running admin server> and set the admin password
• gcutil listinstances (to find your cluster’s IP addresses)
• …use the cluster; it’s fully functional
• ./mapr-stop-cluster.sh
• …billing for the cluster stops
* Note that this is not the final interface, but rather is representative of what will be released. Some details omitted for clarity. The session sketched below puts these steps together.
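End to end as a shell session; per the note above, the script name and flags are representative rather than final, and the node name is a placeholder:

    # Spin up a MapR cluster on Google Compute Engine
    ./mapr-start-cluster.sh --machine-type n1-standard-4-d --masters 3 --slaves 10

    # List the cluster's instances and their IP addresses
    gcutil listinstances

    # Connect to the node running the admin server and set the admin password
    gcutil ssh mapr-master-0

    # ... use the cluster ...

    # Tear the cluster down; billing stops
    ./mapr-stop-cluster.sh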
10. Demo
Let’s run a large sort
Run TeraSort on a 1250-node MapR Hadoop cluster on Google Compute Engine
(10 billion records, 1 TB of data)
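For reference, this is the shape of a TeraSort run using Hadoop's stock examples jar; the jar name and paths vary by distribution, so treat these as representative. TeraGen writes 100-byte records, which is why 10 billion records is 1 TB:

    # Generate 10 billion 100-byte records (1 TB)
    hadoop jar hadoop-examples.jar teragen 10000000000 /terasort/in

    # Sort them
    hadoop jar hadoop-examples.jar terasort /terasort/in /terasort/out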
11. How Does this Compare to the TeraSort Record?

                MapR on Google Compute Engine    Record on physical hardware
  Hardware      Virtual/Cloud                    Physical
  Cores         5024                             11680
  Disks         1256                             5840
  Servers       1256                             1460
  Time          1:20 min                         1:02 min
12. Deployment Comparison
• Current record: 1460 physical servers
– Prepare datacenter
– Rack and stack servers
– Maintain hardware
– Time to deploy: months
• Google Compute Engine: 1256 instances
– Invoke a gcutil command
– Time to deploy: minutes
13. Cost Comparison
• Current record: 1460 1U servers x $4K/server = $5,840,000
• Google Compute Engine: 1256 n1-standard-4-d instances x $0.58/instance-hour x 80 seconds = $16 ($728/hour for the full cluster)
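The GCE figure follows directly from the instance count and hourly rate; a quick sanity check of the arithmetic:

    # 1256 instances x $0.58/instance-hour = $728.48/hour for the whole cluster
    echo '1256 * 0.58' | bc -l

    # 80 seconds at that rate is about $16
    echo '1256 * 0.58 * 80 / 3600' | bc -l    # => ~16.19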
14. Easy Management at Scale
• Health Monitoring
• Cluster Administration
• Application Resource Provisioning
15. Direct Access NFS™
[Diagram: applications and file browsers mount the cluster and access data directly over NFS]
• Standard Linux commands and tools (grep, sed, sort, tar) work directly on cluster data
• “Drag & Drop” from file browsers
• Random read and random write
• Applications can log directly to the cluster
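What direct access looks like in practice; a sketch assuming a MapR NFS gateway on a node called mapr-node1 and the default /mapr export (the host, cluster name and paths are placeholders):

    # Mount the cluster over NFS
    sudo mkdir -p /mapr
    sudo mount -o nolock mapr-node1:/mapr /mapr

    # Ordinary Linux tools now work directly on cluster data
    grep ERROR /mapr/my.cluster.com/logs/webapp.log
    tar czf /mapr/my.cluster.com/backups/logs.tgz /var/log/myapp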
16. Multi-tenancy
§ Consider a large cluster with lots of storage and numerous jobs supporting multiple organizations
§ Volumes (see the sketch below)
  § Control storage usage: quotas on volumes, and quotas on cluster storage by user or group
  § Control data placement: ensure that data is stored in the locations you want
  § Control mirroring and snapshotting
§ Job management
  § Control where jobs run: ensure that jobs run where you want
  § Historical view of metrics collected from jobs, to ease troubleshooting of job issues
§ Security/Protection
  § Fine-grained permissions on volume and cluster management, including delegation
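As an illustration of volume-level control, a sketch using MapR's maprcli; the flags follow the maprcli volume commands, but the names, sizes and topology here are placeholders, so check the docs for your release:

    # Create a tenant volume with hard and advisory quotas,
    # pinned to a specific rack topology
    maprcli volume create -name project-a -path /project-a \
        -quota 10T -advisoryquota 8T -topology /data/rack1

    # Take a point-in-time snapshot of the volume
    maprcli volume snapshot create -volume project-a -snapshotname nightly-2012-09-28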
17. MapR: Lights Out Data Center Ready
Reliable Compute:
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• 99999’s of uptime
Dependable Storage:
§ Business continuity with snapshots and mirrors
§ Recover to a point in time
§ End-to-end checksumming
§ Strong consistency
§ Built-in compression
§ Mirror across sites to meet Recovery Time Objectives
18. MapR Mirroring/COOP Requirements
[Diagram: production datacenters mirror volumes over the WAN to a second datacenter, a research cluster and the cloud (Google Compute Engine) for business continuity]
Efficient design:
§ Differential deltas are updated
§ Compressed and check-summed
Easy to manage:
§ Scheduled or on-demand
§ WAN, remote seeding
§ Consistent point-in-time
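A hedged sketch of that workflow with maprcli; the volume and cluster names are placeholders, and flag details may differ by release:

    # Create a local mirror of a volume that lives on the production cluster
    maprcli volume create -name project-a-mirror -path /mirrors/project-a \
        -type mirror -source project-a@prod-cluster

    # Sync on demand; only compressed, check-summed differential deltas are shipped
    maprcli volume mirror start -name project-a-mirror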
19. MapR Drives Hardware Performance
[Chart: % performance vs. Apache/CDH (0-450%) on typical Hadoop commodity hardware, rising with configuration:
• 400 MB/s: <6 drives, 1 NIC, 6 cores, 24 GB DRAM
• 1200 MB/s: 12 x 5400 RPM drives, >1 NIC or 10GbE, 8 cores, 32 GB DRAM
• 1800 MB/s: 12 x 7200 RPM drives, >1 NIC or 10GbE, 12 cores, 48 GB DRAM
• SSD: 2 x 10GbE, 12+ cores, 64 GB DRAM]
Why is MapR faster and more efficient?
§ No redundant layers (not a file system over a file system)
§ C/C++ vs. Java (higher performance and no garbage collection freezes)
§ Port scaling (multi-NIC support) and high-speed RPC
§ Distributed metadata
§ Native compression
§ Optimized shuffle
§ Advanced cache manager
20. Designed for Performance and Scale

                                                 MapR                  Apache/CDH
  TeraSort w/ 1x replication (no compression)
    Total                                        24 min 34 sec         49 min 33 sec
    Map                                          9 min 54 sec          28 min 12 sec
    Shuffle                                      9 min 8 sec           27 min 0 sec
  TeraSort w/ 3x replication (no compression)
    Total                                        47 min 4 sec          73 min 42 sec
    Map                                          11 min 2 sec          30 min 8 sec
    Shuffle                                      9 min 17 sec          28 min 40 sec
  DFSIO/local write
    Throughput/node                              870 MB/s              240 MB/s
  YCSB (HBase benchmark, 50% read, 50% update)
    Throughput                                   33102 ops/sec         7904 ops/sec
    Latency (read/update)                        2.9-4 ms / 0.4 ms     7-30 ms / 0-5 ms
  YCSB (HBase benchmark, 95% read, 5% update)
    Throughput                                   18K ops/sec           8500 ops/sec
    Latency (read/update)                        5.5-5.7 ms / 0.6 ms   12-30 ms / 1 ms

HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2 TB disks, 32 GB RAM
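The two YCSB rows correspond to the benchmark's standard workloads A (50% read / 50% update) and B (95% read / 5% update). A representative invocation against HBase, assuming YCSB's bundled workload files and a pre-created table with column family f1:

    # Load the data set, then run workload A (50/50 read/update)
    bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1
    bin/ycsb run hbase -P workloads/workloada -p columnfamily=f1

    # Workload B is the 95/5 read/update mix
    bin/ycsb run hbase -P workloads/workloadb -p columnfamily=f1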
21. Customer Support
• 24x7x365 “Follow-The-Sun” coverage
• Critical customer issues are worked on around the clock
• Dedicated team of Hadoop engineering experts
• Contacting MapR support
– Email: support@mapr.com (automatically opens a case)
– Phone: 1.855.669.6277
– Self-service options:
§ http://answers.mapr.com/
§ Web portal: http://mapr.com/support
22. Two MapR Editions – M3 and M5
M3:
§ Control System
§ NFS Access
§ Performance
§ Unlimited Nodes
§ Free
M5:
§ Control System
§ NFS Access
§ Performance
§ High Availability
§ Snapshots & Mirroring
§ 24x7 Support
§ Annual Subscription
Also available through Google Compute Engine
23. Try MapR on Google Compute Engine
www.mapr.com/google
25. Latency Matters
• Ad-hoc analysis with interactive tools
• Real-time dashboards
• Event/trend detection and analysis
• Network intrusion analysis on the fly
• Fraud detection
• Failure detection and analysis
26. Big Data Processing

                         Batch processing    Interactive analysis      Stream processing
  Query runtime          Minutes to hours    Milliseconds to minutes   Never-ending
  Data volume            TBs to PBs          GBs to PBs                Continuous stream
  Programming model      MapReduce           Queries                   DAG
  Users                  Developers          Analysts and developers   Developers
  Google project         MapReduce           Dremel
  Open source project    Hadoop MapReduce                              Storm and S4

Introducing Apache Drill…
27. Innovations
• MapReduce
• Scalable IO and compute trumps efficiency with today's commodity hardware
• With large datasets, schemas and indexes are too limiting
• Flexibility is more important than efficiency
• An easy-to-use, scalable, fault-tolerant execution framework is key for large clusters
• Dremel
• Columnar storage provides significant performance benefits at scale
• Columnar storage with nesting preserves structure and can be very efficient
• Avoiding final record assembly as long as possible improves efficiency
• Optimizing for the query use case can avoid the full generality of MR and thus
significantly reduce latency. No need to start JVMs, just push compact queries to
running agents.
• Apache Drill
• Open source project based upon Dremel’s ideas
• More flexibility and openness