1©MapR Technologies - Confidential
Inside MapR’s M7
How to get a million ops per second on 10 nodes
2©MapR Technologies - Confidential
Me, Us
 Ted Dunning, Chief Application Architect, MapR
Committer PMC member, Mahout, Zookeeper, Drill
Bought the beer at the first HUG
 MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry standard APIs
 Tonight
Hash tag - #mapr #fast
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
3©MapR Technologies - Confidential
MapR does MapReduce (fast)
TeraSort Record
1 TB in 54 seconds
1003 nodes
MinuteSort Record
1.5 TB in 59 seconds
2103 nodes
4©MapR Technologies - Confidential
MapR: Lights Out Data Center Ready
Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW
• Load balancing
• Rolling upgrades
• No lost jobs or data
• 99.999% (five 9’s) uptime
Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end check summing
• Strong consistency
• Built-in compression
• Mirror between two sites by RTO
6©MapR Technologies - Confidential
Part 1:
What’s past is prologue
HBase is really good
except when it isn’t
but it has a heart of gold
8©MapR Technologies - Confidential
Part 2:
An implementation tour
with many tricks
and clever ploys
9©MapR Technologies - Confidential
Part 3:
11©MapR Technologies - Confidential
Part 1:
What’s past is prologue
12©MapR Technologies - Confidential
Dynamo DB
Vertex DB
13©MapR Technologies - Confidential
HBase Table Architecture
 Tables are divided into key ranges (regions)
 Regions are served by nodes (RegionServers)
 Columns are divided into access groups (column families)
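As a rough illustration of how key ranges map to servers, the sketch below routes a row key to a region with a binary search over sorted region start keys. The region boundaries and server names are made up for the example, and this is not HBase's actual client code.

```python
import bisect

# Hypothetical region table: sorted start keys and the server hosting each region.
# The first region starts at the empty key, so every row key falls in some region.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def locate_region(row_key):
    """Return (region_index, server) for the region whose key range holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position covers the key.
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]


print(locate_region("apple"))
print(locate_region("zebra"))
```

Because regions are contiguous key ranges, a scan only has to visit the regions whose ranges overlap the scan, which is the property the next slide credits for scans not needing a broadcast.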
14©MapR Technologies - Confidential
HBase Architecture is Better
 Strong consistency model
– when a write returns, all readers will see same value
– "eventually consistent" is often "eventually inconsistent"
 Scan works
– does not broadcast
– ring-based NoSQL databases (eg, Cassandra, Riak) suffer on scans
 Scales automatically
– Splits when regions become too large
– Uses HDFS to spread data, manage space
 Integrated with Hadoop
– map-reduce on HBase is straightforward
15©MapR Technologies - Confidential
But ... how well do you know HBCK?
a.k.a. HBase Recovery
 HBase-5843: Improve HBase MTTR – Mean Time To Recover
 HBase-6401: HBase may lose edits after a crash with 1.0.3
– uses appends
 HBase-3809: .META. may not come back online if ….
 etc
 about 40-50 Jiras on this topic
 Very complex algorithm to assign a region
– and still does not get it right on reboot
16©MapR Technologies - Confidential
HBase Issues
Reliability
• Compactions disrupt operations
• Very slow crash recovery
• Unreliable splitting
Business continuity
• Common hardware/software issues cause downtime
• Administration requires downtime
• No point-in-time recovery
• Complex backup process
Performance
• Many bottlenecks result in low throughput
• Limited data locality
• Limited # of tables
Manageability
• Compactions, splits and merges must be done manually (in reality)
• Basic operations like backup or table rename are complex
17©MapR Technologies - Confidential
Examples: Performance Issues
 Limited support for multiple column families: HBase has
issues handling multiple column families due to compactions. The standard
HBase documentation recommends no more than 2-3 column families.
 Limited data locality: HBase does not take into account block locations
when assigning regions. After a reboot, RegionServers are often reading data
over the network rather than the local drives. (HBASE-4755, HBASE-4491)
 Cannot utilize disk space: HBase RegionServers struggle with more
than 50-150 regions per RegionServer so a commodity server can only handle
about 1TB of HBase data, wasting disk space.
 Limited # of tables: A single cluster can only handle several tens of
tables effectively.
18©MapR Technologies - Confidential
Examples: Manageability Issues
 Manual major compactions: HBase major compactions are disruptive
so production clusters keep them disabled and rely on the administrator to
manually trigger compactions.
 Manual splitting: HBase auto-splitting does not work properly in a busy
cluster so users must pre-split a table based on their estimate of data
size/growth. (
 Manual merging: HBase does not automatically merge regions that are
too small. The administrator must take down the cluster and trigger the
merges manually.
 Basic administration is complex: Renaming a table requires copying
all the data. Backing up a cluster is a complex process. (HBASE-643)
19©MapR Technologies - Confidential
Examples: Reliability Issues
 Compactions disrupt HBase operations: I/O bursts overwhelm
nodes (
 Very slow crash recovery: RegionServer crash can cause data to be
unavailable for up to 30 minutes while WALs are replayed for
impacted regions. (HBASE-1111)
 Unreliable splitting: Region splitting may cause data to be
inconsistent and unavailable.
 No client throttling: HBase client can easily overwhelm
RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
20©MapR Technologies - Confidential
One Issue – Crash Recovery Too Slow
 HBASE-1111 superseded by HBASE-5843 which is blocked by
HDFS-3912 HBASE-6736 HBASE-6970 HBASE-7989 HBASE-6315
HBASE-7815 HBASE-6737 HBASE-6738 HBASE-7271 HBASE-7590
HBASE-7756 HBASE-8204 HBASE-5992 HBASE-6156 HBASE-6878
HBASE-6364 HBASE-6713 HBASE-5902 HBASE-4755 HBASE-7006
HDFS-2576 HBASE-6309 HBASE-6751 HBASE-6752 HBASE-6772
HBASE-6773 HBASE-6774 HBASE-7246 HBASE-7334 HBASE-5859
HBASE-6058 HBASE-6290 HBASE-7213 HBASE-5844 HBASE-5924
HBASE-6435 HBASE-6783 HBASE-7247 HBASE-7327 HDFS-4721
HBASE-5877 HBASE-5926 HBASE-5939 HBASE-5998 HBASE-6109
HBASE-6870 HBASE-5930 HDFS-4754 HDFS-3705
21©MapR Technologies - Confidential
What is the
source of these
22©MapR Technologies - Confidential
RegionServers are problematic
 Coordinating 3 separate distributed systems is very hard
– HBase, HDFS, ZK
– Each of these systems has multiple internal systems
– Too many races, too many undefined properties
 Distributed transaction framework not available
– Too many failures to deal with
 Java GC wipes out the RS from time to time
– Cannot use -Xmx20g for a RS
 Hence all the bugs
– HBCK is your "friend"
23©MapR Technologies - Confidential
Region Assignment in Apache HBase
24©MapR Technologies - Confidential
HDFS Architecture Review
• Files are broken into blocks
• Distributed across data-nodes
• NameNode holds (in DRAM)
– Directories, Files
– Block replica locations
• Data Nodes
– Serve blocks
– No idea about files/dirs
• All ops go to NN
[Diagram: files are sharded into blocks; DataNodes save blocks]
25©MapR Technologies - Confidential
A File at the NameNode
• NameNode holds in-memory
– Dir hierarchy ("names")
– File attrs ("inode")
– Composite file structure
– Array of block-ids
• 1-byte file in HDFS
– 1 HDFS "block" on 3 DN's
– 3 entries in NN totaling 1K DRAM
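To make the 1K-per-file figure concrete, here is a back-of-the-envelope sketch of NameNode DRAM as a function of file count. It assumes every file costs the same roughly 1 KB as the slide's 1-byte example (small, single-block files); the function name and the file counts tried are chosen only for illustration.

```python
GIB = 2**30


def namenode_dram_bytes(num_files, bytes_per_file=1024):
    """Estimated NameNode DRAM if each file costs about bytes_per_file of metadata."""
    return num_files * bytes_per_file


# Try a few file counts in the range where a single NameNode is usually said to strain.
for files in (50_000_000, 200_000_000):
    gib = namenode_dram_bytes(files) / GIB
    print(f"{files:>11,} files -> about {gib:.1f} GiB of NameNode DRAM")
```

Under this assumption the metadata for a few hundred million small files no longer fits comfortably in one JVM heap, which is the scaling limit the deck keeps returning to.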
26©MapR Technologies - Confidential
NN scalability problems
• DN reports blocks to NN
– 128M blocks
– 12T of disk => DN sends 100K blocks/report
– RPC on wire is 4M
– causes extreme load at both DN and NN
• With NN-HA, DN's do dual block-reports
– one to primary, one to secondary
– doubles the load on DN
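The block-report numbers can be checked with a little arithmetic. The sketch below assumes a completely full 12 TiB DataNode, 128 MiB blocks, and binary units throughout; the helper names are illustrative.

```python
TIB = 2**40
MIB = 2**20


def blocks_per_report(disk_bytes, block_bytes):
    """Number of full blocks a DataNode with this much disk would report."""
    return disk_bytes // block_bytes


def bytes_per_block_entry(report_bytes, num_blocks):
    """Average wire cost of one block entry in a full report."""
    return report_bytes / num_blocks


blocks = blocks_per_report(12 * TIB, 128 * MIB)
print(blocks)  # 98304, roughly the 100K quoted on the slide
print(round(bytes_per_block_entry(4 * MIB, blocks), 1))  # about 42.7 bytes per block
```

With NN-HA each DataNode sends this report twice, once per NameNode, so the per-DataNode reporting cost doubles as the slide says.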
27©MapR Technologies - Confidential
Scaling Parameters
 Unit of I/O
– 4K/8K (8K in MapR)
 Unit of Chunking (a map-reduce
– 10-100's of megabytes
 Unit of Resync (a replica)
– 10-100's of gigabytes
– container in MapR
HDFS 'block'
 Unit of Administration (snap, repl, mirror, quota, backup)
– 1 gigabyte - 1000's of terabytes
– volume in MapR
– what data is affected by my
missing blocks?
28©MapR Technologies - Confidential
MapR's No-NameNode Architecture
HDFS Federation
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required
MapR (distributed metadata)
• HA w/ automatic failover
• Instant cluster restart
• Up to 1 trillion files
• 20x higher performance
• 100% commodity hardware
29©MapR Technologies - Confidential
MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
• Each container contains
– Directories & files
– Data blocks
• Replicated on servers
• Millions of containers in a typical cluster
Containers are 16-32 GB segments of disk, placed on
Patent Pending
30©MapR Technologies - Confidential
M7 Containers
 Container holds many files
– regular, dir, symlink, btree, chunk-map, region-map, …
– all random-write capable
– each can hold 100's of millions of files
 Container is replicated to servers
– unit of resynchronization
 Region lives entirely inside 1 container
– all files + WALs + btree's + bloom-filters + range-maps
31©MapR Technologies - Confidential
Read-write Replication
• Writes are synchronous
– All copies have same data
• Data is replicated in a "chain"
– better bandwidth, utilizes full-duplex network links well
• Meta-data is replicated in a "star"
– response time better, bandwidth not of concern
– data can also be done this way
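The chain versus star trade-off can be seen in a deliberately simple cost model. This sketch assumes every hop has the same latency, ignores acknowledgements and pipelining, and is not a description of MapR's actual protocol; the function names and dictionary keys are made up for illustration.

```python
def chain_cost(replicas, payload_bytes, hop_latency_ms):
    """Primary sends to replica 2, which forwards to replica 3, and so on."""
    hops = replicas - 1
    return {
        # Each node forwards the payload at most once.
        "max_bytes_sent_by_one_node": payload_bytes if hops else 0,
        # Hops happen one after another in this model.
        "latency_ms": hops * hop_latency_ms,
    }


def star_cost(replicas, payload_bytes, hop_latency_ms):
    """Primary sends directly to every other replica in parallel."""
    hops = replicas - 1
    return {
        # The primary alone pushes one copy per extra replica.
        "max_bytes_sent_by_one_node": hops * payload_bytes,
        # All sends overlap, so one hop of latency covers them.
        "latency_ms": hop_latency_ms if hops else 0,
    }


print(chain_cost(3, 1_000_000, 0.5))
print(star_cost(3, 1_000_000, 0.5))
```

In this model the chain keeps the busiest node's outbound traffic at one copy of the data, which suits bulk data, while the star finishes in a single hop, which suits small metadata updates.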
33©MapR Technologies - Confidential
Failure Handling
Containers managed at CLDB, the Container Location DataBase (HB, container-reports).
• HB loss + upstream entity reports failure => server dead
• Increment epoch at CLDB
• Rearrange replication
• Exact same code for files and M7 tables
• No ZK needed at this level
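One way to picture the epoch step is the toy record below. The slide only says that the CLDB increments an epoch and rearranges replication; using the epoch to reject a stale replica, along with the class and method names, is an assumption made for this sketch.

```python
class ContainerRecord:
    """Toy stand-in for what a location database might track per container."""

    def __init__(self, replicas):
        self.epoch = 1
        self.replicas = list(replicas)

    def handle_server_dead(self, server, spare=None):
        """Drop a dead server, bump the epoch, and optionally add a replacement."""
        if server not in self.replicas:
            return
        self.replicas.remove(server)
        self.epoch += 1  # anything still holding the old epoch is now stale
        if spare is not None:
            self.replicas.append(spare)

    def accepts(self, server, epoch):
        """Assumed rule: only current members with the current epoch are valid."""
        return server in self.replicas and epoch == self.epoch


record = ContainerRecord(["a", "b", "c"])
record.handle_server_dead("b", spare="d")
print(record.epoch, record.replicas)
```

If server "b" later reappears with its old copy, its epoch no longer matches, so it can be recognized as out of date and resynchronized rather than trusted.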
34©MapR Technologies - Confidential
Benchmark: File creates (100B)
Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM
Same 10 nodes, but with 3X repl
[Charts: x-axis Files (M); series: MapR distribution, Other distribution]
                   MapR     Other     Advantage
Rate (creates/s)   14-16K   335-360   40x
Scale (files)      6B       1.3M      4615x
35©MapR Technologies - Confidential
 HBase has a good basis
– But is handicapped by HDFS
– But can’t do without HDFS
– HBase can’t be fixed in isolation
 Separating key storage scaling parameters is key
– Allows additional layer of storage indirection
– Results in huge scaling and performance improvement
 Low-level transactions are hard
– Allows R/W file system, decentralized meta-data
– Also allows non-file implementations
36©MapR Technologies - Confidential
Part 2:
An implementation tour
38©MapR Technologies - Confidential
An Outline of Important Factors
 Start with MapR FS (mutability, transactions, real snapshots)
 C++ not Java (data never moves, better control)
 Lockless design, custom queue executive (3 ns switch)
 New RPC layer (> 1 M RPC / s)
 Cut out the middle man (single hop to data)
 Hybridize log-structured merge trees and B-trees
 Adjust sizes and fanouts
 Don’t be silly
We get these all for free by putting tables into MapR FS
39©MapR Technologies - Confidential
M7: Tables Integrated into Storage
No extra daemons to manage
One hop to data
Superior caching
No JVM problems
40©MapR Technologies - Confidential
Lesson 0:
tables in the
file system
41©MapR Technologies - Confidential
Why Not Java?
 Disclaimer: I am a pro-Java bigot
 But that only goes so far …
 Consider the memory size of
struct {x, y}[] a;
 Consider also interpreting data as it has arrived from the wire
 Consider the problem of writing a micro-stack queue executive
with hundreds of thousands of threads and 3 ns context switch
 Consider the problem of core-locked processes running a cache-aware,
lock-free, zero-copy queue of tasks
 Consider the GC-free life-style
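The memory-size point can be felt even from Python, which boxes objects much as the JVM does. The sketch below is a CPython analog, not a JVM measurement: it compares a packed layout of two 32-bit integers per record with a list of small objects. The `Point` class and the helper names are illustrative, and the boxed figure deliberately undercounts (it skips the integer objects and any per-instance attribute storage).

```python
import struct
import sys

# Two 32-bit ints per record, standard sizes, no padding: the C-style layout.
PACKED_RECORD = struct.Struct("=ii")


def packed_bytes(n):
    """Bytes needed for n records stored back to back."""
    return PACKED_RECORD.size * n


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def boxed_bytes(n):
    """Lower bound on bytes for n records stored as separate heap objects."""
    points = [Point(i, i) for i in range(n)]
    # The list's pointer array plus each object's own header.
    return sys.getsizeof(points) + sum(sys.getsizeof(p) for p in points)


print(packed_bytes(1000))
print(boxed_bytes(1000))
```

The packed form is also one contiguous run of memory, so a scan touches consecutive cache lines instead of chasing a pointer per record.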
42©MapR Technologies - Confidential
At What Cost
 But writing performant C++ is hard
 Managing low-level threads is hard
 Implementing very fast failure recovery is hard
 Doing manual memory allocation is hard (and dangerous)
 Benefits outweigh costs with the right dev team
 Benefits dwarfed by the costs with the wrong dev team
43©MapR Technologies - Confidential
Lesson 1: With
great speed
comes great
45©MapR Technologies - Confidential
M7 Table Architecture
tablet tablet
This structure is
internal and not
46©MapR Technologies - Confidential
Multi-level Design
 Fixed number of levels like HBase
 Specialized fanout to match sizes to device physics
 Mutable file system allows chimeric LSM-tree / B-tree
 Sized to match container structure
 Guaranteed locality
– If the data moves, the new node will handle it
– If the node fails, the new node will handle it
47©MapR Technologies - Confidential
Lesson 2:
Physics. Not
just a good
idea. It’s the
48©MapR Technologies - Confidential
RPC Reimplementation
 At very high data rates, protobuf is too slow
– Not good as an envelope, still a great schema definition language
– Most systems never hit this limit
 Alternative 1
– Lazy parsing allows deferral of content parsing
– Naïve implementation imposes (yet another) extra copy
 Alternative 2
– Bespoke parsing of envelope from the wire
– Content packages can land fully aligned and ready for battle directly from
the wire
 Let’s use BOTH ideas
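A minimal sketch of combining the two ideas: parse a small fixed envelope straight from the received buffer, and leave the content untouched until someone asks for it. The wire format here (a 4-byte method id and a 4-byte payload length, big-endian) and all names are hypothetical and chosen only for illustration; this is not MapR's RPC format.

```python
import struct

# Hypothetical envelope: method id and payload length, both unsigned 32-bit.
ENVELOPE = struct.Struct(">II")


def encode(method_id, text):
    """Build a message: fixed envelope followed by a UTF-8 payload."""
    body = text.encode("utf-8")
    return ENVELOPE.pack(method_id, len(body)) + body


class LazyMessage:
    def __init__(self, buf):
        view = memoryview(buf)
        # Bespoke envelope parse: read the fixed header in place.
        self.method_id, length = ENVELOPE.unpack_from(view, 0)
        # Slicing a memoryview does not copy the payload bytes.
        self._payload = view[ENVELOPE.size:ENVELOPE.size + length]
        self._decoded = None

    @property
    def payload_text(self):
        # Lazy parse: the payload is only copied and decoded on first access.
        if self._decoded is None:
            self._decoded = self._payload.tobytes().decode("utf-8")
        return self._decoded


msg = LazyMessage(encode(7, "hello"))
print(msg.method_id)
print(msg.payload_text)
```

A server that only routes on `method_id` never pays for decoding the payload at all, which is the saving lazy parsing is after.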
49©MapR Technologies - Confidential
Lesson 3:
Hacking and
abstraction can
50©MapR Technologies - Confidential
Don’t Be Silly
 Detailed review of the code revealed an extra copy
– It was subtle. Really.
 Performance increased when this was stopped
 Not as easy to spot as it sounds
– But absolutely still worth finding and fixing
51©MapR Technologies - Confidential
Part 3:
52©MapR Technologies - Confidential
Server Reboot
 Full container-reports are tiny
– CLDB needs 2G dram for 1000-node cluster
 Volumes come online very fast
– each volume independent of others
– as soon as min-repl # of containers ready
– no need to wait for whole cluster
(eg, HDFS waits for 99.9% blocks reporting)
 1000-node cluster restart < 5 mins
53©MapR Technologies - Confidential
M7 provides Instant Recovery
 0-40 microWALs per region
– idle WALs go to zero quickly, so most are empty
– region is up before all microWALs are recovered
– recovers region in background in parallel
– when a key is accessed, that microWAL is recovered inline
– 1000-10000x faster recovery
 Why doesn't HBase do this?
– M7 leverages unique MapR-FS capabilities, not impacted by HDFS
– No limit to # of files on disk
– No limit to # open files
– I/O path translates random writes to sequential writes on disk
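The recovery idea above can be sketched as follows: the region accepts reads immediately, replays a microWAL inline only when a key that maps to it is touched, and drains the rest in the background. How keys map to microWALs is not stated on the slide, so the byte-sum mapping and all names here are assumptions made for illustration.

```python
def wal_for(key, num_wals):
    """Hypothetical deterministic mapping from a key to a microWAL."""
    return sum(key.encode("utf-8")) % num_wals


def make_wals(edits, num_wals):
    """Group (key, value) edits into per-microWAL lists."""
    wals = [[] for _ in range(num_wals)]
    for key, value in edits:
        wals[wal_for(key, num_wals)].append((key, value))
    return wals


class Region:
    def __init__(self, micro_wals):
        self.num_wals = len(micro_wals)
        # Empty microWALs need no recovery, so only non-empty ones are pending.
        self.pending = {i: w for i, w in enumerate(micro_wals) if w}
        self.store = {}

    def _recover(self, wal_id):
        for key, value in self.pending.pop(wal_id, []):
            self.store[key] = value

    def get(self, key):
        # Inline recovery: replay just the microWAL this key depends on.
        self._recover(wal_for(key, self.num_wals))
        return self.store.get(key)

    def recover_in_background(self):
        for wal_id in list(self.pending):
            self._recover(wal_id)


region = Region(make_wals([("a", "1"), ("b", "2"), ("c", "3")], 4))
print(region.get("b"), len(region.pending))
```

A real system would run the background pass concurrently and in parallel; here it is a plain method call to keep the sketch short.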
54©MapR Technologies - Confidential
Other M7 Features
 Smaller disk footprint
– M7 never repeats the key or column name
 Columnar layout
– M7 supports 64 column families
– in-memory column-families
 Online admin
– M7 schema changes on the fly
– delete/rename/redistribute tables
55©MapR Technologies - Confidential
Binary Compatible
 HBase applications work "as is" with M7
– No need to recompile (binary compatible)
 Can run M7 and HBase side-by-side on the same cluster
– eg, during a migration
– can access both M7 table and HBase table in same program
 Use standard Apache HBase CopyTable tool to copy a table
from HBase to M7 or vice-versa, viz.,
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable oldtable
56©MapR Technologies - Confidential
M7 vs CDH - Mixed Load 50-50
59©MapR Technologies - Confidential
 HBase has some excellent core ideas
– But is burdened by years of technical debt
– Much of the debt was charged on the HDFS credit cards
 MapR FS provides ideal substrate for HBase-like service
– One hop from client to data
– Many problems never even exist in the first place
– Other problems have relatively simple solutions with better foundation
 Practical results bear out the theory
60©MapR Technologies - Confidential
Me, Us
 Ted Dunning, Chief Application Architect, MapR
Committer PMC member, Mahout, Zookeeper, Drill
Bought the beer at the first HUG
 MapR
Distributes more open source components for Hadoop
Adds major technology for performance and HA
Adds industry standard APIs
 Tonight
Hash tag - #nosqlnow #mapr #fast
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
61©MapR Technologies - Confidential

Mais conteúdo relacionado

Mais procurados

Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaDataWorks Summit
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaManish Maheshwari
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerChallenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerMapR Technologies
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges DataWorks Summit
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARNAdam Kawa
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batchboorad
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Sumeet Singh
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicasenissoz

Mais procurados (20)

Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in Alibaba
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerChallenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
Apache kudu
Apache kuduApache kudu
Apache kudu
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batch
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas


HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapRlohitvijayarenu
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesMapR Technologies
Spark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleSpark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleMapR Technologies
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012MapR Technologies
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation WorkshopMapR Technologies
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeMapR Technologies
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションApache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションMapR Technologies Japan
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL
HBase and Drill: How Loosely Typed SQL is Ideal for NoSQLHBase and Drill: How Loosely Typed SQL is Ideal for NoSQL
HBase and Drill: How Loosely Typed SQL is Ideal for NoSQLMapR Technologies
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl

Destaque (15)

HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL References
Spark & Hadoop at Production at Scale
Spark & Hadoop at Production at ScaleSpark & Hadoop at Production at Scale
Spark & Hadoop at Production at Scale
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012
Practical Machine Learning: Innovations in Recommendation Workshop
Practical Machine Learning:  Innovations in Recommendation WorkshopPractical Machine Learning:  Innovations in Recommendation Workshop
Practical Machine Learning: Innovations in Recommendation Workshop
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッションApache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
Apache Drill でたしなむ セルフサービスデータ探索 - 2014/11/06 Cloudera World Tokyo 2014 LTセッション
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL
HBase and Drill: How Loosely Typed SQL is Ideal for NoSQLHBase and Drill: How Loosely Typed SQL is Ideal for NoSQL
HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning

Semelhante a Inside MapR's M7

2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
Inside MapR's M7

  • 1. ©MapR Technologies - Confidential. Inside MapR’s M7: How to get a million ops per second on 10 nodes
  • 2. Me, Us • Ted Dunning, Chief Application Architect, MapR: committer and PMC member for Mahout, ZooKeeper, and Drill; bought the beer at the first HUG • MapR: distributes more open source components for Hadoop; adds major technology for performance, HA, and industry-standard APIs • Tonight: hashtags #mapr #fast; see also @ApacheMahout, @ApacheDrill, @ted_dunning, and @mapR
  • 3. MapR does MapReduce (fast) • TeraSort record: 1 TB in 54 seconds on 1003 nodes • MinuteSort record: 1.5 TB in 59 seconds on 2103 nodes
  • 4. MapR: Lights Out Data Center Ready. Reliable Compute: • Automated stateful failover • Automated re-replication • Self-healing from HW and SW failures • Load balancing • Rolling upgrades • No lost jobs or data • 99999’s of uptime. Dependable Storage: • Business continuity with snapshots and mirrors • Recover to a point in time • End-to-end checksumming • Strong consistency • Built-in compression • Mirror between two sites by RTO policy
  • 5. Part 1: What’s past is prologue
  • 6. Part 1: What’s past is prologue. HBase is really good, except when it isn’t, but it has a heart of gold
  • 7. Part 2: An implementation tour
  • 8. Part 2: An implementation tour, with many tricks and clever ploys
  • 9. Part 3: Results
  • 10.
  • 11. Part 1: What’s past is prologue
  • 12. NoSQL: Dynamo DB, ZopeDB, Shoal, CloudKit, Vertex DB, FlockDB
  • 13. HBase Table Architecture • Tables are divided into key ranges (regions) • Regions are served by nodes (RegionServers) • Columns are divided into access groups (column families) (Diagram labels: CF1 CF2 CF3 CF4 CF5; R1 R2 R3 R4)
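The key-range idea on this slide can be illustrated with a small lookup. This is a minimal sketch, not HBase's actual client code: the start keys, the RegionServer names, and the `locate` helper are all hypothetical, and it assumes each region covers the keys from its start key up to the next region's start key.

```python
from bisect import bisect_right

# Hypothetical region table: sorted start keys for regions R1..R4.
# Region i is assumed to cover [REGION_STARTS[i], REGION_STARTS[i + 1]).
REGION_STARTS = ["", "g", "n", "t"]
# Assumed RegionServer serving each region (names are illustrative only).
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]


def locate(row_key: str) -> str:
    """Return the server whose region's key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one that owns the key.
    idx = bisect_right(REGION_STARTS, row_key) - 1
    return REGION_SERVERS[idx]


if __name__ == "__main__":
    for key in ["apple", "g", "mango", "zebra"]:
        print(key, "->", locate(key))
```

Because the regions are contiguous, sorted key ranges, a scan over a key interval only needs to visit the regions whose ranges overlap it, which is why the next slide can say scans do not broadcast.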
  • 14. HBase Architecture is Better • Strong consistency model: when a write returns, all readers will see the same value; "eventually consistent" is often "eventually inconsistent" • Scan works: it does not broadcast; ring-based NoSQL databases (e.g., Cassandra, Riak) suffer on scans • Scales automatically: splits when regions become too large; uses HDFS to spread data and manage space • Integrated with Hadoop: map-reduce on HBase is straightforward
  • 15. But ... how well do you know HBCK? (a.k.a. HBase Recovery) • HBASE-5843: Improve HBase MTTR (Mean Time To Recover) • HBASE-6401: HBase may lose edits after a crash with 1.0.3 (uses appends) • HBASE-3809: .META. may not come back online if …. • etc.: about 40-50 JIRAs on this topic • Very complex algorithm to assign a region, and it still does not get it right on reboot
  • 16. HBase Issues. Reliability: • Compactions disrupt operations • Very slow crash recovery • Unreliable splitting. Business continuity: • Common hardware/software issues cause downtime • Administration requires downtime • No point-in-time recovery • Complex backup process. Performance: • Many bottlenecks result in low throughput • Limited data locality • Limited # of tables. Manageability: • Compactions, splits and merges must be done manually (in reality) • Basic operations like backup or table rename are complex
  • 17. Examples: Performance Issues • Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149) • Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers often read data over the network rather than from local drives. (HBASE-4755, HBASE-4491) • Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1 TB of HBase data, wasting disk space. • Limited # of tables: A single cluster can only handle several tens of tables effectively.
  • 18. Examples: Manageability Issues • Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to trigger compactions manually. • Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. ( with-hbase-dynamic.html) • Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually. • Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)
  • 19. Examples: Reliability Issues • Compactions disrupt HBase operations: I/O bursts overwhelm nodes. • Very slow crash recovery: A RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for the impacted regions. (HBASE-1111) • Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. ( hbase-dynamic.html) • No client throttling: An HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
  • 20. One Issue: Crash Recovery Too Slow • HBASE-1111 superseded by HBASE-5843 which is blocked by HDFS-3912 HBASE-6736 HBASE-6970 HBASE-7989 HBASE-6315 HBASE-7815 HBASE-6737 HBASE-6738 HBASE-7271 HBASE-7590 HBASE-7756 HBASE-8204 HBASE-5992 HBASE-6156 HBASE-6878 HBASE-6364 HBASE-6713 HBASE-5902 HBASE-4755 HBASE-7006 HDFS-2576 HBASE-6309 HBASE-6751 HBASE-6752 HBASE-6772 HBASE-6773 HBASE-6774 HBASE-7246 HBASE-7334 HBASE-5859 HBASE-6058 HBASE-6290 HBASE-7213 HBASE-5844 HBASE-5924 HBASE-6435 HBASE-6783 HBASE-7247 HBASE-7327 HDFS-4721 HBASE-5877 HBASE-5926 HBASE-5939 HBASE-5998 HBASE-6109 HBASE-6870 HBASE-5930 HDFS-4754 HDFS-3705
  • 21. What is the source of these problems?
  • 22. RegionServers are problematic • Coordinating 3 separate distributed systems (HBase, HDFS, ZK) is very hard: each of these systems has multiple internal systems; too many races, too many undefined properties • Distributed transaction framework not available: too many failures to deal with • Java GC wipes out the RS from time to time: cannot use -Xmx20g for a RS • Hence all the bugs: HBCK is your "friend"
  • 23. Region Assignment in Apache HBase
  • 24. HDFS Architecture Review • Files are broken into blocks, distributed across DataNodes • NameNode holds (in DRAM): directories, files, block replica locations • DataNodes: serve blocks; no idea about files/dirs • All ops go to the NN (Diagram labels: files sharded into blocks; DataNodes save blocks)
  • 25. A File at the NameNode • NameNode holds in memory: dir hierarchy ("names"), file attrs ("inode"), composite file structure, array of block-ids • A 1-byte file in HDFS: 1 HDFS "block" on 3 DNs; 3 entries in the NN totaling 1K of DRAM (Diagram label: Composite File Structure)
  • 26. NN scalability problems • DN reports blocks to NN: with 128M blocks, 12T of disk => DN sends 100K blocks/report; the RPC on the wire is 4M; this causes extreme load at both DN and NN • With NN-HA, DNs do dual block-reports: one to primary, one to secondary; this doubles the load on the DN
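The "100K blocks/report" figure on this slide follows from simple division. The sketch below reproduces it under stated assumptions: binary units (1M = 2^20 bytes, 1T = 2^40 bytes), a disk filled entirely with full-size blocks, and the slide's 4M RPC size taken at face value. The function names are illustrative and not part of any HDFS API.

```python
MIB = 2**20
TIB = 2**40


def blocks_per_report(disk_bytes: int, block_bytes: int) -> int:
    """Blocks a DataNode would list if its disk were full of full-size blocks."""
    return disk_bytes // block_bytes


def rpc_bytes_per_block(rpc_bytes: int, blocks: int) -> float:
    """Average wire bytes per reported block, given the total RPC size."""
    return rpc_bytes / blocks


if __name__ == "__main__":
    blocks = blocks_per_report(12 * TIB, 128 * MIB)
    print("blocks per report:", blocks)  # close to the slide's ~100K
    print("bytes per block on the wire:", round(rpc_bytes_per_block(4 * MIB, blocks), 1))
    # With NameNode HA the slide says each DataNode reports twice
    # (primary and secondary), so the per-DataNode report traffic doubles.
    print("report bytes with NN-HA:", 2 * 4 * MIB)
```

Smaller files make this worse, since a disk full of partially filled blocks holds more blocks than the full-size estimate above.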
  • 27. Scaling Parameters • Unit of I/O: 4K/8K (8K in MapR) • Unit of Chunking (a map-reduce split): 10-100's of megabytes • Unit of Resync (a replica): 10-100's of gigabytes; container in MapR • Unit of Administration (snap, repl, mirror, quota, backup): 1 gigabyte - 1000's of terabytes; volume in MapR; what data is affected by my missing blocks? (Diagram labels: i/o, 10^3, map-red, 10^6, resync, 10^9, admin, HDFS 'block')
  • 28. MapR's No-NameNode Architecture. HDFS Federation: • Multiple single points of failure • Limited to 50-200 million files • Performance bottleneck • Commercial NAS required. MapR (distributed metadata): • HA w/ automatic failover • Instant cluster restart • Up to 1 trillion files • 20x higher performance • 100% commodity hardware (Diagram labels: NAS appliance; NameNodes A B, C D, E F; DataNodes)
  • 29. 29©MapR Technologies - Confidential MapR's Distributed NameNode • Each container contains • Directories & files • Data blocks • Replicated on servers • Millions of containers in a typical cluster. Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks. Containers are 16-32 GB segments of disk, placed on nodes. Patent Pending
  • 30. 30©MapR Technologies - Confidential M7 Containers • Container holds many files – regular, dir, symlink, btree, chunk-map, region-map, … – all random-write capable – each can hold 100's of millions of files • Container is replicated to servers – unit of resynchronization • Region lives entirely inside 1 container – all files + WALs + btree's + bloom-filters + range-maps
  • 31. 31©MapR Technologies - Confidential Read-write Replication • Writes are synchronous – All copies have same data • Data is replicated in a "chain" fashion – better bandwidth, utilizes full-duplex network links well • Meta-data is replicated in a "star" manner – response time better, bandwidth not of concern – data can also be done this way [Figure labels: client1, client2, clientN]
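The trade-off between "chain" and "star" replication on this slide can be illustrated with a toy cost model. This is a sketch under simplifying assumptions (equal links, no pipelining, one write at a time); the function names and the cost measures are hypothetical and are not MapR's implementation.

```python
# Toy cost model for one synchronous write of `size` bytes to `replicas` copies.
# Assumptions: identical links, no pipelining, no protocol overhead.


def chain_cost(size: int, replicas: int) -> dict:
    """client -> r1 -> r2 -> ...: every sender forwards the data exactly once."""
    return {
        "max_bytes_out_per_node": size,  # no node sends more than one copy
        "sequential_hops": replicas,     # copies arrive one hop after another
    }


def star_cost(size: int, replicas: int) -> dict:
    """client -> primary, then primary -> all other replicas in parallel."""
    fan_out = max(replicas - 1, 0)
    return {
        "max_bytes_out_per_node": size * max(fan_out, 1),  # primary sends fan_out copies
        "sequential_hops": 1 if fan_out == 0 else 2,       # client hop, then parallel fan-out
    }


if __name__ == "__main__":
    print("chain:", chain_cost(1_000_000, 3))
    print("star: ", star_cost(1_000_000, 3))
```

In this model the chain keeps every node's outbound traffic at one copy (better bandwidth), while the star finishes in fewer sequential hops (better response time), which matches the slide's choice of chain for data and star for meta-data.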
  • 32. 33©MapR Technologies - Confidential Failure Handling • Containers managed at CLDB (HB, container-reports) • HB loss + upstream entity reports failure => server dead • Increment epoch at CLDB • Rearrange replication • Exact same code for files and M7 tables • No ZK needed at this level [Figure label: Container Location DataBase (CLDB)]
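The failure-handling rule on this slide (heartbeat loss plus an upstream failure report marks a server dead, then the epoch is incremented and replication is rearranged) can be sketched as a small state machine. Every class, field, and method name below is hypothetical; this illustrates the rule as stated on the slide, not the CLDB's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    cid: int
    chain: list[str]  # replica servers in replication order
    epoch: int = 0


@dataclass
class ToyCLDB:
    heartbeat_timeout: float
    containers: dict[int, Container] = field(default_factory=dict)
    last_heartbeat: dict[str, float] = field(default_factory=dict)
    reported_failed: set[str] = field(default_factory=set)

    def heartbeat(self, server: str, now: float) -> None:
        """Record a heartbeat and clear any earlier failure report."""
        self.last_heartbeat[server] = now
        self.reported_failed.discard(server)

    def report_failure(self, server: str) -> None:
        """An upstream entity says it can no longer reach `server`."""
        self.reported_failed.add(server)

    def is_dead(self, server: str, now: float) -> bool:
        """Dead only if heartbeats stopped AND a failure was reported."""
        last = self.last_heartbeat.get(server, float("-inf"))
        return (now - last > self.heartbeat_timeout) and server in self.reported_failed

    def handle_failures(self, now: float) -> None:
        """Drop dead servers from each chain and bump that container's epoch."""
        for c in self.containers.values():
            alive = [s for s in c.chain if not self.is_dead(s, now)]
            if alive != c.chain:
                c.chain = alive
                c.epoch += 1
```

A real system would also re-replicate to a new server after shrinking the chain; that step is omitted here.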
  • 33. 34©MapR Technologies - Confidential Benchmark: File creates (100B) • Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM • Same 10 nodes, but with 3X repl • Results (MapR / Other / Advantage): Rate (creates/s): 14-16K / 335-360 / 40x; Scale (files): 6B / 1.3M / 4615x [Figures: file creates/s vs. files (M), for MapR distribution and Other distribution]
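The "Advantage" column on this slide can be re-derived from the two measured columns. The sketch below uses only the slide's own numbers; the variable names are illustrative.

```python
# Re-derive the "Advantage" column from the benchmark figures on the slide.
mapr_rate = (14_000, 16_000)   # file creates/s, MapR (from the slide)
other_rate = (335, 360)        # file creates/s, other distribution (from the slide)
mapr_files = 6_000_000_000     # 6B files
other_files = 1_300_000        # 1.3M files

# Most and least favourable pairings of the two rate ranges.
rate_advantage_low = mapr_rate[0] / other_rate[1]
rate_advantage_high = mapr_rate[1] / other_rate[0]
scale_advantage = mapr_files / other_files

print(f"rate advantage: {rate_advantage_low:.0f}x to {rate_advantage_high:.0f}x")
print(f"scale advantage: {scale_advantage:.0f}x")
```

The rate range brackets the slide's rounded "40x", and the scale ratio matches its "4615x".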
  • 34. 35©MapR Technologies - Confidential Recap • HBase has a good basis – But is handicapped by HDFS – But can’t do without HDFS – HBase can’t be fixed in isolation • Separating key storage scaling parameters is key – Allows additional layer of storage indirection – Results in huge scaling and performance improvement • Low-level transactions are hard – Allow R/W file system, decentralized meta-data – Also allow non-file implementations
  • 35. 36©MapR Technologies - Confidential Part 2: An implementation tour
  • 36. 37©MapR Technologies - Confidential An Outline of Important Factors • Start with MapR FS (mutability, transactions, real snapshots) • C++ not Java (data never moves, better control) • Lockless design, custom queue executive (3 ns switch) • New RPC layer (> 1 M RPC / s) • Cut out the middle man (single hop to data) • Hybridize log-structured merge trees and B-trees • Adjust sizes and fanouts • Don’t be silly
  • 37. 38©MapR Technologies - Confidential An Outline of Important Factors • Start with MapR FS (mutability, transactions, real snapshots) • C++ not Java (data never moves, better control) • Lockless design, custom queue executive (3 ns switch) • New RPC layer (> 1 M RPC / s) • Cut out the middle man (single hop to data) • Hybridize log-structured merge trees and B-trees • Adjust sizes and fanouts • Don’t be silly. We get these all for free by putting tables into MapR FS
  • 38. 39©MapR Technologies - Confidential M7: Tables Integrated into Storage • No extra daemons to manage • One hop to data • Superior caching policies • No JVM problems
  • 39. 40©MapR Technologies - Confidential Lesson 0: Implement tables in the file system
  • 40. 41©MapR Technologies - Confidential Why Not Java? • Disclaimer: I am a pro-Java bigot • But that only goes so far … • Consider the memory size of struct {x, y}[] a; • Consider also interpreting data as it has arrived from the wire • Consider the problem of writing a micro-stack queue executive with hundreds of thousands of threads and 3 ns context switch • Consider the problem of core-locked processes running a cache-aware, lock-free, zero-copy queue of tasks • Consider the GC-free life-style
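The "memory size of struct {x, y}[]" point can be made concrete with a rough comparison. The sketch below assumes x and y are 32-bit integers, and it assumes a typical 64-bit JVM layout with compressed references (12-byte object header, 4-byte references, 8-byte alignment); those JVM figures are assumptions for illustration, not measurements from the talk.

```python
import struct

N = 1_000_000  # number of {x, y} elements (illustrative)

# Packed layout, as a C/C++ array of structs would store it:
# two 32-bit ints back to back, no per-element header.
packed_pair_bytes = struct.calcsize("<ii")  # 8 bytes
packed_total = N * packed_pair_bytes

# ASSUMED layout for an array of references to small objects on a
# 64-bit JVM with compressed references (not measured):
OBJECT_HEADER_BYTES = 12
REFERENCE_BYTES = 4
ALIGNMENT = 8

raw_object = OBJECT_HEADER_BYTES + packed_pair_bytes           # 20 bytes
object_bytes = -(-raw_object // ALIGNMENT) * ALIGNMENT         # rounded up to 24
boxed_total = N * (object_bytes + REFERENCE_BYTES)

print(f"packed: {packed_total / 1e6:.0f} MB")
print(f"boxed:  {boxed_total / 1e6:.0f} MB ({boxed_total / packed_total:.1f}x)")
```

Beyond the size difference, the packed form is contiguous in memory, while the object form adds a pointer chase per element.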
  • 41. 42©MapR Technologies - Confidential At What Cost • But writing performant C++ is hard • Managing low-level threads is hard • Implementing very fast failure recovery is hard • Doing manual memory allocation is hard (and dangerous) • Benefits outweigh costs with the right dev team • Benefits dwarfed by the costs with the wrong dev team
  • 42. 43©MapR Technologies - Confidential Lesson 1: With great speed comes great responsibility
  • 43. 44©MapR Technologies - Confidential M7 Table Architecture [Figure labels: table, tablet, tablet, partition, segment, segment, partition, tablet, tablet]
  • 44. 45©MapR Technologies - Confidential M7 Table Architecture [Figure labels: table, tablet, tablet, partition, segment, segment, partition, tablet, tablet] This structure is internal and not user-visible
  • 45. 46©MapR Technologies - Confidential Multi-level Design • Fixed number of levels like HBase • Specialized fanout to match sizes to device physics • Mutable file system allows chimeric LSM-tree / B-tree • Sized to match container structure • Guaranteed locality – If the data moves, the new node will handle it – If the node fails, the new node will handle it
  • 46. 47©MapR Technologies - Confidential Lesson 2: Physics. Not just a good idea. It’s the law.
  • 47. 48©MapR Technologies - Confidential RPC Reimplementation • At very high data rates, protobuf is too slow – Not good as an envelope, still a great schema definition language – Most systems never hit this limit • Alternative 1 – Lazy parsing allows deferral of content parsing – Naïve implementation imposes (yet another) extra copy • Alternative 2 – Bespoke parsing of envelope from the wire – Content packages can land fully aligned and ready for battle directly from the wire • Let’s use BOTH ideas
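Combining the two alternatives on this slide (a hand-parsed envelope plus lazily parsed content) might look like the sketch below. The envelope layout, the opcode field, and the use of JSON as a stand-in for a protobuf body are all assumptions made for illustration; none of this is MapR's wire format.

```python
import json
import struct

# Hypothetical fixed-size envelope: request id, opcode, payload length.
ENVELOPE = struct.Struct("<IHI")


def encode(request_id: int, opcode: int, body: dict) -> bytes:
    """Build one message: fixed envelope followed by the encoded body."""
    payload = json.dumps(body).encode("utf-8")
    return ENVELOPE.pack(request_id, opcode, len(payload)) + payload


class LazyMessage:
    """Parses the envelope eagerly; defers body parsing until it is needed."""

    def __init__(self, buf: bytes) -> None:
        view = memoryview(buf)
        self.request_id, self.opcode, length = ENVELOPE.unpack_from(view, 0)
        # Slicing a memoryview does not copy the payload bytes.
        self._payload = view[ENVELOPE.size:ENVELOPE.size + length]
        self._parsed = None

    @property
    def body(self) -> dict:
        if self._parsed is None:
            # The only copy happens here, and only if the body is actually read.
            self._parsed = json.loads(bytes(self._payload))
        return self._parsed


if __name__ == "__main__":
    msg = LazyMessage(encode(7, 2, {"key": "row1"}))
    print(msg.request_id, msg.opcode)  # routing needs only the envelope
    print(msg.body)                    # body is parsed on first access
```

A server can route, queue, or forward a message using only the envelope fields and never pay for parsing the body.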
  • 48. 49©MapR Technologies - Confidential Lesson 3: Hacking and abstraction can co-exist
  • 49. 50©MapR Technologies - Confidential Don’t Be Silly • Detailed review of the code revealed an extra copy – It was subtle. Really. • Performance increased when this was stopped • Not as easy to spot as it sounds – But absolutely still worth finding and fixing
  • 50. 51©MapR Technologies - Confidential Part 3: Results
  • 51. 52©MapR Technologies - Confidential Server Reboot • Full container-reports are tiny – CLDB needs 2 GB DRAM for 1000-node cluster • Volumes come online very fast – each volume independent of others – as soon as min-repl # of containers ready – no need to wait for whole cluster (e.g., HDFS waits for 99.9% blocks reporting) • 1000-node cluster restart < 5 mins
  • 52. 53©MapR Technologies - Confidential M7 provides Instant Recovery • 0-40 microWALs per region – idle WALs go to zero quickly, so most are empty – region is up before all microWALs are recovered – recovers region in background in parallel – when a key is accessed, that microWAL is recovered inline – 1000-10000x faster recovery • Why doesn't HBase do this? – M7 leverages unique MapR-FS capabilities, not impacted by HDFS limitations – No limit to # of files on disk – No limit to # open files – I/O path translates random writes to sequential writes on disk
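The recovery scheme on this slide (region available immediately, microWALs replayed in the background, and replayed inline when a key that needs them is read) can be sketched with a toy model. The mapping of keys to microWALs by hash is an assumption made for this sketch, and all names are hypothetical; the slide does not say how M7 assigns keys to microWALs.

```python
import zlib


def wal_index(key: str, num_wals: int) -> int:
    """ASSUMED key-to-microWAL mapping: a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_wals


class ToyRegion:
    """A region that serves reads before all of its microWALs are replayed."""

    def __init__(self, log_entries, num_wals: int) -> None:
        self.num_wals = num_wals
        wals = {i: [] for i in range(num_wals)}
        for key, value in log_entries:
            wals[wal_index(key, num_wals)].append((key, value))
        # Idle (empty) microWALs need no recovery at all.
        self.pending = {i: entries for i, entries in wals.items() if entries}
        self.data = {}

    def _recover(self, index: int) -> None:
        """Replay one microWAL in order, if it is still pending."""
        for key, value in self.pending.pop(index, []):
            self.data[key] = value

    def get(self, key: str):
        """Recover the key's microWAL inline, then serve the read."""
        self._recover(wal_index(key, self.num_wals))
        return self.data.get(key)

    def recover_one_in_background(self) -> bool:
        """Replay one pending microWAL; return True while more remain."""
        if self.pending:
            self._recover(next(iter(self.pending)))
        return bool(self.pending)
```

In this model a read only waits for the single microWAL that covers its key, so time-to-first-read does not depend on the total log size.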
  • 53. 54©MapR Technologies - Confidential Other M7 Features • Smaller disk footprint – M7 never repeats the key or column name • Columnar layout – M7 supports 64 column families – in-memory column-families • Online admin – M7 schema changes on the fly – delete/rename/redistribute tables
  • 54. 55©MapR Technologies - Confidential Binary Compatible • HBase applications work "as is" with M7 – No need to recompile (binary compatible) • Can run M7 and HBase side-by-side on the same cluster – e.g., during a migration – can access both M7 table and HBase table in same program • Use standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice-versa, viz., % hbase org.apache.hadoop.hbase.mapreduce.CopyTable oldtable
  • 55. 56©MapR Technologies - Confidential M7 vs CDH - Mixed Load 50-50
  • 56. 57©MapR Technologies - Confidential M7 vs CDH - Mixed Load 50-50
  • 57. 58©MapR Technologies - Confidential M7 vs CDH - Mixed Load 50-50
  • 58. 59©MapR Technologies - Confidential Recap • HBase has some excellent core ideas – But is burdened by years of technical debt – Much of the debt was charged on the HDFS credit cards • MapR FS provides ideal substrate for HBase-like service – One hop from client to data – Many problems never even exist in the first place – Other problems have relatively simple solutions with better foundation • Practical results bear out the theory
  • 59. 60©MapR Technologies - Confidential Me, Us • Ted Dunning, Chief Application Architect, MapR – Committer, PMC member, Mahout, Zookeeper, Drill – Bought the beer at the first HUG • MapR – Distributes more open source components for Hadoop – Adds major technology for performance and HA – Adds industry standard API’s • Tonight – Hash tag - #nosqlnow #mapr #fast – See also - @ApacheMahout @ApacheDrill @ted_dunning and @mapR

Editor's Notes

  1. The NameNode in Hadoop today is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability, and performance. With other distributions, you have a bottleneck regardless of the number of nodes in the cluster, and the maximum number of files you can support is 200M, and that is with an extremely high-end server. 50% of the processing of Hadoop at Facebook is to pack and unpack files to try to work around this limitation. MapR scales uniformly.
  2. Another major advantage with MapR is the distributed NameNode. The NameNode in Hadoop today is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability, and performance. With other distributions, you have a bottleneck regardless of the number of nodes in the cluster, and the maximum number of files you can support is between 70M and 100M. 50% of the processing of Hadoop at Facebook is to pack and unpack files to try to work around this limitation. MapR scales uniformly.
  3. This slide needs a lot of work. Can you look at layout changes?