NoSQL is often used to meet large-scale data needs, but most NoSQL solutions lack strong data integrity features. I will describe how one company is solving its big data problems using MapR M7. MapR's Data Platform (MDP) provides tables as a built-in feature of the file system and exposes an HBase-compatible API for accessing these tables. The architecture is designed to scale to very large sizes, to avoid long pauses due to compaction, and to integrate completely with MapR's snapshots and mirrors.
We will talk about a specific customer use case in which the customer's priorities of reliability, ease of use, and business continuity made M7 the best choice.
MapR’s innovations have also expanded the use cases that are possible with Hadoop. Not only do we support the full Hadoop API set, we also provide NFS support, so any file-based application can access the cluster with no changes or rewrites required. MapR provides ODBC support, so any database application or SQL-based tool can access and manipulate data in a MapR cluster. MapR also supports real-time streaming access. Together these greatly expand the applications that are possible with Hadoop, moving it beyond batch-only processing. Finally, the full HA, DR, and data protection capabilities of MapR allow mission-critical apps to be deployed safely and allow administrators to meet stringent SLA targets.
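To make the NFS point concrete, here is a minimal sketch of what "no changes or rewrites" means for a file-based application: the code uses only ordinary file I/O and has no Hadoop client in it. The function names, the `logs/app.log` path, and the mount point are hypothetical; in the demo a temporary directory stands in for wherever the cluster happens to be NFS-mounted.

```python
import os
import tempfile


def append_event(cluster_root, relative_path, line):
    """Append one text line to a file under the (assumed) NFS mount point."""
    target = os.path.join(cluster_root, relative_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # Plain append; nothing here is Hadoop-specific.
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return target


def read_events(cluster_root, relative_path):
    """Read the file back as a list of lines, again with ordinary file I/O."""
    target = os.path.join(cluster_root, relative_path)
    with open(target, "r", encoding="utf-8") as handle:
        return [row.rstrip("\n") for row in handle]


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the real mount point.
    with tempfile.TemporaryDirectory() as mount_point:
        append_event(mount_point, "logs/app.log", "started")
        append_event(mount_point, "logs/app.log", "stopped")
        print(read_events(mount_point, "logs/app.log"))
```

Pointing `cluster_root` at a real mount instead of the temporary directory would be the only change needed, assuming the cluster is mounted on the client machine.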
The NameNode in Hadoop today is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability, and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster. The largest number of files they can support is about 200M, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation. MapR scales uniformly.
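A rough back-of-the-envelope sketch shows why a single NameNode caps the file count: all namespace metadata has to fit in one server's memory. The figure of roughly 150 bytes of heap per namespace object (file or block) is a commonly quoted rule of thumb and is an assumption here, not a number from this talk; directories are ignored and one block per file is assumed, so treat the result as an order-of-magnitude estimate only.

```python
# Assumed rule of thumb, not a measured value: about 150 bytes of
# NameNode heap per namespace object (one per file, one per block).
BYTES_PER_NAMESPACE_OBJECT = 150


def estimate_namenode_heap_bytes(num_files, blocks_per_file=1,
                                 bytes_per_object=BYTES_PER_NAMESPACE_OBJECT):
    """Estimate the heap needed to hold metadata for num_files files."""
    # Each file contributes one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


if __name__ == "__main__":
    heap = estimate_namenode_heap_bytes(200_000_000)
    print(f"{heap / 1e9:.0f} GB")  # prints "60 GB"
```

Under these assumptions, 200M small files already need tens of gigabytes of heap on one machine, and adding more data nodes does nothing to raise that ceiling.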
(Ed. note: this slide is a great whiteboard slide to summarize M7.) The stack on the left represents the HBase architecture found in all other distributions. HBase runs in a JVM and stores its data in the HDFS layer, which runs in another JVM and in turn stores its data in the Linux file system (ext3), which writes the data to disk. This stack results in a lot of administrative tasks, performance issues, and reliability issues. Much of the infrastructure within HBase is an attempt to make up for the deficiencies in HDFS. You basically have a database solution that needs to handle random I/O running on top of a write-once file system.
The middle stack shows how MapR simplified the lower part of the stack with our M5 edition, which replaced HDFS and the dependency on the Linux file system with a random read/write storage layer. However, HBase is still a separate infrastructure running on top of the storage layer within M5. The region servers are separate, and users still experience downtime and delays when recovering from node failures and snapshots.
With M7, on the far right, MapR has now unified tables and files in a single data platform. We’ve eliminated the separate HBase infrastructure, so the environment is much simpler to manage without the various redundant components. We’ve provided a uniform data management layer and a consistent data protection layer across files and tables. Recovery from node failures takes seconds, there is 100% data locality, and HBase can read directly from snapshots. Files and tables live in the same namespace, volumes, and directories.