Design, Scale & Performance of the MapR Distribution

M.C. Srivas
CTO, MapR Technologies, Inc.

6/29/2011, © MapR Technologies, Inc.
Outline of Talk
• What does MapR do?
• Motivation: why build this?
• Distributed NameNode Architecture
  • Scalability factors
  • Programming model
  • Distributed transactions in MapR
• Performance across a variety of loads


Complete Distribution
• Integrated, tested, hardened
• Super simple
• Unique advanced features
• 100% compatible with MapReduce, HBase, HDFS APIs
• No recompile required – drop in and use now
MapR Areas of Development

[Diagram: MapR's areas of development – Map/Reduce, HBase, Ecosystem, Storage Services, and Management]
JIRAs Open For Year(s)
• HDFS-347 – 7/Dec/08 – Streaming perf sub-optimal
• HDFS-273, 395 – 7/Mar/07 – DFS scalability problems, optimize block-reports
• HDFS-222, 950 – Concatenate files into larger files
  – Tom White on 2/Jan/09: "Small files are a big problem for Hadoop ... 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible."
• HDFS Append – no 'blessed' Apache Hadoop distro has the fix
• HDFS-233 – 25/Jun/08 – Snapshot support
  – Dhruba Borthakur on 10/Feb/09: "...snapshots can be designed very elegantly only if there is complete separation between namespace management and block management."
Observations on Apache Hadoop
• Inefficient (HDFS-347)
• Scaling problems (HDFS-273)
  – NameNode bottleneck (HDFS-395)
  – Limited number of files (HDFS-222)
• Admin overhead significant
• NameNode failure loses data
  – Not trusted as a permanent store
• Write-once
  – Data lost unless the file is closed
  – hflush/hsync – unrealistic to expect folks will re-write apps

[Chart: READ/WRITE throughput in MB/sec (0-1200 scale), raw HARDWARE vs. HDFS]
MapR Approach
• Some are architectural issues
• Change at that level is a big deal
  – Will not be accepted unless proven
  – Hard to prove without building it first

• Build it and prove it
  – Improve reliability significantly
  – Make it tremendously faster at the same time
  – Enable a new class of apps (e.g., real-time analytics)
HDFS Architecture Review
• Files are broken into blocks
  – Distributed across DataNodes
• NameNode holds (in memory)
  – Directories, files
  – Block replica locations
• DataNodes
  – Serve blocks
  – No idea about files/dirs
  – All ops go to the NN

[Diagram: files sharded into blocks; DataNodes store the blocks]
HDFS Architecture Review
DataNode (DN) reports its blocks to the NameNode (NN)
• A large DN does 60K blocks/report (256M x 60K = 15T = 5 disks @ 3T per DataNode)
• >100K blocks causes extreme load
• A 40GB NN restart takes 1-2 hours

The addressing unit is an individual block
• The flat block-address space forces DNs to send giant block-reports
• The NN can hold about ~300M blocks max
  – Limits cluster size to 10's of petabytes
  – Increasing the block size negatively impacts map/reduce

[Diagram: DataNodes reporting their blocks to the NameNode]
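Spelling out the slide's arithmetic (assuming 256MB blocks throughout):

  60,000 blocks × 256 MB ≈ 15 TB = 5 disks × 3 TB
  3×10^8 blocks × 256 MB ≈ 77 PB

which is why the NN's ~300M-block ceiling caps clusters at "10's of petabytes".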
How to Scale
• A central meta server does not scale
  – Make every server a meta-data server too
  – But map/reduce needs the memory
    • Must page meta-data to disk
• Reduce the size of block-reports
  – while increasing the number of blocks per DN
• Reduce the memory footprint of the location service
  – cannot add memory indefinitely
• Need fast restart (HA)
MapR Goal: Scale to 1000X

             HDFS          MapR
  # files    150 million   1 trillion
  # data     10-50 PB      1-10 Exabytes
  # nodes    2000          10,000+

Full random read/write semantics
• export via NFS and other protocols
• with enterprise-class reliability: instant restart, snapshots, mirrors, no single point of failure, …

Run close to hardware speeds
• On extreme scale, efficiency matters extremely
• exploit emerging technology like SSD and 10GE
MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks. Containers are 16-32 GB segments of disk, placed on nodes.
• Each container contains
  – Directories & files
  – Data blocks
• Replicated on servers
• No need to manage containers directly: use MapR Volumes

(Patent pending)
MapR Volumes
Significant advantages over "cluster-wide" or "file-level" approaches: volumes allow management attributes to be applied in a scalable way, at a very granular level, and with flexibility.
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• Usage tracking
• Administrative permissions

100K volumes are OK – create as many as desired!

[Diagram: example volume tree – /projects/tahoe, /projects/yosemite, /user/msmith, /user/bjohnson]
MapR Distributed NameNode
Containers are tracked globally
• Clients cache container & server info for extended periods
• A client fetches container locations from the NameNode map, then contacts a server directly to read data from the container

[Diagram: NameNode map listing each container's replica servers (e.g., S1, S2, S4); the client looks up locations once, then reads from servers S1-S5 directly]
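As a sketch of that lookup path – with hypothetical names, not MapR's actual client API – the cache-then-contact pattern might look like this:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative client-side container-location cache (names are
// assumptions): locations come from the global NameNode map once,
// then reads go straight to a data server until the entry expires.
using ContainerId = uint64_t;
using ServerId    = uint32_t;

class ContainerLocationCache {
public:
    const std::vector<ServerId>& locate(ContainerId cid) {
        auto it = cache_.find(cid);
        if (it == cache_.end())                    // cache miss: one RPC
            it = cache_.emplace(cid, fetchFromGlobalMap(cid)).first;
        return it->second;                         // hit: no NameNode traffic
    }

private:
    // Stub for the RPC to the global container map (hypothetical).
    std::vector<ServerId> fetchFromGlobalMap(ContainerId /*cid*/) {
        return {1, 2, 4};                          // e.g. container on S1, S2, S4
    }
    std::unordered_map<ContainerId, std::vector<ServerId>> cache_;
};
```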
MapR's Distr NameNode Scaling
Containers represent 16-32GB of data
• Each can hold up to 1 billion files and directories
• 100M containers = ~2 Exabytes (a very large cluster)

250 bytes of DRAM to cache a container
• 25GB to cache all containers for a 2EB cluster
  – But not necessary: can page to disk
• A typical large 10PB cluster needs 2GB

Container-reports are 100x-1000x smaller than HDFS block-reports
• Serve 100x more data-nodes
• Increase the container size to 64GB to serve a 4EB cluster
  – Map/reduce not affected
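Checking the headline numbers (taking ~20 GB as a representative container size):

  10^8 containers × 20 GB ≈ 2 EB of data
  10^8 containers × 250 B = 25 GB of DRAM to cache every location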
MapR Distr NameNode HA

MapR:
1. apt-get install mapr-cldb while the cluster is online

Apache Hadoop*:
1. Stop the cluster very carefully
2. Move fs.checkpoint.dir onto NAS (e.g. NetApp)
3. Install and configure the DRBD + Heartbeat packages
   i. yum -y install drbd82 kmod-drbd82 heartbeat
   ii. chkconfig --add heartbeat (both machines)
   iii. edit /etc/drbd.conf on 2 machines
   iv-xxxix. make a raid-0 md, ask drbd to manage the raid md, zero it if drbd dies & try again
   xl. mkfs ext3 on it, mount /hadoop (both machines)
   xli. install all rpms in /hadoop, but don't run them yet (chkconfig off)
   xlii. umount /hadoop (!!)
   xliii. edit 3 files in /etc/ha.d/* to configure heartbeat
...
40. Restart the cluster. If any problems, start at /var/log/ha.log for hints on what went wrong.

*As described in www.cloudera.com/blog/2009/07/hadoop-ha-configuration by Christophe Bisciglia, Cloudera.
Step Back & Rethink the Problem
Big disruption in the hardware landscape:

                     Year 2000    Year 2012
  # cores per box    2            128
  DRAM per box       4GB          512GB
  # disks per box    250+         12
  Disk capacity      18GB         6TB
  Cluster size       2-10         10,000

• No spin-locks/mutexes, 10,000+ threads
• Minimal footprint – preserve resources for the app
• Rapid re-replication, scale to several exabytes
MapR's Programming Model
Written in C++, and asynchronous throughout:
    ioMgr->read(…, callbackFunc, void *arg)
Each module runs requests from its request-queue
• One OS thread per cpu-core
• Dispatch: map container -> queue -> cpu-core
• Callback guaranteed to be invoked on the same core
  – No mutexes needed
• When load increases, add a cpu-core and move some queues to it

State machines on each queue
• A 'thread stack' is 4K, so 10,000+ threads cost ~40MB
• A context-switch is 3 instructions; 250K context-switches/core/sec is OK!
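A minimal sketch of that style – per-core request queues with callbacks dispatched back onto the owning core, so no locks are needed. The names (Request, CoreQueue, queueFor) are illustrative, not MapR's actual API:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// One-thread-per-core dispatch: each request carries a state-machine
// step (callback) that runs only on the core that owns the queue,
// so per-core state needs no mutexes.
struct Request {
    std::function<void()> work;      // state-machine step to run
};

class CoreQueue {                    // one instance per cpu-core
public:
    void post(Request r) { q_.push_back(std::move(r)); }
    void runOnce() {                 // invoked only by the owning thread
        while (!q_.empty()) {
            Request r = std::move(q_.front());
            q_.pop_front();
            r.work();                // callback runs on this same core
        }
    }
private:
    std::deque<Request> q_;          // single-threaded: no locking
};

// Containers hash onto queues, so all ops on a container serialize
// on one core and never contend with other cores.
CoreQueue& queueFor(std::vector<CoreQueue>& cores, size_t containerId) {
    return cores[containerId % cores.size()];
}
```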
MapR on Linux
• A user-space process, avoids system crashes
• Minimal footprint
  – Preserves cpu, memory & resources for the app
    • uses only 1/5th of system memory
    • runs on 1 or 2 cores, others left for the app
  – Emphasis on efficiency, avoids lots of layering: raw devices, direct-IO, doesn't use the Linux VM
• CPU/memory firewalls implemented
  – runaway tasks no longer impact system processes
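As a generic illustration of the raw-device, direct-IO style the slide refers to (not MapR code; the device path is an example), Linux direct I/O bypasses the page cache like this:

```cpp
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

// O_DIRECT bypasses the Linux page cache, so the process manages its
// own caching; buffers and offsets must be sector-aligned.
int main() {
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);   // raw device (example path)
    if (fd < 0) return 1;
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  // alignment required
    ssize_t n = pread(fd, buf, 4096, 0);              // direct read of one block
    free(buf);
    close(fd);
    return n == 4096 ? 0 : 1;
}
```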
Random Writing in MapR

[Diagram: a client writing data asks for a 64M block; the NameNode map creates a container and picks a master plus 2 replica slaves from the servers S1-S5; the slaves attach to the master, and the client writes each next chunk directly to the master (S2 in the figure)]
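A rough sketch of that write path from the client's side; the names, stubbed RPCs, and the 1MB wire-chunk size are all assumptions, not MapR's protocol:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical client write loop: get a 64M block lease (container +
// master replica) from the NameNode map, then stream chunks to the
// master, which replicates down the chain of slaves.
struct BlockLease { uint64_t containerId; uint32_t masterServer; };

static BlockLease allocate64MBlock() {            // stub for the RPC to the
    return {42, 2};                               // NameNode map (e.g. S2 master)
}
static void sendChunk(uint32_t /*server*/, const char* /*p*/, size_t /*n*/) {
    // stub: ship bytes to the master replica
}

void writeStream(const std::vector<char>& data) {
    const size_t kChunk = 1 << 20;                // assumed 1MB chunks
    BlockLease lease = allocate64MBlock();
    for (size_t off = 0; off < data.size(); off += kChunk) {
        size_t n = std::min(kChunk, data.size() - off);
        sendChunk(lease.masterServer, data.data() + off, n);
    }
}
```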
MapR's Distributed NameNode
• Distributed transactions stitch the containers together
• Each node uses a write-ahead log
  – Supports both value-logging and operational-logging
    • Value log record = { disk-offset, old, new }
    • Op log record = { op-details, undo-op, redo-op }
  – Recovery in 2 seconds
  – 'Global ids' enable participation in distributed transactions
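The two record shapes, rendered as structs. Only the { old, new } and { undo, redo } structure comes from the slide; the field types are assumptions:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct ValueLogRecord {             // physical (value) logging
    uint64_t diskOffset;            // where the bytes live on disk
    std::vector<uint8_t> oldBytes;  // pre-image, replayed for undo
    std::vector<uint8_t> newBytes;  // post-image, replayed for redo
};

struct OpLogRecord {                // logical (operational) logging
    std::string opDetails;          // e.g. "rename(a, b)"
    std::string undoOp;             // inverse op, e.g. "rename(b, a)"
    std::string redoOp;             // re-applied on recovery
};
```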
2-Phase Commit Unsuitable
• BeginTrans .. work .. Commit
• On app-commit
  – C (the coordinator) force-writes its log and sends prepare to each P (participant)
  – P sends prepare-ack and gives up its right to abort
    • It then waits for C, even across crashes/reboots
  – P unlocks only when the commit is received

Too many message exchanges; a single failure can lock up the entire cluster.

[Diagram: coordinator C exchanging prepare/ack messages with participants P]
Quorum-completion Unsuitable
• BeginTrans .. work .. Commit
• On app-commit
  – C broadcasts prepare
  – If a majority responds, C commits
  – If not, the cluster goes into election mode
  – If no majority is found, everything fails

Update throughput is very poor, and it does not work with < N/2 nodes. Monolithic. Hierarchical? Cycles? Oh no!!

[Diagram: coordinator C broadcasting to participants P]
MapR Lockless Transactions
• BeginTrans + work + Commit
• No explicit commit
• Uses rollback
  – confirm callback, piggy-backed on other messages
  – Undo on confirmed failure
  – Any replica can confirm

Update throughput is very high; no locks are held across messages. Crash resistant, cycles OK. (Patent pending)

[Diagram: nodes NN1-NN4 confirming each other's updates]
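The slide gives only the outline; purely as a cartoon of "apply now, confirm later, undo on confirmed failure" – not the patent-pending protocol itself:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Cartoon of optimistic, lockless update: apply the change
// immediately, keep an undo closure, and let a later piggy-backed
// confirmation either discard the undos (success) or run them
// (confirmed failure). No locks are held while waiting.
class PendingUpdates {
public:
    void applied(std::function<void()> undo) {
        undos_.push_back(std::move(undo));
    }
    void confirmedOk() { undos_.clear(); }         // drop undo records
    void confirmedFailed() {                       // roll back, newest first
        for (auto it = undos_.rbegin(); it != undos_.rend(); ++it) (*it)();
        undos_.clear();
    }
private:
    std::vector<std::function<void()>> undos_;
};
```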
Small Files (Apache Hadoop, 10 nodes)

Op: create a file, write 100 bytes, close.
Notes:
• NN not replicated
• NN uses 20G DRAM
• DN uses 2G DRAM

[Chart: create rate (files/sec) vs. # of files (millions), for the out-of-box and tuned configurations]
MapR Distributed NameNode
Same 10 nodes, but with 3x replication added …

[Chart: create rate (100-byte files/sec) vs. # of files (millions); annotated "Test stopped here"]
MapR's Data Integrity
• End-to-end check-sums on all data (not optional)
  – Computed in the client's memory, written to disk at the server
  – On read, validated at both client & server
• RPC packets have their own independent check-sum
  – Detects RPC msg corruption
• Transactional with ACID semantics
  – Meta data, incl. the log itself, is check-summed
  – Allocation bitmaps written to two places (dual blocks)
• Automatic compression built-in
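A sketch of the end-to-end idea: the client computes the checksum over its own buffer, ships { data, checksum }, and every reader re-computes and compares before trusting the bytes. The actual algorithm and framing are not stated on the slide; CRC32 here is an assumption:

```cpp
#include <cstddef>
#include <cstdint>

// Standard reflected CRC32 (polynomial 0xEDB88320), bit-at-a-time.
uint32_t crc32(const uint8_t* p, size_t n) {
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; ++i) {
        c ^= p[i];
        for (int k = 0; k < 8; ++k)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
    }
    return ~c;
}

// Run at the server on write and at BOTH client and server on read.
bool verifyChecksum(const uint8_t* data, size_t n, uint32_t storedCrc) {
    return crc32(data, n) == storedCrc;
}
```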
MapR’s Random-Write Eases Data Import

With MapR, use NFS:
1. mount /mapr (real-time, HA)

Otherwise, use Flume/Scribe:
1. Set up sinks (find unused machines??)
2. Set up intrusive agents
   i. tail(“xxx”), tailDir(“y”)
   ii. agentBESink
3. All reliability levels lose data
   i. best-effort
   ii. one-shot
   iii. disk fail-over
   iv. end-to-end
4. Data not available now
MapR's Streaming Performance

[Charts: READ/WRITE throughput in MB/sec (0-2250 scale) for raw hardware, MapR, and Hadoop, on 11 x 7200rpm SATA disks (left) and 11 x 15Krpm SAS disks (right); higher is better]

Tests: i. 16 streams x 120GB; ii. 2000 streams x 1GB
HBase on MapR
YCSB Insert with 1 billion 1K records
10+1 node cluster: 8 cores, 24GB DRAM, 11 x 1TB 7200 RPM

[Chart: insert throughput in 1000s of records/sec (0-600 scale), MapR vs. Apache, with WAL off and WAL on; higher is better]
HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 cores, 24GB DRAM, 11 x 1TB 7200 RPM

[Chart: read throughput in records/sec (0-25000 scale), MapR vs. Apache, for Zipfian and Uniform request distributions; higher is better]
Terasort on MapR
10+1 nodes: 8 cores, 24GB DRAM, 11 x 1TB SATA 7200 rpm

[Chart: elapsed time in minutes, MapR vs. Hadoop, at 1.0 TB (0-60 scale) and 3.5 TB (0-300 scale); lower is better]
PigMix on MapR

[Chart: time in seconds (0-4000 scale), MapR vs. Hadoop, across the PigMix queries; lower is better]
Summary
• Fully HA
  – JobTracker, snapshots, mirrors, multi-cluster capable
• Super simple to manage
  – NFS mountable
• Complete read/write semantics
  – Can see file contents immediately
• MapR has signed the Apache CCLA
  – ZooKeeper, Mahout, YCSB, and HBase fixes contributed
  – Continue to contribute more and more
• Download it at www.mapr.com
