SSD Deployment Strategies for MySQL


Yoshinori Matsunobu
Lead of MySQL Professional Services APAC
Sun Microsystems
Yoshinori.Matsunobu@sun.com


What do you need to consider? (H/W layer)

       • SSD or HDD?
       • Interface
              – SATA/SAS or PCI-Express?
       • RAID
              – H/W RAID, S/W RAID or JBOD?
       • Network
              – Is 1GbE enough?
       • Memory
              – Is 2GB RAM + PCI-E SSD faster than 64GB RAM +
                8HDDs?
       • CPU
              – Nehalem or older Xeon?

What do you need to consider?

       • Redundancy
              –   RAID
              –   DRBD (network mirroring)
              –   Semi-Sync MySQL Replication
              –   Async MySQL Replication

       • Filesystem
       – ext3, xfs, or raw device?


       • File location
              – Data file, Redo log file, etc


       • SSD specific issues
              – Write performance deterioration
              – Write endurance
Why SSD? IOPS!
 •    IOPS: Number of (random) disk i/o operations per second

 •    Almost all database operations require random access
        –   Selecting records by index scan
        –   Updating records
        –   Deleting records
        –   Modifying indexes

 •    Regular SAS HDD: 200 IOPS per drive (disk seek & rotation are slow)

 •    SSD: 2,000+ (writes) / 5,000+ (reads) per drive
        – highly dependent on the SSD and device driver

 •    Let's start with basic benchmarks



Tested HDD/SSD for this session

       • SSD
              – Intel X25-E (SATA, 30GB, SLC)
              – Fusion I/O (PCI-Express, 160GB, SLC)


       • HDD
              – Seagate 160GB SAS 15000RPM




Table of contents

       • Basic Performance on SSD/HDD
              –   Random Reads
              –   Random Writes
              –   Sequential Reads
              –   Sequential Writes
              –   fsync() speed
              –   Filesystem difference
              –   IOPS and I/O unit size


       • MySQL Deployments




Random Read benchmark
[Chart: Direct Random Read IOPS (single drive, 16KB, xfs): IOPS vs. # of I/O threads (1-200) for HDD, Intel SSD, and Fusion I/O]
•   HDD: 196 reads/s at 1 i/o thread, 443 reads/s at 100 i/o threads
•   Intel : 3508 reads/s at 1 i/o thread, 14538 reads/s at 100 i/o threads
•   Fusion I/O : 10526 reads/s at 1 i/o thread, 41379 reads/s at 100 i/o threads
•   Single thread throughput on Intel is 16x better than on HDD, Fusion is 25x better
•   SSD’s concurrency (4x) is much better than HDD’s (2.2x)
•   Very strong reason to use SSD
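
A minimal fio sketch for reproducing this kind of benchmark (assumptions: fio with libaio is installed, and /dev/sdb is the device under test; the test is read-only; sweep --numjobs to emulate the thread counts above):

    # 16KB direct random reads with 100 concurrent jobs, 60-second run
    fio --name=randread --filename=/dev/sdb --direct=1 --rw=randread \
        --bs=16k --ioengine=libaio --iodepth=1 --numjobs=100 \
        --group_reporting --runtime=60 --time_based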
High Concurrency




   • A single SSD drive has multiple NAND flash memory chips
     (i.e. 40 x 4GB flash memory = 160GB)
   • Highly dependent on the I/O controller and the application
          – A single-threaded application cannot gain the concurrency advantage
PCI-Express SSD
[Diagram: CPU → North Bridge (PCI-Express controller) → SSD I/O controller → Flash at 2GB/s (PCI-Express x8), vs. CPU → South Bridge (SAS/SATA controller) → SSD I/O controller → Flash at 300MB/s]
        •    Advantage
               – PCI-Express is a much faster interface than SAS/SATA

        •    (current) Disadvantages
               – Most motherboards have a limited # of PCI-E slots
               – No hot-swap mechanism
Write performance on SSD
[Chart: Random Write IOPS (16KB blocks), 1 vs. 100 i/o threads, for HDD (4-disk RAID10, xfs), Intel (xfs), and Fusion I/O (xfs)]

        •    Very strong reason to use SSD
        •    But wait: can we get high write throughput *anytime*?
               – Not always. Let's check how data is written to flash memory

Understanding how data is written to SSD (1)
[Diagram: flash memory chips; each chip contains blocks, and each block contains pages]

 •    A single SSD drive consists of many flash memory chips (i.e. 2GB each)
 •    A flash memory chip internally consists of many blocks (i.e. 512KB each)
 •    A block internally consists of many pages (i.e. 4KB each)
 •    It is *not* possible to overwrite a non-empty block
        –   Reading from pages is possible
        –   Writing to pages in an empty block is possible
        –   Appending is possible
        –   Overwriting pages in a non-empty block is *not* possible
Understanding how data is written to SSD (2)
[Diagram: new data cannot be overwritten into a non-empty block; it is written to an empty block instead]
  •    Overwriting a non-empty block is not possible
  •    New data is written to an empty block instead
  •    Writing to an empty block is fast (~200 microseconds)
  •    Even though applications write to the same positions in the same files
       (i.e. the InnoDB log file), the written pages/blocks are distributed
       across the device (wear leveling)




Understanding how data is written to SSD (3)
[Diagram: rewriting a used block: 1. read all pages from the block, 2. erase the block, 3. write all data back together with the new data]

     • In the long run, almost all blocks will be fully used
           – i.e. Allocating 158GB files on 160GB SSD
    • New empty block must be allocated on writes
    • Basic steps to write new data:
           – 1. Reading all pages from a block
           – 2. ERASE the block
           – 3. Writing all data w/ new data into the block
     • ERASE is a very expensive operation (it takes a few milliseconds)
    • At this stage, write performance becomes very slow because of
      massive ERASE operations
[Diagram: SSD data space and reserved space: 1. pages are read from a used block, 2. data is written into an empty reserved block, while background jobs ERASE unused blocks]
•    To keep write performance high enough, SSDs have a "reserved space"
     feature
•    The data size visible to applications is limited to the size of the
     data space
       – i.e. 160GB SSD, 120GB data space, 40GB reserved space
•    Fusion I/O provides a utility to change the reserved space size
       – # fio-format -s 96G /dev/fct0
Write performance deterioration
[Chart: write IOPS deterioration (16KB random writes): fastest vs. slowest observed IOPS for Intel, Fusion(150G), Fusion(120G), Fusion(96G), and Fusion(80G) under continuous write-intensive workloads; IOPS recovers after writes stop for a while]
  •    At the beginning, write IOPS was close to the "Fastest" line
  •    When massive writes happened, write IOPS gradually deteriorated toward
       the "Slowest" line (because massive ERASE operations happened)
  •    Increasing reserved space improves steady-state write throughput
  •    Write IOPS recovered to "Fastest" after stopping writes for a long time
       (many blocks were ERASEd by background jobs)
  •    Highly dependent on the flash memory and I/O controller (TRIM support,
       ERASE scheduling, etc.)
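
A hedged way to observe this deterioration (assumptions: fio is installed, and /dev/fioa is a Fusion I/O scratch device that may be overwritten): run the same random-write job repeatedly and watch the reported IOPS fall from the "Fastest" toward the "Slowest" level.

    # Each round runs 15 minutes of sustained 16KB random writes
    for round in 1 2 3 4 5 6; do
        echo "round $round"
        fio --name=deteriorate --filename=/dev/fioa --direct=1 \
            --rw=randwrite --bs=16k --ioengine=libaio --iodepth=32 \
            --runtime=900 --time_based | grep -i iops
    done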
Sequential I/O
[Chart: sequential read/write throughput in MB/s (1MB consecutive reads/writes) for 4 HDD (RAID10, xfs), Intel (xfs), and Fusion I/O (xfs)]
       • Typical scenarios: full table scan (read), logging/journaling (write)
       • SSD outperforms HDD for sequential reads, but the gap is less significant
       • HDD (4-disk RAID10) is fast enough for sequential i/o
       • The data volume of sequential writes tends to be huge, so you need
         to care about write deterioration on SSD
       • No strong reason to use SSD for sequential writes
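
A sketch of the sequential part of this benchmark (assumptions: fio is installed, and /data/seqtest is a file on the filesystem under test; 1MB requests as in the chart):

    # 1MB sequential reads, then 1MB sequential writes, direct i/o
    fio --name=seqread  --filename=/data/seqtest --size=8g --direct=1 \
        --rw=read  --bs=1m --ioengine=libaio --iodepth=4
    fio --name=seqwrite --filename=/data/seqtest --size=8g --direct=1 \
        --rw=write --bs=1m --ioengine=libaio --iodepth=4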
fsync() speed
[Chart: fsync/sec for 1KB, 8KB, and 16KB writes on HDD (xfs), Intel (xfs), and Fusion I/O (xfs)]
       • 10,000+ fsync/sec is fine in most cases
       • Fusion I/O was CPU bound (%system), not I/O bound
         (%iowait).
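
One way to approximate this fsync test (a sketch, assuming fio is installed and /data/fsynctest is writable; --fsync=1 issues fsync() after every write, so the reported write IOPS is effectively fsync/sec):

    fio --name=fsynctest --filename=/data/fsynctest --size=1g \
        --rw=write --bs=8k --ioengine=sync --fsync=1 \
        --runtime=60 --time_based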

HDD is fast for sequential writes / fsync

 • Best practice: writes can be boosted by using a BBWC (battery-backed
   write cache) on the RAID controller, especially for REDO logs
   (because they are written sequentially)
 • No strong reason to use SSDs here

[Diagram: a bare disk pays seek & rotation time on every write; with a battery-backed write cache in front of the disk, writes are acknowledged from the cache and flushed to disk in the background]
Filesystem matters
[Chart: random write IOPS (16KB blocks), 1 vs. 16 threads, on Fusion I/O with ext3, xfs, and the raw device]

       • On xfs, multiple threads can write to the same file concurrently if
         it is opened with O_DIRECT; on ext* they cannot
       • Good concurrency on xfs, close to the raw device
       • ext3 is less optimized for Fusion I/O
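
A sketch that exposes this filesystem difference (assumptions: fio is installed, and /xfs/testfile sits on the filesystem under test; all 16 jobs open the same file with O_DIRECT; repeat with a file on ext3 to compare):

    fio --name=concwrite --filename=/xfs/testfile --size=4g --direct=1 \
        --rw=randwrite --bs=16k --ioengine=libaio --iodepth=1 \
        --numjobs=16 --group_reporting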

Changing I/O unit size
[Chart: read IOPS vs. concurrency (1-200) for 1KB, 4KB, and 16KB I/O units on 4 HDD RAID10]
        • On HDD, at most a 22% performance difference was found between
          1KB and 16KB I/O units
        • No big difference when concurrency < 10

Changing I/O unit size on SSD
[Chart: read IOPS vs. concurrency (1-200) for 1KB, 4KB, and 16KB I/O units on Fusion I/O]

        • Huge difference
        • On SSDs, not only IOPS but also the I/O transfer size matters
        • It's worth considering "configurable block size" support in
          storage engines
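
A sketch for sweeping the I/O unit size (assumptions: fio is installed, and /dev/fioa is the device under test; the test is read-only):

    for bs in 1k 4k 16k; do
        echo "block size: $bs"
        fio --name=bs-test --filename=/dev/fioa --direct=1 --rw=randread \
            --bs=$bs --ioengine=libaio --iodepth=64 --runtime=60 \
            --time_based | grep -i iops
    done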

Let’s start MySQL benchmarking
   • Base: Disk-bound application (DBT-2) running on:
          –   Sun Fire X4270
          –   Nehalem 8 Core
          –   4 HDD
          –   RAID1+0, Write Cache with Battery


   • What will happen if …
          –   Replacing HDD with Intel SSD (SATA)
          –   Replacing HDD with Fusion I/O (PCI-E)
          –   Moving log files and ibdata to HDD
          –   Not using Nehalem
          –   Using two Fusion I/O drives with Software RAID1
          –   Deploying DRBD protocol B or C
                 • Replacing 1GbE with 10GbE
          – Using MySQL 5.5.4

DBT-2 condition

       •    SuSE Enterprise Linux 11, xfs
       •    MySQL 5.5.2M2 (InnoDB Plugin 1.0.6)
       •    200 Warehouses (20GB – 25GB hot data)
       •    Buffer pool size
              –   1GB
              –   2GB
              –   5GB
              –   30GB (large enough to cache all data)


       • 1000 seconds warm up time
       • Running 3600 seconds (1 hour)
       • Fusion I/O: 96GB data space, 64GB reserved space
HDD vs Intel SSD
                   HDD       Intel
Buffer pool 1G     1125.44   5709.06

(NOTPM: transactions per minute)



           •    Storing all data on HDD or Intel SSD
           •    Massive disk i/o happens
                  – Random reads for all accesses
                  – Random writes for updating rows and indexes
                  – Sequential writes for REDO log files, etc
           •    SSD is very good at these kinds of workloads
            •    5x performance improvement, without any application change!


HDD vs Intel SSD vs Fusion I/O

                   HDD       Intel     Fusion I/O
Buffer pool 1G     1125.44   5709.06   15122.75




              •    Fusion I/O is a PCI-E based SSD
              •    PCI-E is much faster than SAS/SATA
               •    13x improvement compared to 4 HDDs




Which should we spend money on, RAM or SSD?
                   HDD        Intel     Fusion I/O
Buffer pool 1G     1125.44    5709.06   15122.75
Buffer pool 2G     1863.19
Buffer pool 5G     4385.18
Buffer pool 30G    36784.76
(caching all hot data)
         •    Increasing RAM (the buffer pool size) reduces random disk reads
                – because more data is cached in the buffer pool
         •    If all data is cached, only disk writes (both random and
              sequential) happen
         •    Disk writes happen asynchronously, so application queries can
              be much faster
         •    Large enough RAM + HDD outperforms too little RAM + SSD
Which should we spend money on, RAM or SSD?
                   HDD        Intel      Fusion I/O
Buffer pool 1G     1125.44    5709.06    15122.75
Buffer pool 2G     1863.19    7536.55    20096.33
Buffer pool 5G     4385.18    12892.56   30846.34
Buffer pool 30G    36784.76   -          57441.64
(caching all hot data)

         •    It is not always possible to cache all hot data
         •    Fusion I/O + a good amount of memory (5GB) was pretty good

         •    A basic rule can be:
                – If you can cache all active data: large enough RAM + HDD
                – If you can't, or if you need extremely high throughput:
                  spend on both RAM and SSD
Let’s think about MySQL file location
  •    SSD is extremely good at random reads
  •    SSD is very good at random writes
  •    HDD is good enough at sequential reads/writes
  •    No strong reason to use SSD for sequential writes

  •    Random I/O oriented:
         – Data Files (*.ibd)
                • Sequential reads if doing full table scan
         – Undo log, insert buffer (ibdata)
                • UNDO tablespace (small in most cases, except when running
                  long batch jobs)
                • On-disk insert buffer space (small in most cases, except
                  when InnoDB cannot keep up with updating indexes)

  •    Sequential Write oriented:
         – Doublewrite buffer (ibdata)
                • Write volume is equal to that of the *ibd files: huge
         – Binary log (mysql-bin.XXXXXX)
         – Redo log (ib_logfile)
         – Backup files
Moving sequentially written files into HDD
                    Fusion I/O            Fusion I/O + HDD      Up
Buffer pool 1G      15122.75              19295.94              +28%
                    (us=25%, wa=15%)      (us=32%, wa=10%)
Buffer pool 2G      20096.33              25627.49              +28%
                    (us=30%, wa=12.5%)    (us=36%, wa=8%)
Buffer pool 5G      30846.34              39435.25              +28%
                    (us=39%, wa=10%)      (us=49%, wa=6%)
Buffer pool 30G     57441.64              66053.68              +15%
                    (us=70%, wa=3.5%)     (us=77%, wa=1%)

         •    Moving ibdata, ib_logfile (+ binary logs) onto HDD
         •    High impact on performance
                – Write volume to the SSD is halved because the doublewrite
                  area is allocated on the HDD
                – %iowait was significantly reduced
                – You can delay write performance deterioration
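
A hedged my.cnf sketch of this layout (the /ssd and /hdd mount points are assumptions; innodb_file_per_table keeps the randomly accessed *.ibd files under datadir on the SSD, while ibdata, redo logs, and binary logs go to the HDD):

    cat >> /etc/my.cnf <<'EOF'
    [mysqld]
    innodb_file_per_table     = 1
    datadir                   = /ssd/mysql   # *.ibd files (random i/o)
    innodb_data_home_dir      = /hdd/mysql   # ibdata (doublewrite, insert buffer)
    innodb_log_group_home_dir = /hdd/mysql   # ib_logfile* (sequential writes)
    log-bin                   = /hdd/mysql/mysql-bin  # binary logs (sequential)
    EOF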
Does CPU matter?
[Diagram: Nehalem attaches memory directly to the CPUs and links to the North Bridge (IOH) and PCI-Express via QPI at 25.6GB/s; older Xeons reach the North Bridge (MCH), which hosts both memory and PCI-Express, over a 10.6GB/s FSB]

 •      Nehalem has two big advantages
        1. Memory is directly attached to the CPU: faster for in-memory workloads
        2. The interface between CPU and North Bridge is 2.5x faster, and its
           traffic does not compete with CPU<->memory traffic: faster for disk
           i/o workloads when using PCI-Express SSDs
Harpertown X5470 (older Xeon) vs Nehalem X5570 (HDD)


HDD                  Harpertown X5470,    Nehalem X5570,      Up
                     3.33GHz              2.93GHz
Buffer pool 1G       1135.37 (us=1%)      1125.44 (us=1%)     -1%
Buffer pool 2G       1922.23 (us=2%)      1863.19 (us=2%)     -3%
Buffer pool 5G       4176.51 (us=7%)      4385.18 (us=7%)     +5%
Buffer pool 30G      30903.4 (us=40%)     36784.76 (us=40%)   +19%

(us: userland CPU utilization)


            •    The CPU difference matters on CPU-bound workloads




Harpertown X5470 vs Nehalem X5570 (Fusion)
Fusion I/O+HDD       Harpertown X5470,    Nehalem X5570,      Up
                     3.33GHz              2.93GHz
Buffer pool 1G       13534.06 (us=35%)    19295.94 (us=32%)   +43%
Buffer pool 2G       19026.64 (us=40%)    25627.49 (us=37%)   +35%
Buffer pool 5G       30058.48 (us=50%)    39435.25 (us=50%)   +31%
Buffer pool 30G      52582.71 (us=76%)    66053.68 (us=76%)   +26%


     • The TPM difference was much larger than on HDD
     • For disk i/o bound workloads (buffer pool 1G/2G), CPU utilization
       on Nehalem was lower, but TPM was much higher
            – This verifies that Nehalem is much more efficient for PCI-E workloads
     • It benefits from the high interface speed between CPU and PCI-Express
     • Fusion I/O fits Nehalem much better than traditional CPUs


We need to think about redundancy overhead

        • Single server + no RAID is meaningless in the real
          database world
        • Redundancy
               – RAID 1 / 5 / 10
               – Network mirroring (DRBD)
               – Replication (sync / async)
        • The relative overhead of redundancy will be (much)
          higher than in a traditional HDD environment




Fusion I/O + Software RAID1

        • Fusion I/O itself has a RAID5 feature
               – Parity bits are written across the flash memory
               – The flash chips are not a single point of failure
               – The controller / PCI-E board is a single point of failure

        • Right now no H/W RAID controller is available for PCI-E SSDs

        • Use software RAID1 (or RAID10)
               – Two Fusion I/O drives in the same machine
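
A minimal sketch of such a setup (assumptions: mdadm is installed, and the two Fusion I/O drives appear as /dev/fioa and /dev/fiob):

    # Mirror the two PCI-E drives and put an xfs filesystem on top
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/fioa /dev/fiob
    mkfs.xfs /dev/md0
    mount -o noatime /dev/md0 /ssd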




Understanding how software RAID1 works
[Diagram: with H/W RAID1, the application writes to /dev/sdX, the battery-backed write cache on the RAID controller acknowledges, and the controller writes to both disks in the background; with S/W RAID1, the application writes to /dev/md0 and the "md0_raid1" daemon writes to both disks in parallel before responding]
        • Response time on software RAID1 is
          max(time-to-write-to-disk1, time-to-write-to-disk2)
        • If either of the two disks spends time on ERASE, response time
          gets longer
        • On faster storage / faster writes (i.e. sequential write + fsync),
          the relative overhead of the software RAID process is higher
Random Write IOPS, S/W RAID1 vs No-RAID
[Chart: random write IOPS (Fusion I/O 160GB SLC, 16KB I/O unit, XFS) over running time (minutes): No-RAID (120G), S/W RAID1 (120G), No-RAID (96G), S/W RAID1 (96G)]
•    120GB data space = 40GB additional reserved space
•    96GB data space = 64GB additional reserved space
•    On S/W RAID1, IOPS deteriorated more quickly than with No-RAID
•    On S/W RAID1 with 96GB data space, the slowest line was lower than No-RAID
•    A 20-25% performance drop can be expected on disk-write-bound workloads
What about Reads?
[Chart: read IOPS (16KB blocks) vs. concurrency (1-200), No-RAID vs. S/W RAID1]
       • Theoretically, read IOPS can double with RAID1
       • Peak IOPS was 43636 with No-RAID, 75627 with RAID1: 73% up
       • Good scalability

DBT-2, No-RAID vs S/W RAID on Fusion I/O

                     Fusion I/O+HDD    RAID1 Fusion I/O+HDD    %iowait    Down
Buffer pool 1G       19295.94          15468.81                10%        -19.8%
Buffer pool 2G       25627.49          21405.23                8%         -16.5%
Buffer pool 5G       39435.25          35086.21                6-7%       -11.0%
Buffer pool 30G      66053.68          66426.52                0-1%       +0.56%




Intel SSDs with a traditional H/W RAID controller

                     Single raw Intel    Four RAID5 Intel    Down
Buffer pool 1G       5709.06             2975.04             -48%
Buffer pool 2G       7536.55             4763.60             -37%
Buffer pool 5G       12892.56            11739.27            -9%

         •    Raw SSD drives performed much better than going through a
              traditional H/W RAID controller
                – Even with RAID10, performance was worse than a single raw drive
                – The H/W RAID controller seemed to be a serious bottleneck
                – Make sure the SSD drives themselves have a write cache protected
                  by a capacitor (Intel X25-V/M/E have no capacitor)
         •    Use JBOD + write cache + capacitor
         •    Research appliances such as Schooner, Gear6, etc
         •    Wait until H/W vendors release good H/W RAID controllers that
              work well with SSDs

What about DRBD?
       • A single server is not highly available
              – Motherboard/RAID controller/etc. are single points of failure
       • Heartbeat + DRBD + MySQL is one of the most common HA
         (active/passive) solutions
       • The network might be a bottleneck
              – 1GbE -> 10GbE, InfiniBand, Dolphin Interconnect, etc
       • Replication level
              – Protocol A (async)
              – Protocol B (sync to the remote DRBD receiver process)
              – Protocol C (sync to the remote disk)
       • Each network channel is single-threaded
              – Storing all data under /data (single DRBD partition) =>
                single thread
              – Storing log/ibdata under /hdd, *ibd under /ssd => two
                threads
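
A hedged sketch of a two-resource DRBD 8.3 configuration that yields two replication channels (hostnames, disks, and addresses are assumptions):

    cat > /etc/drbd.d/mysql.res <<'EOF'
    resource ssd {
      protocol B;
      on db1 { device /dev/drbd0; disk /dev/fioa1; address 10.0.0.1:7788; meta-disk internal; }
      on db2 { device /dev/drbd0; disk /dev/fioa1; address 10.0.0.2:7788; meta-disk internal; }
    }
    resource hdd {
      protocol B;
      on db1 { device /dev/drbd1; disk /dev/sda3; address 10.0.0.1:7789; meta-disk internal; }
      on db2 { device /dev/drbd1; disk /dev/sda3; address 10.0.0.2:7789; meta-disk internal; }
    }
    EOF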
DRBD Overheads on HDD

HDD                  No DRBD     DRBD Protocol B,    DRBD Protocol B,
                                 1GbE                10GbE
Buffer pool 1G       1125.44     1080.8              1101.63
Buffer pool 2G       1863.19     1824.75             1811.95
Buffer pool 5G       4385.18     4285.22             4326.22
Buffer pool 30G      36784.76    32862.81            35689.67



            •     DRBD 8.3.7
            •     DRBD overhead (protocol B) was not big on disk i/o bound
                  workloads
            •     Network bandwidth difference was not big on disk i/o bound
                  workloads

DRBD Overheads on Fusion I/O
Fusion I/O+HDD    No DRBD     DRBD Protocol B,   Down     DRBD Protocol B,   Down
                              1GbE                        10GbE
Buffer pool 1G    19295.94    5976.18            -69.0%   12107.88           -37.3%
Buffer pool 2G    25627.49    8100.5             -68.4%   16776.19           -34.5%
Buffer pool 5G    39435.25    16073.9            -59.2%   30288.63           -23.2%
Buffer pool 30G   66053.68    37974              -42.5%   62024.68           -6.1%



    •    DRBD overhead was not negligible
    •    10GbE performed much better than 1GbE
    •    Still 6-10 times faster than HDD
    •    Note: DRBD supports faster interconnects such as InfiniBand SDP
         and Dolphin Interconnect


Misc topic: Insert performance on InnoDB vs MyISAM (HDD)


[Chart: time in seconds to insert 1 million records vs. existing records (millions), InnoDB vs. MyISAM on HDD; annotation: ~250 rows/s]


            • MyISAM doesn't do any special i/o optimization like "insert
              buffering", so a lot of random reads/writes happen, and
              performance depends heavily on the OS (filesystem cache)
            • Disk seek & rotation overhead is really serious on HDD

Note: Insert Buffering (InnoDB feature)
•   If non-unique secondary index blocks are not in memory, InnoDB
    inserts entries into a special buffer (the "insert buffer") to avoid
    random disk i/o operations
     – The insert buffer is allocated both in memory and in the InnoDB
       SYSTEM tablespace

•   Periodically, the insert buffer is merged into the secondary index
    trees in the database ("merge")

[Diagram: inserts go to the insert buffer, which merges them into the secondary indexes with optimized i/o]

•   Pros: reduces I/O overhead
     – Reduces the number of disk i/o operations by merging i/o
       requests to the same block
     – Some random i/o operations become sequential

•   Cons: additional operations are added, and merging might take a
    very long time
     – when many secondary indexes must be updated and many rows have
       been inserted
     – it may continue to happen after a server shutdown and restart
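
The insert buffer's size and merge activity can be watched from the InnoDB status output (a sketch; the section header below is what InnoDB prints):

    mysql -e 'SHOW ENGINE INNODB STATUS\G' \
      | grep -A 4 'INSERT BUFFER AND ADAPTIVE HASH INDEX'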
Insert performance: InnoDB vs MyISAM (SSD)
[Chart: time in seconds to insert 1 million records vs. existing records (millions), InnoDB vs. MyISAM on SSD; annotations: ~2,000 rows/s and ~5,000 rows/s plateaus, "index size exceeded buffer pool size", "filesystem cache was fully used, disk reads began"]


            • MyISAM got much faster just by replacing HDD with SSD!


Try MySQL 5.5.4!

Fusion I/O + HDD     MySQL 5.5.2    MySQL 5.5.4    Up
Buffer pool 1G       19295.94       24019.32       +24%
Buffer pool 2G       25627.49       32325.76       +26%
Buffer pool 5G       39435.25       47296.12       +20%
Buffer pool 30G      66053.68       67253.45       +1.8%

   • Got 20-26% improvements for disk i/o bound workloads on Fusion I/O
          – Both CPU %user and %iowait improved
                 • %user: 36% (5.5.2) to 44% (5.5.4) when buffer pool = 2G
                 • %iowait: 8% (5.5.2) to 5.5% (5.5.4) when buffer pool = 2G,
                   but IOPS was 20% higher
          – 5.5.4 could handle a lot more concurrent i/o requests!
          – No big difference was found on 4 HDDs
                 • It works very well on faster storage such as Fusion I/O
                   or lots of disks
Conclusion for choosing H/W
       • Disks
              – PCI-E SSDs (i.e. Fusion I/O) perform very well
              – SAS/SATA SSDs (i.e. Intel X25)
              – Carefully research the RAID controller; many controllers do
                not scale with SSD drives
              – Keep enough reserved space if you need to handle massive
                write traffic
              – HDD is good at sequential writes

       • Use a fast network adapter
              – 1GbE will be saturated with DRBD
              – 10GbE or InfiniBand

       • Use a Nehalem CPU
              – Especially when using PCI-Express SSDs
Conclusion for database deployments
       • Put sequentially written files on HDD
              –   ibdata, ib_logfile, binary log files
              –   HDD is fast enough for sequential writes
              –   Write performance deterioration can be mitigated
              –   The life expectancy of the SSD will be longer

       • Put randomly accessed files on SSD
              – *ibd files, index files (MYI), data files (MYD)
              – SSD is 10x-100x faster than HDD for random reads

       • Archive less active tables/records to HDD
              – SSD is still much more expensive than HDD

       • Use the InnoDB Plugin
              – Higher scalability & concurrency matter on faster storage
What will happen in the real database world?
    • These are just my thoughts…

    • Less demand for NoSQL
           – Isn't it enough for many applications just to replace HDD with Fusion I/O?
           – The importance of functionality will be relatively greater

    • Stronger demand for virtualization
           – A single server will have enough capacity to run two or more
             mysqld instances

    • I/O volume matters
           – Not just IOPS
           – Block size, disabling doublewrite, etc (see the sketch below)

    • Concurrency matters
           – A single SSD scales as well as 8-16 HDDs
           – Concurrent ALTER TABLE, parallel query
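
For instance, the doublewrite buffer can be switched off from my.cnf to halve InnoDB's write volume (a hedged sketch; only consider it where torn pages cannot happen or are acceptable):

    cat >> /etc/my.cnf <<'EOF'
    [mysqld]
    innodb_doublewrite = 0   # skips the doublewrite buffer; weigh crash-recovery risk
    EOF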
Special Thanks To

       • Koji Watanabe – Fusion I/O Japan
       • Hideki Endo – Sumisho Computer Systems, Japan
               – Lent me two Fusion I/O 160GB SLC drives


       • Daisuke Homma, Masashi Hasegawa - Sun Japan
              – Did benchmarks together




Thanks for attending!

       • Contact:
              – E-mail: Yoshinori.Matsunobu@sun.com
               – Blog: http://yoshinorimatsunobu.blogspot.com
              – @matsunobu on Twitter




002-Storage Basics and Application Environments V1.0.pptx002-Storage Basics and Application Environments V1.0.pptx
002-Storage Basics and Application Environments V1.0.pptx
 
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera..."The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)
 
Ssd And Enteprise Storage
Ssd And Enteprise StorageSsd And Enteprise Storage
Ssd And Enteprise Storage
 
Presentation database on flash
Presentation   database on flashPresentation   database on flash
Presentation database on flash
 
Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014
 
Solid State Drive Technology - MIT Lincoln Labs
Solid State Drive Technology - MIT Lincoln LabsSolid State Drive Technology - MIT Lincoln Labs
Solid State Drive Technology - MIT Lincoln Labs
 
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
 
JetStor NAS 724uxd 724uxd 10g - technical presentation
JetStor NAS 724uxd 724uxd 10g - technical presentationJetStor NAS 724uxd 724uxd 10g - technical presentation
JetStor NAS 724uxd 724uxd 10g - technical presentation
 

Mais de Yoshinori Matsunobu

Consistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced DurabilityConsistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced DurabilityYoshinori Matsunobu
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deploymentYoshinori Matsunobu
 
データベース技術の羅針盤
データベース技術の羅針盤データベース技術の羅針盤
データベース技術の羅針盤Yoshinori Matsunobu
 
MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話Yoshinori Matsunobu
 
MySQL for Large Scale Social Games
MySQL for Large Scale Social GamesMySQL for Large Scale Social Games
MySQL for Large Scale Social GamesYoshinori Matsunobu
 
ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計Yoshinori Matsunobu
 
More mastering the art of indexing
More mastering the art of indexingMore mastering the art of indexing
More mastering the art of indexingYoshinori Matsunobu
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Yoshinori Matsunobu
 
Linux/DB Tuning (DevSumi2010, Japanese)
Linux/DB Tuning (DevSumi2010, Japanese)Linux/DB Tuning (DevSumi2010, Japanese)
Linux/DB Tuning (DevSumi2010, Japanese)Yoshinori Matsunobu
 

Mais de Yoshinori Matsunobu (11)

Consistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced DurabilityConsistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced Durability
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deployment
 
データベース技術の羅針盤
データベース技術の羅針盤データベース技術の羅針盤
データベース技術の羅針盤
 
MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話
 
Introducing MySQL MHA (JP/LT)
Introducing MySQL MHA (JP/LT)Introducing MySQL MHA (JP/LT)
Introducing MySQL MHA (JP/LT)
 
MySQL for Large Scale Social Games
MySQL for Large Scale Social GamesMySQL for Large Scale Social Games
MySQL for Large Scale Social Games
 
Automated master failover
Automated master failoverAutomated master failover
Automated master failover
 
ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計ソーシャルゲームのためのデータベース設計
ソーシャルゲームのためのデータベース設計
 
More mastering the art of indexing
More mastering the art of indexingMore mastering the art of indexing
More mastering the art of indexing
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)
 
Linux/DB Tuning (DevSumi2010, Japanese)
Linux/DB Tuning (DevSumi2010, Japanese)Linux/DB Tuning (DevSumi2010, Japanese)
Linux/DB Tuning (DevSumi2010, Japanese)
 

Último

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Último (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

SSD Deployment Strategies for MySQL

  • 1. SSD Deployment Strategies for MySQL Yoshinori Matsunobu Lead of MySQL Professional Services APAC Sun Microsystems Yoshinori.Matsunobu@sun.com Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 1
  • 2. What do you need to consider? (H/W layer) • SSD or HDD? • Interface – SATA/SAS or PCI-Express? • RAID – H/W RAID, S/W RAID or JBOD? • Network – Is 1GbE enough? • Memory – Is 2GB RAM + PCI-E SSD faster than 64GB RAM + 8HDDs? • CPU – Nehalem or older Xeon? Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 2
  • 3. What do you need to consider? • Redundancy – RAID – DRBD (network mirroring) – Semi-Sync MySQL Replication – Async MySQL Replication • Filesystem – ext3, xfs, raw device ? • File location – Data file, Redo log file, etc • SSD specific issues – Write performance deterioration – Write endurance Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 3
  • 4. Why SSD? IOPS! • IOPS: Number of (random) disk i/o operations per second • Almost all database operations require random access – Selecting records by index scan – Updating records – Deleting records – Modifying indexes • Regular SAS HDD : 200 iops per drive (disk seek & rotation is slow) • SSD : 2,000+ (writes) / 5,000+ (reads) per drive – highly depending on SSDs and device drivers • Let’s start from basic benchmarks Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 4
  • 5. Tested HDD/SSD for this session • SSD – Intel X25-E (SATA, 30GB, SLC) – Fusion I/O (PCI-Express, 160GB, SLC) • HDD – Seagate 160GB SAS 15000RPM Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 5
  • 6. Table of contents • Basic Performance on SSD/HDD – Random Reads – Random Writes – Sequential Reads – Sequential Writes – fsync() speed – Filesystem difference – IOPS and I/O unit size • MySQL Deployments Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 6
  • 7. Random Read benchmark [chart: Direct Random Read IOPS (single drive, 16KB, xfs) vs. # of I/O threads, for HDD, Intel SSD and Fusion I/O] • HDD: 196 reads/s at 1 i/o thread, 443 reads/s at 100 i/o threads • Intel: 3508 reads/s at 1 i/o thread, 14538 reads/s at 100 i/o threads • Fusion I/O: 10526 reads/s at 1 i/o thread, 41379 reads/s at 100 i/o threads • Single-thread throughput on Intel is 16x better than on HDD, Fusion is 25x better • SSD’s concurrency gain (4x) is much better than HDD’s (2.2x) • Very strong reason to use SSD Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 7
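The slides don’t say which tool produced these IOPS numbers. A roughly equivalent fio job file (the file path and sizes are hypothetical) for the direct 16KB random read test would look like the sketch below; varying numjobs reproduces the thread counts on the x-axis, and changing bs to 1k/4k/16k reproduces the later I/O-unit-size tests.

    ; randread.fio -- direct random reads in 16KB units (hypothetical file and size)
    [global]
    ioengine=libaio
    direct=1              ; bypass the page cache, as in the benchmark
    rw=randread
    bs=16k
    filename=/ssd/testfile
    size=16g
    runtime=60
    time_based
    group_reporting

    [randread]
    numjobs=10            ; vary from 1 to 200 to sweep the concurrency axis

Run it with “fio randread.fio” and read the aggregate iops figure from the output.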
  • 8. High Concurrency • A single SSD drive has multiple NAND flash memory chips (e.g. 40 x 4GB flash memory = 160GB) • Highly dependent on the I/O controller and applications – A single-threaded application cannot gain a concurrency advantage Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 8
  • 9. PCI-Express SSD [diagram: PCI-Express x8 path (2GB/s) vs. SAS/SATA path (300MB/s) from the CPU through the controller to flash] • Advantage – PCI-Express is a much faster interface than SAS/SATA • (current) Disadvantages – Most motherboards have a limited # of PCI-E slots – No hot swap mechanism Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 9
  • 10. Write performance on SSD [chart: random write IOPS (16KB blocks), 1 vs. 100 i/o threads, for HDD (4-disk RAID10, xfs), Intel (xfs) and Fusion I/O (xfs)] • Very strong reason to use SSD • But wait.. can we get high write throughput *anytime*? – Not always.. let’s check how data is written to flash memory Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 10
  • 11. Understanding how data is written to SSD (1) [diagram: flash memory chips, each containing blocks, each block containing pages] • A single SSD drive consists of many flash memory chips (e.g. 2GB each) • A flash memory chip internally consists of many blocks (e.g. 512KB) • A block internally consists of many pages (e.g. 4KB) • It is *not* possible to overwrite a non-empty block – Reading from pages is possible – Writing to pages in an empty block is possible – Appending is possible – Overwriting pages in a non-empty block is *not* possible Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 11
  • 12. Understanding how data is written to SSD (2) [diagram: new data being written to an empty block instead of overwriting a used one] • Overwriting a non-empty block is not possible • New data is written to an empty block instead • Writing to an empty block is fast (~200 microseconds) • Even though applications write to the same positions in the same files (e.g. the InnoDB log file), the written pages/blocks are distributed (wear leveling) Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 12
  • 13. Understanding how data is written to SSD (3) [diagram: read all pages from the block → erase the block → write all data back] • In the long run, almost all blocks will be fully used – e.g. allocating 158GB of files on a 160GB SSD • A new empty block must be allocated on writes • Basic steps to write new data: – 1. Reading all pages from a block – 2. ERASE the block – 3. Writing all data w/ new data into the block • ERASE is a very expensive operation (takes a few milliseconds) • At this stage, write performance becomes very slow because of massive ERASE operations Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 13
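To see why ERASE dominates, plug in the example sizes above: a 512KB block holds 128 pages of 4KB, so rewriting a single 4KB page in a full block can force reading the other 127 pages, a multi-millisecond ERASE, and rewriting all 128 pages – roughly 100x write amplification for one logical write. This is why steady-state write performance falls so far below fresh-drive performance on the following slides.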
  • 14. [diagram: data space vs. reserved space; reads and writes go to data-space blocks while background jobs ERASE unused blocks] • To keep write performance high enough, SSDs have a “reserved space” feature • The data size visible to applications is limited to the size of the data space – e.g. 160GB SSD, 120GB data space, 40GB reserved space • Fusion I/O provides a way to change the reserved space size – # fio-format -s 96G /dev/fct0 Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 14
  • 15. Write performance deterioration [chart: write IOPS deterioration (16KB random write), “Fastest” vs. “Slowest” lines for Intel and Fusion I/O at 150G/120G/96G/80G data space; continuous write-intensive workloads push IOPS toward “Slowest”, stopping writes for a while restores them] • At the beginning, write IOPS was close to the “Fastest” line • When massive writes happened, write IOPS gradually deteriorated toward the “Slowest” line (because massive ERASE happened) • Increasing reserved space improves steady-state write throughput • Write IOPS recovered to “Fastest” when stopping writes for a long time (many blocks were ERASEd by the background job) • Highly dependent on the flash memory and I/O controller (TRIM support, ERASE scheduling, etc) Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 15
  • 16. Sequential I/O [chart: sequential read/write throughput (1MB consecutive reads/writes, MB/s) for 4 HDD RAID10 (xfs), Intel (xfs) and Fusion I/O (xfs)] • Typical scenario: full table scan (read), logging/journaling (write) • SSD outperforms HDD for sequential reads, but less dramatically than for random reads • HDD (4-disk RAID10) is fast enough for sequential i/o • The data volume of sequential writes tends to be huge, so you need to watch for write deterioration on SSD • No strong reason to use SSD for sequential writes Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 16
  • 17. fsync() speed [chart: fsync/sec for 1KB/8KB/16KB writes on HDD (xfs), Intel (xfs) and Fusion I/O (xfs)] • 10,000+ fsync/sec is fine in most cases • Fusion I/O was CPU bound (%system), not I/O bound (%iowait) Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 17
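The deck doesn’t show how fsync/sec was measured; a minimal C sketch in the same spirit (hypothetical path, 16KB synchronous overwrites; build with gcc -O2 fsync_bench.c -lrt) could be:

    /* fsync_bench.c -- rough fsync/sec measurement */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/ssd/fsync_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        static char buf[16384];                  /* 16KB, matching the slide */
        memset(buf, 'a', sizeof(buf));
        const int iters = 10000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            /* overwrite the same 16KB and force it to stable storage */
            if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) { perror("pwrite"); return 1; }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f fsync/sec\n", iters / sec);
        close(fd);
        return 0;
    }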
  • 18. HDD is fast for sequential writes / fsync [diagram: a battery-backed write cache in front of the disks hides seek & rotation time] • Best practice: writes can be boosted by using BBWC (Battery Backed up Write Cache), especially for REDO logs (because they are written sequentially) • No strong reason to use SSDs here Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 18
  • 19. Filesystem matters [chart: random write IOPS (16KB blocks), 1 vs. 16 threads, for Fusion I/O on ext3, xfs and the raw device] • On xfs, multiple threads can write to the same file if opened with O_DIRECT, but they cannot on ext* • Good concurrency on xfs, close to the raw device • ext3 is less optimized for Fusion I/O Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 19
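The O_DIRECT detail above is easy to trip over: direct I/O requires the buffer address, file offset, and transfer size to be suitably aligned (typically to 512 bytes or 4KB, depending on the device). A minimal C sketch, assuming a hypothetical test file on xfs:

    /* odirect_write.c -- one aligned 16KB write with O_DIRECT */
    #define _GNU_SOURCE                   /* exposes O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/ssd/direct_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        void *buf;
        if (posix_memalign(&buf, 4096, 16384) != 0) return 1;   /* 4KB-aligned buffer */
        memset(buf, 0, 16384);
        /* with an unaligned buffer this pwrite would fail with EINVAL */
        if (pwrite(fd, buf, 16384, 0) != 16384) { perror("pwrite"); return 1; }
        free(buf);
        close(fd);
        return 0;
    }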
  • 20. Changing I/O unit size [chart: read IOPS vs. concurrency (1–200) for 1KB/4KB/16KB units on 4 HDD RAID10] • On HDD, a maximum 22% performance difference was found between 1KB and 16KB • No big difference when concurrency < 10 Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 20
  • 21. Changing I/O unit size on SSD [chart: read IOPS vs. concurrency (1–200) for 1KB/4KB/16KB units on Fusion I/O] • Huge difference • On SSDs, not only IOPS but also the I/O transfer size matters • It’s worth considering storage engines that support a configurable block size Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 21
  • 22. Let’s start MySQL benchmarking • Base: Disk-bound application (DBT-2) running on: – Sun Fire X4270 – Nehalem 8 Core – 4 HDD – RAID1+0, Write Cache with Battery • What will happen if … – Replacing HDD with Intel SSD (SATA) – Replacing HDD with Fusion I/O (PCI-E) – Moving log files and ibdata to HDD – Not using Nehalem – Using two Fusion I/O drives with Software RAID1 – Deploying DRBD protocol B or C • Replacing 1GbE with 10GbE – Using MySQL 5.5.4 Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 22
  • 23. DBT-2 condition • SuSE Enterprise Linux 11, xfs • MySQL 5.5.2M2 (InnoDB Plugin 1.0.6) • 200 Warehouses (20GB – 25GB hot data) • Buffer pool size – 1GB – 2GB – 5GB – 30GB (large enough to cache all data) • 1000 seconds warm up time • Running 3600 seconds (1 hour) • Fusion I/O: 96GB data space, 64GB reserved space Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 23
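The slides list the buffer pool sizes but not the rest of the server configuration; a plausible my.cnf fragment for this kind of run (the option names are standard InnoDB settings, the values are illustrative rather than the actual benchmark settings) might be:

    [mysqld]
    innodb_buffer_pool_size        = 5G        # varied 1G/2G/5G/30G in the tests
    innodb_log_file_size           = 512M
    innodb_flush_log_at_trx_commit = 1
    innodb_flush_method            = O_DIRECT  # matches the direct-I/O benchmarks above
    innodb_file_per_table          = 1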
  • 24. HDD vs Intel SSD HDD Intel Buffer pool 1G 1125.44 5709.06 (NOTPM: Transactions per minute) • Storing all data on HDD or Intel SSD • Massive disk i/o happens – Random reads for all accesses – Random writes for updating rows and indexes – Sequential writes for REDO log files, etc • SSD is very good at these kinds of workloads • 5.5 times performance improvement, without any application change! Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 24
  • 25. HDD vs Intel SSD vs Fusion I/O HDD Intel Fusion I/O Buffer pool 1G 1125.44 5709.06 15122.75 • Fusion I/O is a PCI-E based SSD • PCI-E is much faster than SAS/SATA • 14x improvement compared to 4HDDs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 25
  • 26. Which should we spend money on, RAM or SSD? HDD Intel Fusion I/O Buffer pool 1G 1125.44 5709.06 15122.75 Buffer pool 2G 1863.19 Buffer pool 5G 4385.18 Buffer pool 30G 36784.76 (Caching all hot data) • Increasing RAM (buffer pool size) reduces random disk reads – Because more data are cached in the buffer pool • If all data are cached, only disk writes (both random and sequential) happen • Disk writes happen asynchronously, so application queries can be much faster • Large enough RAM + HDD outperforms too small RAM + SSD Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 26
  • 27. Which should we spend money on, RAM or SSD? HDD Intel Fusion I/O Buffer pool 1G 1125.44 5709.06 15122.75 Buffer pool 2G 1863.19 7536.55 20096.33 Buffer pool 5G 4385.18 12892.56 30846.34 Buffer pool 30G 36784.76 - 57441.64 (Caching all hot data) • It is not always possible to cache all hot data • Fusion I/O + a good amount of memory (5GB) was pretty good • A basic rule can be: – If you can cache all active data, large enough RAM + HDD – If you can’t, or if you need extremely high throughput, spend on both RAM and SSD Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 27
  • 28. Let’s think about MySQL file location • SSD is extremely good at random reads • SSD is very good at random writes • HDD is good enough at sequential reads/writes • No strong reason to use SSD for sequential writes • Random I/O oriented: – Data files (*.ibd) • Sequential reads if doing a full table scan – Undo log, insert buffer (ibdata) • UNDO tablespace (small in most cases, except when running long-running batches) • On-disk insert buffer space (small in most cases, except when InnoDB cannot catch up with updating indexes) • Sequential write oriented: – Doublewrite buffer (ibdata) • Write volume is equal to the *ibd files – huge – Binary log (mysql-bin.XXXXXX) – Redo log (ib_logfile) – Backup files Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 28
  • 29. Moving sequentially written files onto HDD Fusion I/O Fusion I/O + HDD Up Buffer pool 1G 15122.75 19295.94 +28% (us=25%, wa=15%) (us=32%, wa=10%) Buffer pool 2G 20096.33 25627.49 +28% (us=30%, wa=12.5%) (us=36%, wa=8%) Buffer pool 5G 30846.34 39435.25 +28% (us=39%, wa=10%) (us=49%, wa=6%) Buffer pool 30G 57441.64 66053.68 +15% (us=70%, wa=3.5%) (us=77%, wa=1%) • Moving ibdata, ib_logfile (+ binary logs) onto HDD • High impact on performance – Write volume to the SSD is halved because the doublewrite area is allocated on HDD – %iowait was significantly reduced – You can delay write performance deterioration Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 29
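As a concrete sketch of this split (the mount points /ssd and /hdd are hypothetical), the randomly accessed *.ibd files stay on the SSD while ibdata, the redo logs and the binary logs move to the HDD:

    [mysqld]
    # random-I/O files (*.ibd) on SSD
    datadir                   = /ssd/mysql
    innodb_file_per_table     = 1
    # sequentially written files (ibdata with doublewrite buffer, redo logs, binlogs) on HDD
    innodb_data_home_dir      = /hdd/mysql
    innodb_data_file_path     = ibdata1:1G:autoextend
    innodb_log_group_home_dir = /hdd/mysql
    log-bin                   = /hdd/mysql/mysql-bin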
  • 30. Does CPU matter? [diagram: Nehalem has memory attached directly to the CPUs and QPI (25.6GB/s) to the North Bridge (IOH); older Xeons reach memory through the North Bridge (MCH) over the FSB (10.6GB/s)] • Nehalem has two big advantages 1. Memory is directly attached to the CPU: faster for in-memory workloads 2. The interface speed between CPU and North Bridge is 2.5x higher, and interface traffic does not conflict with CPU<->memory traffic: faster for disk i/o workloads when using PCI-Express SSDs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 30
  • 31. Harpertown X5470 (older Xeon) vs Nehalem X5570 (HDD) HDD Harpertown X5470, Nehalem(X5570, Up 3.33GHz 2.93GHz) Buffer pool 1G 1135.37 (us=1%) 1125.44 (us=1%) -1% Buffer pool 2G 1922.23 (us=2%) 1863.19 (us=2%) -3% Buffer pool 5G 4176.51 (us=7%) 4385.18(us=7%) +5% Buffer pool 30G 30903.4 (us=40%) 36784.76 (us=40%) +19% us: userland CPU utilization • CPU difference matters on CPU bound workloads Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 31
  • 32. Harpertown X5470 vs Nehalem X5570 (Fusion) Fusion I/O+HDD Harpertown X5470, Nehalem(X5570, Up 3.33GHz 2.93GHz) Buffer pool 1G 13534.06 (user=35%) 19295.94 (user=32%) +43% Buffer pool 2G 19026.64 (user=40%) 25627.49 (user=37%) +35% Buffer pool 5G 30058.48 (user=50%) 39435.25 (user=50%) +31% Buffer pool 30G 52582.71 (user=76%) 66053.68 (user=76%) +26% • The TPM difference was much larger than on HDD • For disk i/o bound workloads (buffer pool 1G/2G), CPU utilization on Nehalem was smaller, but TPM was much higher – Verified that Nehalem is much more efficient for PCI-E workloads • Benefit from the high interface speed between CPU and PCI-Express • Fusion I/O fits with Nehalem much better than with traditional CPUs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 32
  • 33. We need to think about redundancy overhead • Single server + no RAID is meaningless in the real database world • Redundancy – RAID 1 / 5 / 10 – Network mirroring (DRBD) – Replication (sync / async) • The relative overhead for redundancy will be (much) higher than in a traditional HDD environment Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 33
  • 34. Fusion I/O + Software RAID1 • Fusion I/O itself has a RAID5 feature – Writing parity bits into flash memory – Flash chips are not a single point of failure – The controller / PCI-E board is a single point of failure • Right now no H/W RAID controller is provided for PCI-E SSDs • Using software RAID1 (or RAID10) – Two Fusion I/O drives in the same machine Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 34
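A minimal sketch of that software RAID1 setup (Fusion I/O device names vary by driver version; /dev/fioa and /dev/fiob are assumptions):

    # mirror two Fusion I/O drives with md, then put xfs on top
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/fioa /dev/fiob
    mkfs.xfs /dev/md0
    mount -o noatime /dev/md0 /ssd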
  • 35. Understanding how software RAID1 works [diagram: with H/W RAID1 the app writes to /dev/sdX, the response comes from the controller’s battery-backed write cache, and the controller writes to both disks in the background; with S/W RAID1 the app writes to /dev/md0 and the “md0_raid1” software RAID daemon writes to both disks in parallel] • Response time on software RAID1 is max(time-to-write-to-disk1, time-to-write-to-disk2) • If either of the two takes time for ERASE, response time will be longer • On faster storage / faster writes (i.e. sequential write + fsync), the relative overhead of the software RAID process is higher Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 35
  • 36. Random Write IOPS, S/W RAID1 vs No-RAID [chart: random write IOPS (Fusion I/O 160GB SLC, 16KB I/O unit, xfs) over running time (minutes) for No-RAID and S/W RAID1 at 120G and 96G data space] • 120GB data space = 40GB additional reserved space • 96GB data space = 64GB additional reserved space • On S/W RAID1, IOPS deteriorated more quickly than on No-RAID • On S/W RAID1 with 96GB data space, the slowest line was lower than on No-RAID • A 20-25% performance drop can be expected on disk write bound workloads Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 36
  • 37. What about Reads? [chart: read IOPS (16KB blocks) vs. concurrency, No-RAID vs S/W RAID1] • Theoretically read IOPS can be doubled by RAID1 • Peak IOPS was 43636 on No-RAID, 75627 on RAID1, 73% up • Good scalability Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 37
  • 38. DBT-2, No-RAID vs S/W RAID on Fusion I/O Fusion I/O+HDD RAID 1 Fusion %iowait Down I/O+HDD Buffer pool 1G 19295.94 15468.81 10% -19.8% Buffer pool 2G 25627.49 21405.23 8% -16.5% Buffer pool 5G 39435.25 35086.21 6-7% -11.0% Buffer pool 30G 66053.68 66426.52 0-1% +0.56% Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 38
  • 39. Intel SSDs with a traditional H/W RAID controller Single raw Intel Four RAID5 Intel Down Buffer pool 1G 5709.06 2975.04 -48% Buffer pool 2G 7536.55 4763.60 -37% Buffer pool 5G 12892.56 11739.27 -9% • Raw SSD drives performed much better than going through a traditional H/W RAID controller – Even on RAID10, performance was worse than a single raw drive – The H/W RAID controller seemed to be a serious bottleneck – Make sure the SSD drives themselves have a write cache and a capacitor (Intel X25-V/M/E doesn’t have a capacitor) • Use JBOD + write cache + capacitor • Research appliances such as Schooner, Gear6, etc • Wait until H/W vendors release great H/W RAID controllers that work well with SSDs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 39
  • 40. What about DRBD? • Single server is not Highly Available – Mother Board/RAID Controller/etc are Single Point of Failure • Heartbeat + DRBD + MySQL is one of the most common HA (Active/Passive) solutions • Network might be a bottleneck – 1GbE -> 10GbE, InfiniBand, Dolphin Interconnect, etc • Replication level – Protocol A (async) – Protocol B (sync to remote drbd receiver process) – Protocol C (sync to remote disk) • Network channel is single threaded – Storing all data under /data (single DRBD partition) => single thread – Storing log/ibdata under /hdd, *ibd under /ssd => two threads Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 40
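A minimal drbd.conf sketch for the protocol B setup (hostnames, backing devices and addresses are placeholders); to get the two network channels mentioned above, define one such resource backed by the SSD and a second one backed by the HDD:

    resource r0 {
      protocol B;                # ack when the remote DRBD receiver has the data
      on db1 {
        device    /dev/drbd0;
        disk      /dev/fioa;     # hypothetical backing device
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on db2 {
        device    /dev/drbd0;
        disk      /dev/fioa;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }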
  • 41. DRBD Overheads on HDD HDD No DRBD DRBD Protocol DRBD Protocol B, B, 1GbE 10GbE Buffer pool 1G 1125.44 1080.8 1101.63 Buffer pool 2G 1863.19 1824.75 1811.95 Buffer pool 5G 4385.18 4285.22 4326.22 Buffer pool 30G 36784.76 32862.81 35689.67 • DRBD 8.3.7 • DRBD overhead (protocol B) was not big on disk i/o bound workloads • Network bandwidth difference was not big on disk i/o bound workloads Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 41
  • 42. DRBD Overheads on Fusion I/O Fusion I/O+HDD No DRBD DRBD Protocol Down DRBD Protocol Down B, 1GbE B, 10GbE Buffer pool 1G 19295.94 5976.18 -69.0% 12107.88 -37.3% Buffer pool 2G 25627.49 8100.5 -68.4% 16776.19 -34.5% Buffer pool 5G 39435.25 16073.9 -59.2% 30288.63 -23.2% Buffer pool 30G 66053.68 37974 -42.5% 62024.68 -6.1% • DRBD overhead was not negligible • 10GbE performed much better than 1GbE • Still 6-10 times faster than HDD • Note: DRBD supports faster interface such as InfiniBand SDP and Dolphin Interconnect Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 42
  • 43. Misc topic: Insert performance on InnoDB vs MyISAM (HDD) [chart: time (seconds) to insert 1 million records (HDD) vs. existing records (millions), InnoDB vs MyISAM, annotated at 250 rows/s] • MyISAM doesn’t do any special i/o optimization like “insert buffering”, so a lot of random reads/writes happen, and performance is highly dependent on the OS • Disk seek & rotation overhead is really serious on HDD Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 43
  • 44. Note: Insert Buffering (InnoDB feature) • If non-unique secondary index blocks are not in memory, InnoDB inserts entries into a special buffer (the “insert buffer”) to avoid random disk i/o operations – The insert buffer is allocated both in memory and in the InnoDB SYSTEM tablespace • Periodically, the insert buffer is merged into the secondary index trees in the database (“merge”) • Pros: reducing I/O overhead – Reducing the number of disk i/o operations by merging i/o requests to the same block – Some random i/o operations can become sequential • Cons: additional operations are added – Merging might take a very long time when many secondary indexes must be updated and many rows have been inserted – It may continue to happen after a server shutdown and restart Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 44
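Insert buffer activity can be observed in SHOW ENGINE INNODB STATUS; the section below is illustrative output (the counter values are made up, not from these benchmarks):

    mysql> SHOW ENGINE INNODB STATUS\G
    ...
    -------------------------------------
    INSERT BUFFER AND ADAPTIVE HASH INDEX
    -------------------------------------
    Ibuf: size 1, free list len 467, seg size 469,
    58339 inserts, 58339 merged recs, 11424 merges
    ...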
  • 45. Insert performance: InnoDB vs MyISAM (SSD) [chart: time (seconds) to insert 1 million records (SSD) vs. existing records (millions), InnoDB vs MyISAM, annotated at 2,000 and 5,000 rows/s; marked where the index size exceeded the buffer pool size and where the filesystem cache was fully used and disk reads began] • MyISAM got much faster by just replacing HDD with SSD! Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 45
  • 46. Try MySQL 5.5.4! Fusion I/O + HDD MySQL5.5.2 MySQL5.5.4 Up Buffer pool 1G 19295.94 24019.32 +24% Buffer pool 2G 25627.49 32325.76 +26% Buffer pool 5G 39435.25 47296.12 +20% Buffer pool 30G 66053.68 67253.45 +1.8% • Got 20-26% improvements for disk i/o bound workloads on Fusion I/O – Both CPU %user and %iowait were improved • %user: 36% (5.5.2) to 44% (5.5.4) when buf pool = 2g • %iowait: 8% (5.5.2) to 5.5% (5.5.4) when buf pool = 2g, but iops was 20% higher – Could handle a lot more concurrent i/o requests in 5.5.4! – No big difference was found on 4 HDDs • Works very well on faster storage such as Fusion I/O, or lots of disks Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 46
  • 47. Conclusion for choosing H/W • Disks – PCI-E SSDs (e.g. Fusion I/O) perform very well – SAS/SATA SSDs (e.g. Intel X25) – Carefully research the RAID controller. Many controllers do not scale with SSD drives – Keep enough reserved space if you need to handle massive write traffic – HDD is good at sequential writes • Use a fast network adapter – 1GbE will be saturated on DRBD – 10GbE or InfiniBand • Use Nehalem CPUs – Especially when using PCI-Express SSDs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 47
  • 48. Conclusion for database deployments • Put sequentially written files on HDD – ibdata, ib_logfile, binary log files – HDD is fast enough for sequential writes – Write performance deterioration can be mitigated – Life expectancy of the SSD will be longer • Put randomly accessed files on SSD – *ibd files, index files (MYI), data files (MYD) – SSD is 10x-100x faster for random reads than HDD • Archive less active tables/records to HDD – SSD is still much more expensive than HDD • Use InnoDB Plugin – Higher scalability & concurrency matter on faster storage Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 48
  • 49. What will happen in the real database world? • These are just my thoughts.. • Less demand for NoSQL – Isn’t it enough for many applications just to replace HDD with Fusion I/O? – Importance on functionality will be relatively stronger • Stronger demand for Virtualization – Single server will have enough capacity to run two or more mysqld instances • I/O volume matters – Not just IOPS – Block size, disabling doublewrite, etc • Concurrency matters – Single SSD scales as well as 8-16 HDDs – Concurrent ALTER TABLE, parallel query Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 49
  • 50. Special Thanks To • Koji Watanabe – Fusion I/O Japan • Hideki Endo – Sumisho Computer Systems, Japan – Lent me two Fusion I/O 160GB SLC drives • Daisuke Homma, Masashi Hasegawa - Sun Japan – Did the benchmarks together Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 50
  • 51. Thanks for attending! • Contact: – E-mail: Yoshinori.Matsunobu@sun.com – Blog http://yoshinorimatsunobu.blogspot.com – @matsunobu on Twitter Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 51
  • 52. Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 52