NFS Tuning for Oracle

                         Kyle Hailey
http://dboptimizer.com    April 2013
Intro
 Who am I
    why NFS interests me
 DAS / NAS / SAN
    Throughput
    Latency
 NFS configuration issues*
    Network topology
    TCP config
    NFS Mount Options

* for non-RAC, non-dNFS
Who is Kyle Hailey

 1990 Oracle
     90 support
     92 Ported v6
     93 France
     95 Benchmarking
     98 ST Real World Performance


 2000 Dot.Com
 2001 Quest
 2002 Oracle OEM 10g                Success!
                                     First successful OEM design
Who is Kyle Hailey

 1990 Oracle
       90 support
       92 Ported v6
       93 France
       95 Benchmarking
       98 ST Real World Performance


   2000 Dot.Com
   2001 Quest
   2002 Oracle OEM 10g
   2005 Embarcadero
     DB Optimizer
Who is Kyle Hailey

 1990 Oracle
       90 support
       92 Ported v6
       93 France
       95 Benchmarking
       98 ST Real World Performance


   2000 Dot.Com
   2001 Quest
   2002 Oracle OEM 10g
   2005 Embarcadero
     DB Optimizer
 Delphix

When not being a Geek
 - Have a little 4 year old boy who takes up all my time
Typical Architecture

Production      Development     QA              UAT

  Instance        Instance        Instance        Instance

  Database        Database        Database        Database

  File system     File system     File system     File system
Database Virtualization

Source                      Clones
Production                  Development     QA             UAT

  Instance                    Instance        Instance       Instance

  Database         NFS        vDatabase       vDatabase      vDatabase

  File system

       Fibre Channel
Which to use?
DAS is out of the picture
Fibre Channel
NFS - available everywhere
NFS is attractive but is it fast enough?
DAS vs NAS vs SAN

       Attach           Agile   Cheap   Maintenance   Speed
DAS    SCSI             no      yes     difficult     fast
NAS    NFS - Ethernet   yes     yes     easy          ??
SAN    Fibre Channel    yes     no      difficult     fast
Speed

Ethernet
• 100Mb  - 1994
• 1GbE   - 1998
• 10GbE  - 2003
• 40GbE  - 2012
• 100GbE - est. 2013

Fibre Channel
• 1G  - 1998
• 2G  - 2003
• 4G  - 2005
• 8G  - 2008
• 16G - 2011
Ethernet vs Fibre Channel
Throughput vs Latency
Throughput

8 bits = 1 Byte

 1GbE  ~= 100 MB/sec
    30-60 MB/sec typical, single threaded, MTU 1500
    90-115 MB/sec clean topology
    125 MB/sec max
 10GbE ~= 1000 MB/sec

                                        Throughput ~= size of pipe
Throughput : netio

   Server                                  Client        (Throughput ~= size of pipe)

 Server:
   netio -s -t
 Client:
   netio -t server

 Packet size  1k bytes:  51.30 MByte/s Tx,  6.17 MByte/s Rx.
 Packet size  2k bytes: 100.10 MByte/s Tx, 12.29 MByte/s Rx.
 Packet size  4k bytes:  96.48 MByte/s Tx, 18.75 MByte/s Rx.
 Packet size  8k bytes: 114.38 MByte/s Tx, 30.41 MByte/s Rx.
 Packet size 16k bytes: 112.20 MByte/s Tx, 19.46 MByte/s Rx.
 Packet size 32k bytes: 114.53 MByte/s Tx, 35.11 MByte/s Rx.
Routers: traceroute

$ traceroute 172.16.100.144
1 172.16.100.144 (172.16.100.144) 0.226 ms 0.171 ms 0.123 ms



$ traceroute 101.88.123.195
1 101.88.229.181 (101.88.229.181)   0.761 ms   0.579 ms   0.493 ms
2 101.88.255.169 (101.88.255.169)   0.310 ms   0.286 ms   0.279 ms
3 101.88.218.166 (101.88.218.166)   0.347 ms   0.300 ms   0.986 ms
4 101.88.123.195 (101.88.123.195)   1.704 ms   1.972 ms   1.263 ms
            Tries each step 3 times
            Last line should be the total latency

                                                 Latency ~= length of pipe
Wire Speed - where is the hold up?

   Scale:   ms . us . ns   (0.000 000 000)

Light ~= 0.3 m/ns
Wire  ~= 0.2 m/ns  (5 ns/m, ~5 us/km)
RAM: ns range
Data Center 10 m = 50 ns

Physical 8K random disk read    ~= 7 ms
Physical small sequential write ~= 1 ms

LA to London = 30 ms
LA to SF     =  3 ms
4G FC vs 10GbE

Why would FC be faster?

8K block transfer times

       8Gb FC        = 10 us

       10GbE         =  8 us
More stack, more latency

NFS traverses more layers than FC:

                    Oracle
                     NFS
                     TCP
                     NIC
                    Network
                     NIC
                     TCP
                     NFS
                    Filesystem
                    Cache/spindle
Oracle and Sun benchmark

200 us more overhead for NFS
 • 350 us without jumbo frames
 • 1 ms because of network topology

         8K blocks, 1GbE with jumbo frames, Solaris 9, Oracle 9.2
         Database Performance with NAS: Optimizing Oracle on NFS
         Revised May 2009 | TR-3322
         http://media.netapp.com/documents/tr-3322.pdf
8K block NFS latency overhead

   8k wire transfer latency
    1 GbE -> 80 us
    10 GbE -> 8 us

   Upgrading 1GbE to 10GbE
    80 us – 8 us = 72 us faster 8K transfer

   8K NFS on 1GbE was 200us, so on 10GbE should be
    200 -72 = 128us

     (0.128 ms/7 ms) * 100 = 2% latency increase over
     DAS
NFS why the bad reputation?

 Given a 2% overhead, why the reputation?
 Historically slower
 Bad infrastructure
  1. Network topology and load
  2. NFS mount options
  3. TCP configuration
 Compounding issues
   Oracle configuration
   I/O subsystem response
Performance issues
1. Network
   a) Topology
   b) NICs and mismatches
   c) Load
2. TCP
   a) MTU “Jumbo Frames”
   b) window
   c) congestion window
3. NFS mount options
1. a) Network Topology


 Hubs
 Routers
 Switches
HUBs - no HUBs allowed

  Layer  Name           Level       Device
  7      Application
  6      Presentation
  5      Session
  4      Transport
  3      Network        IP addr     Routers
  2      Datalink       MAC addr    Switches
  1      Physical       Wire        Hubs

  Hubs:
    • Broadcast, repeaters
    • Risk of collisions
    • Bandwidth contention
Multiple Switches – generally fast

 Two types of Switches
    Store and Forward
        1GbE 50-70us
        10GbE 5-35us
     Cut through
        10GbE .3-.5us



Layer  Name       Device      Level
3      Network    Routers     IP addr
2      Datalink   Switches    MAC addr
1      Physical   Hubs        Wire
Routers - as few as possible
 Routers can add 300-500 us latency
 If NFS latency is 350 us (a typical non-tuned system)
 then each router multiplies latency: 2x, 3x, 4x, etc.

   Layer  Name       Level       Device
   3      Network    IP addr     Routers
   2      Datalink   MAC addr    Switches
   1      Physical   Wire        Hubs
Every router adds ~ 0.5 ms
6 routers = 6 * 0.5 = 3 ms

Chart: network overhead vs NFS physical read time on a 6.0 ms avg I/O,
comparing a fast network (0.2 ms) with a slow network (3 ms).
1. b) NICs and Hardware mismatch

NICs
    10GbE
    1GbE
      Intel (problems?)
      Broadcom (ok?)

Hardware mismatch: speeds and duplex are often negotiated

Example, Linux:
  $ ethtool eth0
  Settings for eth0:
          Advertised auto-negotiation: Yes
          Speed: 100Mb/s
          Duplex: Half

Check that the values are as expected (see the sketch below).
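If auto-negotiation has settled on the wrong values, one option is to force them. A minimal sketch, assuming a Linux host with an interface named eth0 (a hypothetical name; making the change persistent depends on the distribution):

  # force 1Gb/s full duplex and turn off auto-negotiation on eth0
  ethtool -s eth0 speed 1000 duplex full autoneg off

  # confirm the result
  ethtool eth0 | egrep 'Speed|Duplex|Auto-negotiation'

Fixing the cabling or the switch port is usually the better long-term answer than pinning speeds.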
Hardware mismatch


  Full Duplex - two way simultaneous GOOD




  Half Duplex - one way but both lines BAD
1. c) Busy Network
 Traffic can congest the network
    Causes dropped packets and retransmissions
    Out-of-order packets
    Collisions on hubs, probably not with switches
 Analysis
    netstat
    netio
Busy Network Monitoring

 Visibility difficult from any one machine
    Client
    Server
    Switch(es)
$ netstat -s -P tcp 1
TCP tcpRtoAlgorithm      =           4   tcpRtoMin           =     400
     tcpRetransSegs          =    5986   tcpRetransBytes     = 8268005
     tcpOutAck               =49277329   tcpOutAckDelayed    = 473798
     tcpInDupAck             = 357980    tcpInAckUnsent      =       0
     tcpInUnorderSegs        =10048089   tcpInUnorderBytes   =16611525
     tcpInDupSegs            =   62673   tcpInDupBytes       =87945913
     tcpInPartDupSegs        =      15   tcpInPartDupBytes   =     724
     tcpRttUpdate            = 4857114   tcpTimRetrans       =    1191
     tcpTimRetransDrop       =       6   tcpTimKeepalive     =     248
Busy Network Testing
 Netio is available here:
        http://www.ars.de/ars/ars.nsf/docs/netio
On Server box:
 netio -s -t

On Target box:
 netio -t server_machine

 NETIO - Network Throughput Benchmark, Version 1.31
 (C) 1997-2010 Kai Uwe Rommel
 TCP server listening.
 TCP connection established ...
 Receiving from client, packet size 32k ... 104.37 MByte/s
 Sending to client, packet size 32k ...   109.27 MByte/s
 Done.
2. TCP Configuration



a) MTU (Jumbo Frames)
b) TCP window
c) TCP congestion window
2. a) MTU 9000 : Jumbo Frames

 MTU - Maximum Transfer Unit
   Typically 1500
   Can be set to 9000
   All components have to support it
     Otherwise errors and/or hangs
Jumbo Frames : MTU 9000

8K block transfer

 Test 1: default MTU 1500          Test 2: now with MTU 9000
    delta     send      recd          delta     send     recd
                    <-- 164                           <-- 164
      152      132  -->                 273     8324  -->
       40     1448  -->
       67     1448  -->
       66     1448  -->
       53     1448  -->
       87     1448  -->
       95      952  -->
    = 560 us

 Change MTU:
   # ifconfig eth1 mtu 9000 up

 Warning: MTU 9000 can hang if any of the hardware in the connection
 is configured only for MTU 1500 (see the path check below).
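Before trusting jumbo frames end to end, it helps to confirm the whole path actually passes 9000-byte frames. A minimal sketch for a Linux client; 8972 bytes = 9000 minus 28 bytes of IP + ICMP headers, and nfs_server is a hypothetical hostname:

  # send non-fragmentable 8972-byte payloads; this only succeeds if every hop supports MTU 9000
  ping -M do -s 8972 -c 3 nfs_server

If the pings fail with "message too long" or simply time out, some device in the path is still limited to MTU 1500.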
2. b) TCP Socket Buffer
 Set the maximums for the
    socket buffer
    TCP window
 If the maximum is reached, packets are dropped.

 LINUX

  Socket buffer sizes:
    sysctl -w net.core.wmem_max=8388608
    sysctl -w net.core.rmem_max=8388608

  TCP window sizes:
    sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
    sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"

                   Excellent book
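The sysctl -w commands above take effect immediately but do not survive a reboot. A minimal sketch of persisting them, assuming a Linux host that reads /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/) at boot:

  # /etc/sysctl.conf
  net.core.wmem_max = 8388608
  net.core.rmem_max = 8388608
  net.ipv4.tcp_rmem = 4096 87380 8388608
  net.ipv4.tcp_wmem = 4096 87380 8388608

  # apply the file without rebooting
  sysctl -p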
TCP window sizes

 Maximum data sent/received before an acknowledgement is required
   A subset of the TCP socket buffer sizes

Calculating (bandwidth-delay product)

    window = latency * throughput

Ex: 1 ms latency, 1Gb NIC

    = 1 Gb/sec * 0.001 s = 1 Mb = 1 Mb * 1 Byte/8 bits = 125 KB
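The same bandwidth-delay arithmetic for a 10GbE link, as a sketch (the 1 ms latency is just an assumed example value):

    10 Gb/sec * 0.001 s = 10 Mb = 10 Mb * 1 Byte/8 bits = 1.25 MB

so the 8388608-byte (8 MB) maximums used in the sysctl examples leave plenty of headroom.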
2. c) Congestion window

delta      bytes     bytes     unack      cong        send
us         sent      recvd     bytes      window      window

  31       1448                139760     144800      195200
  33       1448                139760     144800      195200
  29       1448                144104     146248      195200
  31            /       0      145552     144800      195200
  41       1448                145552     147696      195200
  30            /       0      147000     144800      195200
  22       1448                147000      76744      195200
                                           ^ congestion window size
                                             is drastically lowered

Data collected with DTrace, but similar data could be gathered with
snoop and the capture loaded into Wireshark.
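The data above came from DTrace (a script along these lines appears at the end of this deck). On a Linux client, a hedged alternative for watching the congestion window is the ss utility from iproute2:

  # dump TCP socket internals; the line for the NFS server's connection shows
  # cwnd:, ssthresh:, rto: and retrans: counters
  ss -ti

Repeated runs during a slow period show whether cwnd keeps collapsing.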
3. NFS mount options

      a) Forcedirectio – be careful
      b) rsize , wsize - increase
      c) Actimeo=0, noac – don’t use (unless on RAC)


Sun Solaris   rw,bg,hard,rsize=32768,wsize=32768,vers=3,[forcedirectio or llock],nointr,proto=tcp,suid

AIX           rw,bg,hard,rsize=32768,wsize=32768,vers=3,cio,intr,timeo=600,proto=tcp

HPUX          rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,proto=tcp, suid, forcedirectio

Linux         rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp,actimeo=0
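For concreteness, a hedged sketch of mounting a datafile filesystem on Linux with the options from the table above (nfs_server:/export/oradata and /u02/oradata are hypothetical names):

  mount -t nfs -o rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp,actimeo=0 \
        nfs_server:/export/oradata /u02/oradata

The same options normally go in /etc/fstab so the filesystem comes back after a reboot.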
3 a) Forcedirectio

 Skip the UNIX cache
 Read directly into the SGA
 Controlled by init.ora
    filesystemio_options=SETALL (or DIRECTIO)
    Except on HPUX

    Sun Solaris   forcedirectio mount option forces direct I/O but is not required;
                  filesystemio_options will set direct I/O without the mount option

    AIX           filesystemio_options sets direct I/O (cio mount option available)
    Linux         filesystemio_options sets direct I/O

    HPUX          forcedirectio is the only way to set direct I/O;
                  filesystemio_options has no effect
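A minimal sketch of enabling it from the database side on the platforms where the parameter works (filesystemio_options is a static parameter, so the instance has to be restarted afterwards); assumes a local "/ as sysdba" login:

  # hedged example: request direct + asynchronous I/O for datafile access
  echo "ALTER SYSTEM SET filesystemio_options = SETALL SCOPE = SPFILE;" | sqlplus -s / as sysdba
  # then restart the instance for the change to take effect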
Direct I/O

Example query

   77951 physical reads , 2nd execution
   (ie when data should already be cached)


 5 secs => no direct I/O
 60 secs => direct I/O



 Why use direct I/O?
By default, blocks are cached in both the Oracle data block cache (SGA)
and the Unix file system cache. With direct I/O, the Unix file system
cache can't be used - only the SGA caches the blocks.
Direct I/O

 Advantages
    Faster reads from disk
    Reduce CPU
    Reduce memory contention
    Faster access to data already in memory, in SGA
 Disadvantages
    Less Flexible
    Risk of paging , memory pressure
    Can’t share memory between multiple databases




    http://blogs.oracle.com/glennf/entry/where_do_you_cache_oracle
Diagram: the same query against three Oracle data block cache (SGA) /
Unix file system cache configurations - 5 seconds, 60 seconds, and
2 seconds.
Read latency by where the read is satisfied:

 = UNIX file system cache hit     < 0.2 ms
 = NAS/SAN storage cache hit      < 0.5 ms
 = Disk read                      ~ 6 ms

(An SGA buffer cache miss falls through to the file system cache,
then to the SAN/NAS storage cache, and finally to a disk read.)
3. b) ACTIMEO=0 , NOAC

   Disables the client-side file attribute cache
   Increases the number of NFS calls
   Increases latency
   Avoid on single-instance Oracle
   One Metalink note says it's required on LINUX (calls it "actime")
   Another Metalink note says it should be taken off

=> It should be taken off
3. c) rsize/wsize

 NFS transfer buffer size
 Oracle says use 32K
 Platforms support higher values, which can significantly
  improve throughput

       Sun Solaris   rsize=32768,wsize=32768 , max is 1M
       AIX           rsize=32768,wsize=32768 , max is 64K
       HPUX          rsize=32768,wsize=32768 , max is 1M
       Linux         rsize=32768,wsize=32768 , max is 1M

    On full table scans, using 1M has halved the response time compared to 32K.
    db_file_multiblock_read_count has to be large enough to take advantage of
    the larger size.
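The client and server negotiate these values downward, so it is worth checking what was actually agreed after mounting. A hedged example (nfsstat -m exists on both Linux and Solaris and prints the options in effect for each NFS mount):

  # show the mount options actually in effect, including the negotiated rsize/wsize
  nfsstat -m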
Max rsize , wsize

 1M
    Sun Solaris Sparc
    HP/UX
    Linux
 64K
    AIX
    Patch to increase it to 512K
         http://www-01.ibm.com/support/docview.wss?uid=isg1IV24594
         http://www-01.ibm.com/support/docview.wss?uid=isg1IV24688

         http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=6100-08-00-1241
         http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=7100-02-00-1241
NFS Overhead: Physical vs Cached I/O

Chart: latency (ms) broken into storage (SAN), cache, and physical
components, with NFS + network read overhead stacked on top, for six
cases: phys vs NFS, phys multi switch, phys routers, cached vs nfs,
cached vs switches, cached vs routers.
NFS Overhead: Physical vs Cached I/O

 100 us NFS overhead + 6.0 ms physical read -> negligible
 100 us NFS overhead + 0.1 ms cached read   -> 2x slower

 SAN cache is expensive -> use it for write cache
 Target cache is cheaper -> add more

 Takeaway:

   Move cache to the client boxes
   • cheaper
   • quicker

   Save SAN cache for writeback
Advanced
 LINUX
    RPC queues
    iostat.py
 Solaris
    Client max read size is 32K; boost it to 1M
    NFS server default threads: 16
 tcpdump
Issues: LINUX rpc queue


  On LINUX, in /etc/sysctl.conf modify

      sunrpc.tcp_slot_table_entries = 128

  then do

      sysctl -p

  then check the setting using

      sysctl -A | grep sunrpc

  NFS partitions will have to be unmounted and remounted
  Not persistent across reboot
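A commonly used way to make the setting survive reboots is a modprobe options file so it is applied whenever the sunrpc module loads (a hedged sketch; on newer kernels the relevant knob may be sunrpc.tcp_max_slot_table_entries instead, so check which parameter your kernel exposes):

  # /etc/modprobe.d/sunrpc.conf
  options sunrpc tcp_slot_table_entries=128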
Linux tools: iostat.py


$ ./iostat.py -1

172.16.100.143:/vdb17 mounted on /mnt/external:

 op/s rpc bklog
 4.20    0.00

read:   ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms)
      0.000 0.000 0.000 0 (0.0%)  0.000      0.000
write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms)
      0.000 0.000 0.000 0 (0.0%)  0.000      0.000
Solaris Client max read size
 nfs3_bsize
    Defaults to 32K
    Set it to 1M


      mdb -kw
      > nfs3_bsize/D
        nfs3_bsize: 32768
      > nfs3_bsize/W 100000
        nfs3_bsize: 0xd = 0x100000
      >
Issues: Solaris NFS Server threads



sharectl get -p servers nfs



sharectl set -p servers=512 nfs
svcadm refresh nfs/server
Tcpdump Analysis

                        LINUX     Solaris
   Oracle                58 ms     47 ms
   NFS/TCP (client)        ?         ?
   Network                 ?         ?
   TCP/NFS (server)        ?         ?
   NFS server             0.1 ms     2 ms

Stack traced: Oracle -> NFS -> TCP -> Network -> TCP -> NFS -> Cache/SAN
Capture points:

   Oracle
    TCP        <- snoop / tcpdump on the client
   Network
    TCP        <- snoop on the server
    NFS
Wireshark : analyze TCP dumps
  yum install wireshark
  wireshark + perl
     find common NFS requests
         NFS client
         NFS server
     display times for
         NFS Client
         NFS Server
         Delta



https://github.com/khailey/tcpdump
Capture on the client (Linux):

   tcpdump -s 512 host server_machine -w client.cap

Capture on the NFS server (Solaris):

   snoop -q -d aggr0 -s 512 -o server.cap client_machine

(tcpdump captures just below the client's TCP stack; snoop captures at
the server, just above its TCP/NFS stack.)
parsetcp.pl nfs_server.cap nfs_client.cap
  Parsing nfs server trace: nfs_server.cap
  type    avg ms count
    READ : 44.60, 7731

  Parsing client trace: nfs_client.cap
  type    avg ms count
    READ : 46.54, 15282

   ==================== MATCHED DATA ============
  READ
  type      avg ms
   server : 48.39,
   client : 49.42,
    diff : 1.03,
  Processed 9647 packets (Matched: 5624 Missed: 4023)
parsetcp.pl server.cap client.cap
   Parsing NFS server trace: server.cap
   type    avg ms count
     READ : 1.17, 9042

   Parsing client trace: client.cap
   type    avg ms count
     READ : 1.49, 21984

   ==================== MATCHED DATA ============
   READ
   type      avg ms count
    server : 1.03
    client : 1.49
     diff : 0.46
                 Linux    Solaris   Tool          Source
Oracle           58 ms    47 ms     oramon.sh     "db file sequential read" wait
                                                  (basically a timing of "pread"
                                                  for 8K random reads)
NFS/TCP client   1.5 ms   45 ms     tcpparse.sh   tcpdump on LINUX,
                                                  snoop on Solaris
Network          0.5 ms    1 ms                   delta
TCP server       1 ms     44 ms     tcpparse.sh   snoop
NFS server       0.1 ms    2 ms     DTrace        nfs:::op-read-start /
                                                  nfs:::op-read-done
netperf.sh

mss: 1448
 local_recv_size (beg,end): 128000 128000
 local_send_size (beg,end):  49152   49152
remote_recv_size (beg,end):   87380 3920256
remote_send_size (beg,end):    16384  16384

mn_ms av_ms max_ms s_KB r_KB r_MB/s s_MB/s <100u <500u <1ms <5ms <10ms <5
 .08 .12 10.91                15.69 83.92 .33 .38 .01 .01           .12 .54
 .10 .16 12.25 8      48.78          99.10 .30 .82 .07 .08           .15 .57
 .10 .14 5.01       8     54.78     99.04 .88 .96                .15 .60
 .22 .34 63.71 128     367.11         97.50 1.57 2.42 .06 .07 .01         .35 .93
 .43 .60 16.48     128     207.71      84.86 11.75 15.04 .05 .10          .90 1.42
 .99 1.30 412.42 1024    767.03               .05 99.90 .03 .08   .03 1.30 2.25
1.77 2.28 15.43     1024     439.20             99.27 .64 .73         2.65 5.35
Summary
 Test
    traceroute : routers and latency
    netio : throughput
 NFS is close to FC if the network topology is clean
    Mount
       rsize/wsize at maximum
       Avoid actimeo=0 and noac
       Use noatime
    Jumbo frames
 Drawbacks
    Requires a clean topology
    NFS failover can take 10s of seconds
    (Oracle 11g dNFS client handles mount issues transparently)
Conclusion: Give NFS some love




                       NFS




         A gigabit switch can be anywhere from
         10 to 50 times cheaper than an FC switch

                 http://github.com/khailey
dtrace

List the names of traceable probes:

  dtrace -ln provider:module:function:name

  • -l = list instead of enable probes
  • -n = Specify probe name to trace or list
  • -v = Set verbose mode

Example:

  dtrace -ln tcp:::send

  $ dtrace -lvn tcp:::receive
  5473 tcp ip tcp_output send

      Argument Types
          args[0]: pktinfo_t *
          args[1]: csinfo_t *
          args[2]: ipinfo_t *
          args[3]: tcpsinfo_t *
          args[4]: tcpinfo_t *

http://cvs.opensolaris.org/source/
Dtrace

tcp:::send
{   delta = timestamp - walltime;
    walltime = timestamp;
    printf("%6d %6d %6d %8d %8s %8d %8d %8d %d\n",
        args[3]->tcps_snxt - args[3]->tcps_suna,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_rwnd,
        args[3]->tcps_cwnd,
        args[3]->tcps_retransmit
      );
}
tcp:::receive
{     delta = timestamp - walltime;
      walltime = timestamp;
      printf("%6d %6d %6d %8s / %-8d %8d %8d %8d %d\n",
        args[3]->tcps_snxt - args[3]->tcps_suna,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_rwnd,
        args[3]->tcps_cwnd,
        args[3]->tcps_retransmit
      );
}
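A hedged note on running this: saved to a file (tcpwin.d is a hypothetical name) on the Solaris host, the clauses above can be executed system-wide with:

  # -q suppresses the default probe output so only the printf columns appear
  dtrace -q -s tcpwin.d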
Delayed Acknowledgement

Solaris

   ndd -set /dev/tcp       tcp_max_buf 8388608
   ndd -set /dev/tcp       tcp_recv_hiwat 4194304
   ndd -set /dev/tcp       tcp_xmit_hiwat 4194304
   ndd -set /dev/tcp       tcp_cwnd_max 8388608


    mdb -kw
    > nfs3_bsize/D
      nfs3_bsize: 32768
    > nfs3_bsize/W 100000

    nfs3_bsize:    0xd         =     0x100000
    >

    add it to /etc/system for use on reboot

     set nfs:nfs3_bsize= 0x100000
Direct I/O Challenges


Database Cache usage over 24 hours




           DB1                DB2    DB3
           Europe             US     Asia

More Related Content

What's hot

Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
Building a Virtualized Continuum with Intel(r) Clear Containers
Building a Virtualized Continuum with Intel(r) Clear ContainersBuilding a Virtualized Continuum with Intel(r) Clear Containers
Building a Virtualized Continuum with Intel(r) Clear ContainersMichelle Holley
 
Deeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay NetworksDeeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay NetworksLaurent Bernaille
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community6WIND
 
Deep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksDeep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksLaurent Bernaille
 
Troubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support EngineerTroubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support EngineerJeff Anderson
 
Deep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksDeep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksLaurent Bernaille
 
How to Speak Intel DPDK KNI for Web Services.
How to Speak Intel DPDK KNI for Web Services.How to Speak Intel DPDK KNI for Web Services.
How to Speak Intel DPDK KNI for Web Services.Naoto MATSUMOTO
 
Ceph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To EnterpriseCeph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To EnterpriseAlex Lau
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudPatrick McGarry
 
Implementing distributed mclock in ceph
Implementing distributed mclock in cephImplementing distributed mclock in ceph
Implementing distributed mclock in ceph병수 박
 
Practical Tips for Novell Cluster Services
Practical Tips for Novell Cluster ServicesPractical Tips for Novell Cluster Services
Practical Tips for Novell Cluster ServicesNovell
 
High Availability with Novell Cluster Services for Novell Open Enterprise Ser...
High Availability with Novell Cluster Services for Novell Open Enterprise Ser...High Availability with Novell Cluster Services for Novell Open Enterprise Ser...
High Availability with Novell Cluster Services for Novell Open Enterprise Ser...Novell
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...Jim St. Leger
 
Linux network stack
Linux network stackLinux network stack
Linux network stackTakuya ASADA
 

What's hot (20)

Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Building a Virtualized Continuum with Intel(r) Clear Containers
Building a Virtualized Continuum with Intel(r) Clear ContainersBuilding a Virtualized Continuum with Intel(r) Clear Containers
Building a Virtualized Continuum with Intel(r) Clear Containers
 
Deeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay NetworksDeeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay Networks
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community
 
Deep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksDeep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay Networks
 
Troubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support EngineerTroubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support Engineer
 
Dpdk performance
Dpdk performanceDpdk performance
Dpdk performance
 
Deep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksDeep dive in Docker Overlay Networks
Deep dive in Docker Overlay Networks
 
How to Speak Intel DPDK KNI for Web Services.
How to Speak Intel DPDK KNI for Web Services.How to Speak Intel DPDK KNI for Web Services.
How to Speak Intel DPDK KNI for Web Services.
 
Ceph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To EnterpriseCeph Day Bring Ceph To Enterprise
Ceph Day Bring Ceph To Enterprise
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
 
Implementing distributed mclock in ceph
Implementing distributed mclock in cephImplementing distributed mclock in ceph
Implementing distributed mclock in ceph
 
Practical Tips for Novell Cluster Services
Practical Tips for Novell Cluster ServicesPractical Tips for Novell Cluster Services
Practical Tips for Novell Cluster Services
 
Cl306
Cl306Cl306
Cl306
 
High Availability with Novell Cluster Services for Novell Open Enterprise Ser...
High Availability with Novell Cluster Services for Novell Open Enterprise Ser...High Availability with Novell Cluster Services for Novell Open Enterprise Ser...
High Availability with Novell Cluster Services for Novell Open Enterprise Ser...
 
Userspace networking
Userspace networkingUserspace networking
Userspace networking
 
Front-End Optimization (FEO)
Front-End Optimization (FEO)Front-End Optimization (FEO)
Front-End Optimization (FEO)
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
 
Linux network stack
Linux network stackLinux network stack
Linux network stack
 

Similar to Collaborate nfs kyle_final

LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linuxbrouer
 
Audio video ethernet (avb cobra net dante)
Audio video ethernet (avb cobra net dante)Audio video ethernet (avb cobra net dante)
Audio video ethernet (avb cobra net dante)Jeff Green
 
Memory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationMemory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationBigstep
 
High perf-networking
High perf-networkingHigh perf-networking
High perf-networkingmtimjones
 
Avb pov 2017 v2
Avb pov 2017 v2Avb pov 2017 v2
Avb pov 2017 v2Jeff Green
 
Wireless Troubleshooting Tips using AirPcaps DFS Module Debugging
Wireless Troubleshooting Tips using AirPcaps DFS Module DebuggingWireless Troubleshooting Tips using AirPcaps DFS Module Debugging
Wireless Troubleshooting Tips using AirPcaps DFS Module DebuggingMegumi Takeshita
 
Zigbee intro v5
Zigbee intro v5Zigbee intro v5
Zigbee intro v5rajrayala
 
2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance Considerations2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance ConsiderationsShawn Wells
 
PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...
PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...
PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...PROIDEA
 
On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...Jorge E. López de Vergara Méndez
 
ARM LPC2300/LPC2400 TCP/IP Stack Porting
ARM LPC2300/LPC2400 TCP/IP Stack PortingARM LPC2300/LPC2400 TCP/IP Stack Porting
ARM LPC2300/LPC2400 TCP/IP Stack PortingMathivanan Elangovan
 
Data centre networking at London School of Economics and Political Science - ...
Data centre networking at London School of Economics and Political Science - ...Data centre networking at London School of Economics and Political Science - ...
Data centre networking at London School of Economics and Political Science - ...Jisc
 
100G Networking Berlin.pdf
100G Networking Berlin.pdf100G Networking Berlin.pdf
100G Networking Berlin.pdfJunZhao68
 
Telecommunications: Wireless Networks
Telecommunications: Wireless NetworksTelecommunications: Wireless Networks
Telecommunications: Wireless NetworksNapier University
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHungWei Chiu
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughThomas Graf
 

Similar to Collaborate nfs kyle_final (20)

LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
 
Audio video ethernet (avb cobra net dante)
Audio video ethernet (avb cobra net dante)Audio video ethernet (avb cobra net dante)
Audio video ethernet (avb cobra net dante)
 
Memory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationMemory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and Virtualization
 
100 M pps on PC.
100 M pps on PC.100 M pps on PC.
100 M pps on PC.
 
High perf-networking
High perf-networkingHigh perf-networking
High perf-networking
 
Avb pov 2017 v2
Avb pov 2017 v2Avb pov 2017 v2
Avb pov 2017 v2
 
Wireless Troubleshooting Tips using AirPcaps DFS Module Debugging
Wireless Troubleshooting Tips using AirPcaps DFS Module DebuggingWireless Troubleshooting Tips using AirPcaps DFS Module Debugging
Wireless Troubleshooting Tips using AirPcaps DFS Module Debugging
 
Zigbee intro v5
Zigbee intro v5Zigbee intro v5
Zigbee intro v5
 
2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance Considerations2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance Considerations
 
PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...
PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...
PLNOG 13: Alexis Dacquay: Handling high-bandwidth-consumption applications in...
 
On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...
 
ARM LPC2300/LPC2400 TCP/IP Stack Porting
ARM LPC2300/LPC2400 TCP/IP Stack PortingARM LPC2300/LPC2400 TCP/IP Stack Porting
ARM LPC2300/LPC2400 TCP/IP Stack Porting
 
Data centre networking at London School of Economics and Political Science - ...
Data centre networking at London School of Economics and Political Science - ...Data centre networking at London School of Economics and Political Science - ...
Data centre networking at London School of Economics and Political Science - ...
 
100G Networking Berlin.pdf
100G Networking Berlin.pdf100G Networking Berlin.pdf
100G Networking Berlin.pdf
 
VOIP QOS
VOIP QOSVOIP QOS
VOIP QOS
 
Telecommunications: Wireless Networks
Telecommunications: Wireless NetworksTelecommunications: Wireless Networks
Telecommunications: Wireless Networks
 
QNAP TS-832PX-4G.pdf
QNAP TS-832PX-4G.pdfQNAP TS-832PX-4G.pdf
QNAP TS-832PX-4G.pdf
 
slides
slidesslides
slides
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
 

More from Kyle Hailey

Hooks in postgresql by Guillaume Lelarge
Hooks in postgresql by Guillaume LelargeHooks in postgresql by Guillaume Lelarge
Hooks in postgresql by Guillaume LelargeKyle Hailey
 
Performance insights twitch
Performance insights twitchPerformance insights twitch
Performance insights twitchKyle Hailey
 
History of database monitoring
History of database monitoringHistory of database monitoring
History of database monitoringKyle Hailey
 
Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Kyle Hailey
 
Successfully convince people with data visualization
Successfully convince people with data visualizationSuccessfully convince people with data visualization
Successfully convince people with data visualizationKyle Hailey
 
Virtual Data : Eliminating the data constraint in Application Development
Virtual Data :  Eliminating the data constraint in Application DevelopmentVirtual Data :  Eliminating the data constraint in Application Development
Virtual Data : Eliminating the data constraint in Application DevelopmentKyle Hailey
 
DBTA Data Summit : Eliminating the data constraint in Application Development
DBTA Data Summit : Eliminating the data constraint in Application DevelopmentDBTA Data Summit : Eliminating the data constraint in Application Development
DBTA Data Summit : Eliminating the data constraint in Application DevelopmentKyle Hailey
 
Accelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual DataAccelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual DataKyle Hailey
 
Delphix and Pure Storage partner
Delphix and Pure Storage partnerDelphix and Pure Storage partner
Delphix and Pure Storage partnerKyle Hailey
 
Mark Farnam : Minimizing the Concurrency Footprint of Transactions
Mark Farnam  : Minimizing the Concurrency Footprint of TransactionsMark Farnam  : Minimizing the Concurrency Footprint of Transactions
Mark Farnam : Minimizing the Concurrency Footprint of TransactionsKyle Hailey
 
Dan Norris: Exadata security
Dan Norris: Exadata securityDan Norris: Exadata security
Dan Norris: Exadata securityKyle Hailey
 
Martin Klier : Volkswagen for Oracle Guys
Martin Klier : Volkswagen for Oracle GuysMartin Klier : Volkswagen for Oracle Guys
Martin Klier : Volkswagen for Oracle GuysKyle Hailey
 
Data as a Service
Data as a Service Data as a Service
Data as a Service Kyle Hailey
 
Data Virtualization: Revolutionizing data cloning
Data Virtualization: Revolutionizing data cloningData Virtualization: Revolutionizing data cloning
Data Virtualization: Revolutionizing data cloning Kyle Hailey
 
BGOUG "Agile Data: revolutionizing database cloning'
BGOUG  "Agile Data: revolutionizing database cloning'BGOUG  "Agile Data: revolutionizing database cloning'
BGOUG "Agile Data: revolutionizing database cloning'Kyle Hailey
 
Denver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualizationDenver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualizationKyle Hailey
 
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Kyle Hailey
 
Jonathan Lewis explains Delphix
Jonathan Lewis explains Delphix Jonathan Lewis explains Delphix
Jonathan Lewis explains Delphix Kyle Hailey
 
Oaktable World 2014 Toon Koppelaars: database constraints polite excuse
Oaktable World 2014 Toon Koppelaars: database constraints polite excuseOaktable World 2014 Toon Koppelaars: database constraints polite excuse
Oaktable World 2014 Toon Koppelaars: database constraints polite excuseKyle Hailey
 

More from Kyle Hailey (20)

Hooks in postgresql by Guillaume Lelarge
Hooks in postgresql by Guillaume LelargeHooks in postgresql by Guillaume Lelarge
Hooks in postgresql by Guillaume Lelarge
 
Performance insights twitch
Performance insights twitchPerformance insights twitch
Performance insights twitch
 
History of database monitoring
History of database monitoringHistory of database monitoring
History of database monitoring
 
Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle Ash masters : advanced ash analytics on Oracle
Ash masters : advanced ash analytics on Oracle
 
Successfully convince people with data visualization
Successfully convince people with data visualizationSuccessfully convince people with data visualization
Successfully convince people with data visualization
 
Virtual Data : Eliminating the data constraint in Application Development
Virtual Data :  Eliminating the data constraint in Application DevelopmentVirtual Data :  Eliminating the data constraint in Application Development
Virtual Data : Eliminating the data constraint in Application Development
 
DBTA Data Summit : Eliminating the data constraint in Application Development
DBTA Data Summit : Eliminating the data constraint in Application DevelopmentDBTA Data Summit : Eliminating the data constraint in Application Development
DBTA Data Summit : Eliminating the data constraint in Application Development
 
Accelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual DataAccelerate Develoment with VIrtual Data
Accelerate Develoment with VIrtual Data
 
Delphix and Pure Storage partner
Delphix and Pure Storage partnerDelphix and Pure Storage partner
Delphix and Pure Storage partner
 
Mark Farnam : Minimizing the Concurrency Footprint of Transactions
Mark Farnam  : Minimizing the Concurrency Footprint of TransactionsMark Farnam  : Minimizing the Concurrency Footprint of Transactions
Mark Farnam : Minimizing the Concurrency Footprint of Transactions
 
Dan Norris: Exadata security
Dan Norris: Exadata securityDan Norris: Exadata security
Dan Norris: Exadata security
 
Martin Klier : Volkswagen for Oracle Guys
Martin Klier : Volkswagen for Oracle GuysMartin Klier : Volkswagen for Oracle Guys
Martin Klier : Volkswagen for Oracle Guys
 
What is DevOps
What is DevOpsWhat is DevOps
What is DevOps
 
Data as a Service
Data as a Service Data as a Service
Data as a Service
 
Data Virtualization: Revolutionizing data cloning
Data Virtualization: Revolutionizing data cloningData Virtualization: Revolutionizing data cloning
Data Virtualization: Revolutionizing data cloning
 
BGOUG "Agile Data: revolutionizing database cloning'
BGOUG  "Agile Data: revolutionizing database cloning'BGOUG  "Agile Data: revolutionizing database cloning'
BGOUG "Agile Data: revolutionizing database cloning'
 
Denver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualizationDenver devops : enabling DevOps with data virtualization
Denver devops : enabling DevOps with data virtualization
 
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
 
Jonathan Lewis explains Delphix
Jonathan Lewis explains Delphix Jonathan Lewis explains Delphix
Jonathan Lewis explains Delphix
 
Oaktable World 2014 Toon Koppelaars: database constraints polite excuse
Oaktable World 2014 Toon Koppelaars: database constraints polite excuseOaktable World 2014 Toon Koppelaars: database constraints polite excuse
Oaktable World 2014 Toon Koppelaars: database constraints polite excuse
 

Collaborate nfs kyle_final

  • 1. NFS Tuning for Oracle Kyle Hailey http://dboptimizer.com April 2013
  • 2. Intro  Who am I  why NFS interests me  DAS / NAS / SAN  Throughput  Latency  NFS configuration issues*  Network topology  TCP config  NFS Mount Options * for non-RAC, non-dNFS
  • 3. Who is Kyle Hailey  1990 Oracle  90 support  92 Ported v6  93 France  95 Benchmarking  98 ST Real World Performance  2000 Dot.Com  2001 Quest  2002 Oracle OEM 10g Success! First successful OEM design
  • 4. Who is Kyle Hailey  1990 Oracle  90 support  92 Ported v6  93 France  95 Benchmarking  98 ST Real World Performance  2000 Dot.Com  2001 Quest  2002 Oracle OEM 10g  2005 Embarcadero  DB Optimizer
  • 5. Who is Kyle Hailey  1990 Oracle  90 support  92 Ported v6  93 France  95 Benchmarking  98 ST Real World Performance  2000 Dot.Com  2001 Quest  2002 Oracle OEM 10g  2005 Embarcadero  DB Optimizer  Delphix When not being a Geek - Have a little 4 year old boy who takes up all my time
  • 6. Typical Architecture Production Development QA UAT Instanc Instanc Instanc Instanc e e e e Database Database Database Database File system File system File system File system
  • 7. Database Virtualization Source Clones Production Development QA UAT Instance Instance Instance Instance NFS vDatabas vDatabas vDatabas Database e e e File system Fiber Channel
  • 9. DAS is out of the picture
  • 11.
  • 12. NFS - available everywhere
  • 13. NFS is attractive but is it fast enough?
  • 14. DAS vs NAS vs SAN attach Agile Cheap maintenanc spee e d DA SCSI no yes difficult fast S NA NFS - yes yes easy ?? S Ethernet SA Fibre yes no difficult fast N Channel
  • 15. speed Ethernet • 100Mb 1994 • 1GbE - 1998 • 10GbE – 2003 • 40GbE – 2012 • 100GE –est. 2013 Fibre Channel • 1G 1998 • 2G 2003 • 4G – 2005 • 8G – 2008 • 16G – 2011
  • 16. Ethernet vs Fibre Channel
  • 18. Throughput 8 bits = 1 Byte  1GbE ~= 100 MB/sec  30-60MB/sec typical, single threaded, mtu 1500  90-115MB clean topology  125MB/sec max  10GbE ~= 1000 MB/sec Throughput ~= size of pipe
  • 19. Throughput : netio Server Client Throughput ~= size of pipe Server netio –s -t Client netio –t server Packet size 1k bytes: 51.30 MByte/s Tx, 6.17 MByte/s Rx. Packet size 2k bytes: 100.10 MByte/s Tx, 12.29 MByte/s Rx. Packet size 4k bytes: 96.48 MByte/s Tx, 18.75 MByte/s Rx. Packet size 8k bytes: 114.38 MByte/s Tx, 30.41 MByte/s Rx. Packet size 16k bytes: 112.20 MByte/s Tx, 19.46 MByte/s Rx. Packet size 32k bytes: 114.53 MByte/s Tx, 35.11 MByte/s Rx.
  • 20. Routers: traceroute $ traceroute 172.16.100.144 1 172.16.100.144 (172.16.100.144) 0.226 ms 0.171 ms 0.123 ms $ traceroute 101.88.123.195 1 101.88.229.181 (101.88.229.181) 0.761 ms 0.579 ms 0.493 ms 2 101.88.255.169 (101.88.255.169) 0.310 ms 0.286 ms 0.279 ms 3 101.88.218.166 (101.88.218.166) 0.347 ms 0.300 ms 0.986 ms 4 101.88.123.195 (101.88.123.195) 1.704 ms 1.972 ms 1.263 ms Tries each step 3 times Last line should be the total latency latency ~= Length of pipe
  • 21. Wire Speed – where is the hold up? ms us ns 0.000 000 000 Wire 5ns/m Light = 0.3m/ns RAM Wire ~= 0.2m/ns Physical 8K random disk read ~= 7 Wire ~= 5us/km ms Physical small sequential write ~= 1 Data Center 10m = 50ns ms LA to London = 30ms LA to SF = 3ms
  • 22. 4G FC vs 10GbE Why would FC be faster? 8K block transfer times  8GB FC = 10us  10G Ethernet = 8us
  • 23. More stack more latency Oracle NFS TCP NIC Network NFS Not FC NIC TCP NFS Filesystem Cache/spindl e
  • 24. Oracle and SUN benchmark 200us overhead more for NFS • 350us without Jumbo frames • 1ms because of network topology 8K blocks 1GbE with Jumbo Frames , Solaris 9, Oracle 9.2 Database Performance with NAS: Optimizing Oracle on NFS Revised May 2009 | TR-3322 http://media.netapp.com/documents/tr-3322.pdf
  • 25. 8K block NFS latency overhead 8k wire transfer latency  1 GbE -> 80 us  10 GbE -> 8 us Upgrading 1GbE to 10GbE  80 us – 8 us = 72 us faster 8K transfer 8K NFS on 1GbE was 200us, so on 10GbE should be  200 -72 = 128us (0.128 ms/7 ms) * 100 = 2% latency increase over DAS
  • 26. NFS why the bad reputation?  Given 2 % overhead why the reputation?  Historically slower  Bad Infrastructure 1. Network topology and load 2. NFS mount options 3. TCP configuration  Compounding issues  Oracle configuration  I/O subsystem response
  • 27. Performance issues 1. Network a) Topology b) NICs and mismatches c) Load 2. TCP a) MTU “Jumbo Frames” b) window c) congestion window 3. NFS mount options
  • 28. 1. a) Network Topology  Hubs  Routers  Switches
  • 29. HUBs - no HUBs allowed Layer Name Level Device 7 Application 6 Presentation 5 Session 4 Transport 3 Network IP addr Routers 2 Datalink mac addr Switches 1 Physical Wire Hubs • Broadcast, repeaters • Risk collisions • Bandwidth contention
  • 30. Multiple Switches – generally fast  Two types of Switches  Store and Forward  1GbE 50-70us  10GbE 5-35us  Cut through  10GbE .3-.5us Laye Name r 3 Network Routers IP addr 2 Datalink Switches mac addr 1 Physical Hubs Wire
  • 31. Routers - as few as possible  Routers can add 300-500us latency  If NFS latency is 350us (typical non-tuned) system  Then each router multiplies latency 2x, 3x, 4x etc Layer Name 3 Network IP addr Routers 2 Datalink mac Switches addr 1 Physical Wire Hubs
  • 32. Every router adds ~ 0.5 ms 6 routers = 6 * .5 = 10 3ms 8 Network Overhead 6 NFS 6.0 ms avg I/O 4 physical 2 0 Fast 0.2ms Slow 3ms
  • 33. 1. b) NICs and Hardware mismatch NICs  10GbE  1GbE  Intel (problems?)  Broadcomm (ok?) Hardware Mismatch: Speeds and duplex are often negotiated $ ethtool eth0 Settings for eth0: Example Linux: Advertised auto-negotiation: Yes Speed: 100Mb/s Duplex: Half Check that values are as expected
  • 34. Hardware mismatch Full Duplex - two way simultaneous GOOD Half Duplex - one way but both lines BAD
  • 35. 1. c) Busy Network  Traffic can congest network  Caused drop packets and retransmissions  Out of order packets  Collisions on hubs, probably not with switches  Analysis  netstat  netio
  • 36. Busy Network Monitoring  Visibility difficult from any one machine  Client  Server  Switch(es) $ netstat -s -P tcp 1 TCP tcpRtoAlgorithm = 4 tcpRtoMin = 400 tcpRetransSegs = 5986 tcpRetransBytes = 8268005 tcpOutAck =49277329 tcpOutAckDelayed = 473798 tcpInDupAck = 357980 tcpInAckUnsent = 0 tcpInUnorderSegs =10048089 tcpInUnorderBytes =16611525 tcpInDupSegs = 62673 tcpInDupBytes =87945913 tcpInPartDupSegs = 15 tcpInPartDupBytes = 724 tcpRttUpdate = 4857114 tcpTimRetrans = 1191 tcpTimRetransDrop = 6 tcpTimKeepalive = 248
  • 37. Busy Network Testing Netio is available here: http://www.ars.de/ars/ars.nsf/docs/netio On Server box netio –s –t On Target box: netio -t server_machine NETIO - Network Throughput Benchmark, Version 1.31 (C) 1997-2010 Kai Uwe Rommel TCP server listening. TCP connection established ... Receiving from client, packet size 32k ... 104.37 MByte/s Sending to client, packet size 32k ... 109.27 MByte/s Done.
  • 38. 2. TCP Configuration a) MTU (Jumbo Frames) b) TCP window c) TCP congestion window
  • 39. 2. a) MTU 9000 : Jumbo Frames  MTU – maximum Transfer Unit  Typically 1500  Can be set 9000  All components have to support  Else error and/or hangs
  • 40. Jumbo Frames : MTU 9000 8K block transfer Test 1 Test 2 Default MTU 1500 Now with MTU 900 delta send recd delta send recd <-- 164 <-- 164 152 132 --> 273 8324 --> 40 1448 --> 67 1448 --> Change MTU 66 1448 --> # ifconfig eth1 mtu 9000 up 53 1448 --> 87 1448 --> Warning: MTU 9000 can hang 95 952 --> if any of the hardware in the = 560 connection is configured only for MTU 1500
  • 41. 2. b) TCP Socket Buffer  Set max  Socket Buffer  TCP Window  If maximum is reached, packets are dropped. LINUX   Socket buffer sizes  sysctl -w net.core.wmem_max=8388608  sysctl -w net.core.rmem_max=8388608  TCP Window sizes  sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"  sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608" Excellent book
  • 42. TCP window sizes  max data send/receive before acknowledgement  Subset of the TCP socket sizes Calculating = latency * throughput Ex, 1ms latency , 1Gb NIC = 1Gb/sec * 0.001s = 100Mb/sec * 1Byte/8bits= 125KB
  • 43. 2. c) Congestion window delta bytes bytes unack cong send us sent recvd bytes window window 31 1448 139760 144800 195200 33 1448 139760 144800 195200 29 1448 144104 146248 195200 31 / 0 145552 144800 195200 41 1448 145552 < 147696 195200 30 / 0 147000 > 144800 195200 22 1448 147000 76744 195200 Data collected with congestion window size Dtrace is drastically lowered But could get similar data from snoop (load data into wireshark)
  • 44. 3. NFS mount options a) Forcedirectio – be careful b) rsize , wsize - increase c) Actimeo=0, noac – don’t use (unless on RAC) Sun Solaris rw,bg,hard,rsize=32768,wsize=32768,vers=3,[forcedirectio or llock],nointr,proto=tcp,suid AIX rw,bg,hard,rsize=32768,wsize=32768,vers=3,cio,intr,timeo=600,proto=tcp HPUX rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,proto=tcp, suid, forcedirectio Linux rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp,actimeo=0
  • 45. 3 a) Forcedirectio  Skip UNIX cache  read directly into SGA  Controlled by init.ora  Filesystemio_options=SETALL (or directio)  Except HPUX Sun Forcedirectio – sets forces directio but not required Solaris Fielsystemio_options will set directio without mount option AIX HPUX Forcedirectio – only way to set directio Filesystemio_options has no affect Linux
  • 46. Direct I/O Example query 77951 physical reads , 2nd execution (ie when data should already be cached)  5 secs => no direct I/O  60 secs => direct I/O  Why use direct I/O?
  • 47. Oracle data block cache (SGA) Oracle data block cache (SGA) Unix file system Cache Unix file system Cache By default get both Direct I/O can’t use Unix cache
  • 48. Direct I/O  Advantages  Faster reads from disk  Reduce CPU  Reduce memory contention  Faster access to data already in memory, in SGA  Disadvantages  Less Flexible  Risk of paging , memory pressure  Can’t share memory between multiple databases http://blogs.oracle.com/glennf/entry/where_do_you_cache_oracle
  • 49. Oracle data block Oracle data block cache (SGA) cache (SGA) Oracle data block cache (SGA) Unix file system Cache Unix file system Cache Unix file system Cache 5 seconds 60 seconds 2 seconds
  • 50. SGA Buffer UNIX Cache Hits Cache Hits Miss = UNIX Cache < 0.2 ms File System Cache Hits Miss SAN Cache Hits = NAS/SAN < 0.5 ms SAN/NA S Storage Cache Disk Reads Hits Miss Disk Reads = Disk Reads ~ 6ms Hits
• 51. 3. b) ACTIMEO=0, NOAC
  - Disables the client-side file attribute cache
  - Increases the number of NFS calls => increases latency
  - Avoid on single-instance Oracle
  - One Metalink note says it is required on Linux (and calls it "actime"); another Metalink note says it should be taken off
  => Take it off
• 52. 3. c) rsize/wsize
  - NFS transfer buffer size
  - Oracle says use 32K, but platforms support higher values, which can significantly improve throughput
  Sun Solaris: rsize=32768,wsize=32768 - max is 1M
  AIX:         rsize=32768,wsize=32768 - max is 64K
  HPUX:        rsize=32768,wsize=32768 - max is 1M
  Linux:       rsize=32768,wsize=32768 - max is 1M
  On full table scans, using 1M halved the response time compared to 32K.
  db_file_multiblock_read_count has to be large enough to take advantage of the larger size.
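A sketch of remounting with the 1 MB maximum on Linux; the server, export, and mount point are placeholders, and AIX tops out at 64K without the patches on the next slide:

    # hypothetical remount with 1 MB NFS transfer sizes
    umount /u02/oradata
    mount -t nfs -o rw,bg,hard,rsize=1048576,wsize=1048576,vers=3,nointr,timeo=600,tcp \
        nfs_server:/export/oradata /u02/oradata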
• 53. Max rsize/wsize
  - 1M: Sun Solaris SPARC, HP/UX, Linux
  - 64K: AIX (patches available to increase it to 512K)
    http://www-01.ibm.com/support/docview.wss?uid=isg1IV24594
    http://www-01.ibm.com/support/docview.wss?uid=isg1IV24688
    http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=6100-08-00-1241
    http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=7100-02-00-1241
• 54. [Chart: NFS overhead, physical vs cached I/O - bars compare physical and cached reads for direct storage, NFS through a single switch, NFS through multiple switches, and NFS through routers; each bar is broken into storage (SAN), cache, and NFS + network read time.]
• 55. NFS Overhead: Physical vs Cached I/O
  - Physical read: 100 us of NFS/network overhead on top of ~6.0 ms of disk
  - Cached read: 100 us on top of ~0.1 ms -> about 2x slower
  - SAN cache is expensive -> use it for write cache
  - Target/client cache is cheaper -> add more of it
  Takeaway: move cache to the client boxes (cheaper, quicker) and save the SAN cache for writeback.
• 56. Advanced
  LINUX
  - RPC queues
  - iostat.py
  Solaris
  - Client max read size (32K) - boost it to 1M
  - NFS server default threads (16)
  - tcpdump / snoop analysis
• 57. Issues: LINUX RPC queue
  On Linux, in /etc/sysctl.conf set:
    sunrpc.tcp_slot_table_entries = 128
  then apply it:
    sysctl -p
  then check the setting:
    sysctl -A | grep sunrpc
  NFS partitions have to be unmounted and remounted for the change to take effect.
  Not persistent across reboot.
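Because the sunrpc module can load after sysctl.conf is processed, one common way to make the value stick across reboots is a module option; a sketch assuming a Red Hat-style layout (the file name under /etc/modprobe.d is arbitrary):

    # /etc/modprobe.d/sunrpc.conf - set the slot table size when the module loads
    options sunrpc tcp_slot_table_entries=128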
• 58. Linux tools: iostat.py
    $ ./iostat.py -1
    172.16.100.143:/vdb17 mounted on /mnt/external:

       op/s    rpc bklog
       4.20    0.00

    read:  ops/s    kB/s     kB/op    retrans     avg RTT (ms)   avg exe (ms)
           0.000    0.000    0.000    0 (0.0%)    0.000          0.000
    write: ops/s    kB/s     kB/op    retrans     avg RTT (ms)   avg exe (ms)
           0.000    0.000    0.000    0 (0.0%)    0.000          0.000
• 59. Solaris client max read size
  - nfs3_bsize defaults to 32K - set it to 1M
    mdb -kw
    > nfs3_bsize/D
    nfs3_bsize:     32768
    > nfs3_bsize/W 100000
    nfs3_bsize:     0xd = 0x100000
    >
• 60. Issues: Solaris NFS server threads
    sharectl get -p servers nfs
    sharectl set -p servers=512 nfs
    svcadm refresh nfs/server
• 61. Tcpdump analysis - where does the time go?
                          Linux (ms)   Solaris (ms)
    Oracle                    58           47
    NFS / TCP (client)         ?            ?
    Network                    ?            ?
    TCP / NFS (server)         ?            ?
    NFS server                 .1           2
    Cache / SAN
• 62. [Diagram: the Oracle / NFS / TCP / network stack, showing where the captures are taken - tcpdump on the client side of the network, snoop on the NFS server side.]
• 63. Wireshark : analyze TCP dumps
  - yum install wireshark
  - wireshark + perl scripts: find the NFS requests common to both captures (NFS client and NFS server)
  - display the times seen by the NFS client, by the NFS server, and the delta between them
  https://github.com/khailey/tcpdump
• 64. Collecting the captures
  On the NFS client (tcpdump):
    tcpdump -s 512 host server_machine -w client.cap
  On the NFS server (snoop):
    snoop -q -d aggr0 -s 512 -o server.cap client_machine
• 65. parsetcp.pl nfs_server.cap nfs_client.cap
    Parsing nfs server trace: nfs_server.cap
      type      avg ms    count
      READ :    44.60,    7731
    Parsing client trace: nfs_client.cap
      type      avg ms    count
      READ :    46.54,    15282
    ==================== MATCHED DATA ============
      READ      avg ms
      server :  48.39,
      client :  49.42,
      diff   :   1.03,
    Processed 9647 packets (Matched: 5624 Missed: 4023)
• 66. parsetcp.pl server.cap client.cap
    Parsing NFS server trace: server.cap
      type      avg ms    count
      READ :    1.17,     9042
    Parsing client trace: client.cap
      type      avg ms    count
      READ :    1.49,     21984
    ==================== MATCHED DATA ============
      READ      avg ms
      server :  1.03
      client :  1.49
      diff   :  0.46
• 67. Putting it together - latency at each layer
    Layer               Linux     Solaris   Tool / source
    Oracle              58 ms     47 ms     oramon.sh - "db file sequential read" wait (basically a timing of pread for 8k random reads)
    TCP (NFS client)    1.5 ms    45 ms     tcpdump on Linux, snoop on Solaris, parsed with tcpparse.sh
    Network             0.5 ms    1 ms      delta between client and server captures
    TCP (NFS server)    1 ms      44 ms     tcpparse.sh on the server-side snoop / tcpdump
    NFS server          .1 ms     2 ms      DTrace: nfs:::op-read-start / nfs:::op-read-done
• 68. netperf.sh
    mss: 1448
    local_recv_size  (beg,end): 128000   128000
    local_send_size  (beg,end): 49152    49152
    remote_recv_size (beg,end): 87380    3920256
    remote_send_size (beg,end): 16384    16384

    mn_ms av_ms max_ms s_KB r_KB r_MB/s s_MB/s <100u <500u <1ms <5ms <10ms <5
    .08 .12 10.91 15.69 83.92 .33 .38 .01 .01 .12 .54
    .10 .16 12.25 8 48.78 99.10 .30 .82 .07 .08 .15 .57
    .10 .14 5.01 8 54.78 99.04 .88 .96 .15 .60
    .22 .34 63.71 128 367.11 97.50 1.57 2.42 .06 .07 .01 .35 .93
    .43 .60 16.48 128 207.71 84.86 11.75 15.04 .05 .10 .90 1.42
    .99 1.30 412.42 1024 767.03 .05 99.90 .03 .08 .03 1.30 2.25
    1.77 2.28 15.43 1024 439.20 99.27 .64 .73 2.65 5.35
• 69. Summary
  Test (a quick pre-flight sketch follows this list)
  - traceroute : routers and latency
  - netio : throughput
  - NFS is close to FC if the network topology is clean
  Mount
  - rsize/wsize at the maximum
  - Avoid actimeo=0 and noac
  - Use noatime
  - Jumbo frames
  Drawbacks
  - Requires a clean topology
  - NFS failover can take tens of seconds
  - With the Oracle 11g dNFS client, mount issues are handled transparently
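A minimal sketch of the topology check mentioned above; the host name is a placeholder, and the goal is simply to see how many hops sit between the database host and the NFS server and what the round-trip latency looks like:

    # routers between DB host and NFS server, with per-hop latency
    traceroute nfs_server

    # round-trip latency to the filer
    ping -c 5 nfs_server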
• 70. Conclusion: Give NFS some love
  A gigabit Ethernet switch for NFS can be anywhere from 10 to 50 times cheaper than an FC switch.
  http://github.com/khailey
• 71. dtrace
  List the names of traceable probes:
    dtrace -ln provider:module:function:name
      -l = list instead of enable probes
      -n = specify probe name to trace or list
      -v = set verbose mode
  Example:
    dtrace -ln tcp:::send

    $ dtrace -lvn tcp:::receive
      5473 tcp ip tcp_output send
        Argument Types
          args[0]: pktinfo_t *
          args[1]: csinfo_t *
          args[2]: ipinfo_t *
          args[3]: tcpsinfo_t *
          args[4]: tcpinfo_t *
• 74. DTrace: tcp:::send, tcp:::receive

    tcp:::send
    {
        delta = timestamp - walltime;
        walltime = timestamp;
        printf("%6d %6d %6d %8d %8s %8d %8d %8d %d\n",
            args[3]->tcps_snxt - args[3]->tcps_suna,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            delta / 1000,
            args[2]->ip_plength - args[4]->tcp_offset,
            "",
            args[3]->tcps_swnd,
            args[3]->tcps_rwnd,
            args[3]->tcps_cwnd,
            args[3]->tcps_retransmit);
    }

    tcp:::receive
    {
        delta = timestamp - walltime;
        walltime = timestamp;
        printf("%6d %6d %6d %8s / %-8d %8d %8d %8d %d\n",
            args[3]->tcps_snxt - args[3]->tcps_suna,
            args[3]->tcps_rnxt - args[3]->tcps_rack,
            delta / 1000,
            "",
            args[2]->ip_plength - args[4]->tcp_offset,
            args[3]->tcps_swnd,
            args[3]->tcps_rwnd,
            args[3]->tcps_cwnd,
            args[3]->tcps_retransmit);
    }
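A minimal way to run it (the file name tcp_watch.d is arbitrary); per packet it prints the unacknowledged send bytes, the time since the previous packet in microseconds, the payload size, and the send, receive, and congestion windows, i.e. the raw data behind the table on slide 43:

    # save the two clauses above as tcp_watch.d, then run as root:
    dtrace -q -s tcp_watch.d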
• 77. Solaris TCP and NFS settings
    ndd -set /dev/tcp tcp_max_buf 8388608
    ndd -set /dev/tcp tcp_recv_hiwat 4194304
    ndd -set /dev/tcp tcp_xmit_hiwat 4194304
    ndd -set /dev/tcp tcp_cwnd_max 8388608

    mdb -kw
    > nfs3_bsize/D
    nfs3_bsize:     32768
    > nfs3_bsize/W 100000
    nfs3_bsize:     0xd = 0x100000
    >

  Add it to /etc/system so it survives a reboot:
    set nfs:nfs3_bsize = 0x100000
• 78. Direct I/O challenges
  [Chart: database cache usage over 24 hours for DB1 (Europe), DB2 (US), and DB3 (Asia).]
  With direct I/O, memory cannot be shared between multiple databases (slide 48), which becomes a challenge as each database's peak cache usage shifts around the clock.

Editor's Notes

  2. http://www.strypertech.com/index.php?option=com_content&amp;view=article&amp;id=101&amp;Itemid=172
  3. http://www.srbconsulting.com/fibre-channel-technology.htm
  4. http://www.faqs.org/photo-dict/phrase/1981/ethernet.html
  5. http://gizmodo.com/5190874/ethernet-cable-fashion-show-looks-like-a-data-center-disaster
6. http://www.zdnetasia.com/an-introduction-to-enterprise-data-storage-62010137.htm
   A 16-port gigabit switch can be anywhere from 10 to 50 times cheaper than an FC switch and is far more familiar to a network engineer. Another benefit to iSCSI is that because it uses TCP/IP, it can be routed over different subnets, which means it can be used over a wide area network for data mirroring and disaster recovery.
  7. http://www.freewebs.com/skater4life38291/Lamborghini-I-Love-Speed.jpg
8. 4Gb/s / 8b = 500 MB/s = 500 KB/ms = 50 KB/100us = 10K/20us; an 8K datablock (say 10K) = 20us.
   10Gb/s / 8b = 1250 MB/s = 1250 KB/ms = 125 KB/100us = 10K/8us; an 8K block (say 10K) = 8us.
   http://www.unifiedcomputingblog.com/?p=108
   1, 2, 4, and 8 Gb Fibre Channel all use 8b/10b encoding. Meaning, 8 bits of data gets encoded into 10 bits of transmitted information – the two bits are used for data integrity. Well, if the link is 8Gb, how much do we actually get to use for data, given that 2 out of every 10 bits aren't "user" data? FC link speeds are somewhat of an anomaly, given that they're actually faster than the stated link speed would suggest. Original 1Gb FC is actually 1.0625Gb/s, and each generation has kept this standard and multiplied it. 8Gb FC would be 8 x 1.0625, or actual bandwidth of 8.5Gb/s. 8.5 * .80 = 6.8. 6.8Gb of usable bandwidth on an 8Gb FC link. So 8Gb FC = 6.8Gb usable, while 10Gb Ethernet = 9.7Gb usable. FC = 850MB/s, 10GbE = 1212MB/s.
   With FCoE, you're adding about 4.3% overhead over a typical FC frame. A full FC frame is 2148 bytes, a maximum sized FCoE frame is 2240 bytes. So in other words, the overhead hit you take for FCoE is significantly less than the encoding mechanism used in 1/2/4/8G FC. FCoE/Ethernet headers take a 2148 byte full FC frame and encapsulate it into a 2188 byte Ethernet frame – a little less than 2% overhead. So if we take 10Gb/s, knock down for the 64/66b encoding (10 * .9696 = 9.69Gb/s), then take off another 2% for Ethernet headers (9.69 * .98), we're at 9.49 Gb/s usable.
9. http://virtualgeek.typepad.com/virtual_geek/2009/06/why-fcoe-why-not-just-nas-and-iscsi.html
   NFS is great and iSCSI is great, but there's no getting away from the fact that they depend on TCP retransmission mechanics (and in the case of NFS, potentially even higher in the protocol stack if you use it over UDP – though this is not supported in VMware environments today). Because of the intrinsic model of the protocol stack, the higher you go, the longer the latencies in various operations. One example (and it's only one): this means always seconds, and normally many tens of seconds, for state/loss of connection (assuming that the target fails over instantly, which is not the case for most NAS devices). Doing it in shorter timeframes would be BAD, as in this case the target is an IP, and for an IP address to be non-reachable for seconds is NORMAL.
  10. http://www.cisco.com/en/US/tech/tk389/tk689/technologies_tech_note09186a00800a7af3.shtml
  11. http://www.flickr.com/photos/annann/3886344843/in/set-72157621298804450