Collaborate nfs kyle_final

NFS Tuning for Oracle

Kyle Hailey
http://dboptimizer.com April 2013

Intro
 Who am I
 why NFS interests me
 DAS / NAS / SAN
 Throughput
 Latency
 NFS configuration issues*
 Network topology
 TCP config
 NFS Mount Options

* for non-RAC, non-dNFS

Who is Kyle Hailey

 1990 Oracle
 90 support
 92 Ported v6
 93 France
 95 Benchmarking
 98 ST Real World Performance

 2000 Dot.Com
 2001 Quest
 2002 Oracle OEM 10g Success!
First successful OEM design

Who is Kyle Hailey

 1990 Oracle
 90 support
 92 Ported v6
 93 France
 95 Benchmarking

 2000 Dot.Com
 2001 Quest
 2002 Oracle OEM 10g
 2005 Embarcadero
 DB Optimizer

Who is Kyle Hailey

 1990 Oracle
 90 support
 92 Ported v6
 93 France
 95 Benchmarking

 2000 Dot.Com
 2001 Quest
 2002 Oracle OEM 10g
 2005 Embarcadero
 DB Optimizer
 Delphix

When not being a Geek
- Have a little 4 year old boy who takes up all my time

Typical Architecture

Production Development QA UAT

Instanc Instanc Instanc Instanc
e e e e

Database Database Database Database

File system File system File system File system

Database Virtualization

Source Clones
Production Development QA UAT

Instance
Instance Instance Instance

NFS vDatabas vDatabas vDatabas
Database
e e e

File system

Fiber Channel

NFS is attractive but is it fast enough?

DAS vs NAS vs SAN

attach Agile Cheap maintenanc spee
e d
DA SCSI no yes difficult fast
S
NA NFS - yes yes easy ??
S Ethernet
SA Fibre yes no difficult fast
N Channel

speed

Ethernet
• 100Mb 1994
• 1GbE - 1998
• 10GbE – 2003
• 40GbE – 2012
• 100GE –est. 2013

Fibre Channel
• 1G 1998
• 2G 2003
• 4G – 2005
• 8G – 2008
• 16G – 2011

Throughput
8 bits = 1 Byte

 1GbE ~= 100 MB/sec
 30-60MB/sec typical, single threaded, mtu 1500
 90-115MB clean topology
 125MB/sec max
 10GbE ~= 1000 MB/sec

Throughput
~=
size of pipe

Throughput : netio

Server Client
Throughput
~=
size of pipe

Server
netio –s -t
Client
netio –t server
Packet size 1k bytes: 51.30 MByte/s Tx, 6.17 MByte/s Rx.

Routers: traceroute

$ traceroute 172.16.100.144
1 172.16.100.144 (172.16.100.144) 0.226 ms 0.171 ms 0.123 ms

$ traceroute 101.88.123.195
1 101.88.229.181 (101.88.229.181) 0.761 ms 0.579 ms 0.493 ms
2 101.88.255.169 (101.88.255.169) 0.310 ms 0.286 ms 0.279 ms
3 101.88.218.166 (101.88.218.166) 0.347 ms 0.300 ms 0.986 ms
4 101.88.123.195 (101.88.123.195) 1.704 ms 1.972 ms 1.263 ms
Tries each step 3 times
Last line should be the total latency

latency
~=
Length of pipe

Wire Speed – where is the hold up?

ms us ns
0.000 000 000
Wire 5ns/m
Light = 0.3m/ns RAM
Wire ~= 0.2m/ns
Physical 8K random disk read ~= 7
Wire ~= 5us/km
ms
Physical small sequential write ~= 1
Data Center 10m = 50ns ms

LA to London = 30ms
LA to SF = 3ms

4G FC vs 10GbE

Why would FC be faster?
8K block transfer times

 8GB FC = 10us

 10G Ethernet = 8us

More stack more
latency
Oracle
NFS
TCP
NIC

Network NFS
Not FC

NIC
TCP
NFS
Filesystem
Cache/spindl
e

Oracle and SUN benchmark

200us overhead more for NFS
• 350us without Jumbo frames
• 1ms because of network topology

8K blocks 1GbE with Jumbo Frames , Solaris 9, Oracle 9.2
Database Performance with NAS: Optimizing Oracle on NFS
Revised May 2009 | TR-3322
http://media.netapp.com/documents/tr-3322.pdf

8K block NFS latency overhead

8k wire transfer latency
 1 GbE -> 80 us
 10 GbE -> 8 us

Upgrading 1GbE to 10GbE
 80 us – 8 us = 72 us faster 8K transfer

8K NFS on 1GbE was 200us, so on 10GbE should be
 200 -72 = 128us

(0.128 ms/7 ms) * 100 = 2% latency increase over
DAS

NFS why the bad reputation?

 Given 2 % overhead why the reputation?
 Historically slower
 Bad Infrastructure
1. Network topology and load
2. NFS mount options
3. TCP configuration
 Compounding issues
 Oracle configuration
 I/O subsystem response

Performance issues
1. Network
a) Topology
b) NICs and mismatches
c) Load
2. TCP
a) MTU “Jumbo Frames”
b) window
c) congestion window

1. a) Network Topology

 Hubs
 Routers
 Switches

HUBs - no HUBs allowed
Layer Name Level Device
7 Application
6 Presentation
5 Session
4 Transport

3 Network IP addr Routers
2 Datalink mac addr Switches
1 Physical Wire Hubs
• Broadcast, repeaters
• Risk collisions
• Bandwidth contention

Multiple Switches – generally fast

 Two types of Switches
 Store and Forward
 1GbE 50-70us
 10GbE 5-35us
 Cut through
 10GbE .3-.5us

Laye Name
r
3 Network Routers IP addr
2 Datalink Switches mac
addr
1 Physical Hubs Wire

Routers - as few as possible
 Routers can add 300-500us latency
 If NFS latency is 350us (typical non-tuned) system
 Then each router multiplies latency 2x, 3x, 4x etc

Layer Name

3 Network IP addr Routers

2 Datalink mac Switches
addr
1 Physical Wire Hubs

Every router adds
~ 0.5 ms
6 routers = 6 * .5 =
10
3ms
8
Network Overhead
6
NFS
6.0 ms avg I/O 4 physical

2

0
Fast 0.2ms Slow 3ms

1. b) NICs and Hardware mismatch

NICs
 10GbE
 1GbE
 Intel (problems?)
 Broadcomm (ok?)

Hardware Mismatch: Speeds and duplex are often
negotiated
$ ethtool eth0
Settings for eth0:
Example Linux: Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Half

Check that values are as expected

Hardware mismatch

Full Duplex - two way simultaneous GOOD

Half Duplex - one way but both lines BAD

1. c) Busy Network
 Traffic can congest network
 Caused drop packets and retransmissions
 Out of order packets
 Collisions on hubs, probably not with switches
 Analysis
 netstat
 netio

Busy Network Monitoring

 Visibility difficult from any one machine
 Client
 Server
 Switch(es)
$ netstat -s -P tcp 1
TCP tcpRtoAlgorithm = 4 tcpRtoMin = 400
tcpRetransSegs = 5986 tcpRetransBytes = 8268005
tcpOutAck =49277329 tcpOutAckDelayed = 473798
tcpInDupAck = 357980 tcpInAckUnsent = 0
tcpInUnorderSegs =10048089 tcpInUnorderBytes =16611525
tcpInDupSegs = 62673 tcpInDupBytes =87945913
tcpInPartDupSegs = 15 tcpInPartDupBytes = 724
tcpRttUpdate = 4857114 tcpTimRetrans = 1191
tcpTimRetransDrop = 6 tcpTimKeepalive = 248

Busy Network Testing
Netio is available here:
http://www.ars.de/ars/ars.nsf/docs/netio
On Server box
netio –s –t

On Target box:
netio -t server_machine

NETIO - Network Throughput Benchmark, Version 1.31
(C) 1997-2010 Kai Uwe Rommel
TCP server listening.
TCP connection established ...
Receiving from client, packet size 32k ... 104.37 MByte/s
Sending to client, packet size 32k ... 109.27 MByte/s
Done.

2. TCP Configuration

a) MTU (Jumbo Frames)
b) TCP window
c) TCP congestion window

2. a) MTU 9000 : Jumbo Frames

 MTU – maximum Transfer Unit
 Typically 1500
 Can be set 9000
 All components have to support
 Else error and/or hangs

Jumbo Frames : MTU 9000

8K block transfer
Test 1 Test 2
Default MTU 1500 Now with MTU 900
delta send recd delta send recd
<-- 164 <-- 164
152 132 --> 273 8324 -->
40 1448 -->
67 1448 --> Change MTU
66 1448 --> # ifconfig eth1 mtu 9000 up
53 1448 -->
87 1448 --> Warning: MTU 9000 can hang
95 952 --> if any of the hardware in the
= 560 connection is configured only
for MTU 1500

2. b) TCP Socket Buffer
 Set max
 Socket Buffer
 TCP Window
 If maximum is reached, packets are dropped.

LINUX

 Socket buffer sizes
 sysctl -w net.core.wmem_max=8388608
 sysctl -w net.core.rmem_max=8388608
 TCP Window sizes
 sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
 sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"

Excellent book

TCP window sizes

 max data send/receive before acknowledgement
 Subset of the TCP socket sizes
Calculating

= latency * throughput

Ex, 1ms latency , 1Gb NIC

= 1Gb/sec * 0.001s = 100Mb/sec * 1Byte/8bits= 125KB

2. c) Congestion window
delta bytes bytes unack cong send
us sent recvd bytes window window

31 1448 139760 144800 195200
33 1448 139760 144800 195200
29 1448 144104 146248 195200
31 / 0 145552 144800 195200
41 1448 145552 < 147696 195200
30 / 0 147000 > 144800 195200
22 1448 147000 76744 195200
Data collected with congestion window size
Dtrace is drastically lowered
But could get similar data from
snoop (load data into
wireshark)


a) Forcedirectio – be careful
b) rsize , wsize - increase
c) Actimeo=0, noac – don’t use (unless on RAC)

Sun Solaris rw,bg,hard,rsize=32768,wsize=32768,vers=3,[forcedirectio or llock],nointr,proto=tcp,suid

AIX rw,bg,hard,rsize=32768,wsize=32768,vers=3,cio,intr,timeo=600,proto=tcp

HPUX rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,proto=tcp, suid, forcedirectio

Linux rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp,actimeo=0

3 a) Forcedirectio

 Skip UNIX cache
 read directly into SGA
 Controlled by init.ora
 Filesystemio_options=SETALL (or directio)
 Except HPUX

Sun Forcedirectio – sets forces directio but not required
Solaris
Fielsystemio_options will set directio without mount option
AIX

HPUX Forcedirectio – only way to set directio

Filesystemio_options has no affect
Linux

Direct I/O

Example query

77951 physical reads , 2nd execution
(ie when data should already be cached)

 5 secs => no direct I/O
 60 secs => direct I/O

 Why use direct I/O?

Oracle data block cache (SGA) Oracle data block cache (SGA)

Unix file system Cache Unix file system Cache

By default get both Direct I/O can’t use Unix cache

Direct I/O

 Advantages
 Faster reads from disk
 Reduce CPU
 Reduce memory contention
 Faster access to data already in memory, in SGA
 Disadvantages
 Less Flexible
 Risk of paging , memory pressure
 Can’t share memory between multiple databases

http://blogs.oracle.com/glennf/entry/where_do_you_cache_oracle

Oracle data block Oracle data block
cache (SGA) cache (SGA)
Oracle data block
cache (SGA)

Unix file system Cache Unix file system Cache
Unix file system
Cache

5 seconds 60 seconds 2 seconds

SGA Buffer
UNIX Cache Hits
Cache
Hits Miss
= UNIX Cache
< 0.2 ms File System Cache
Hits Miss

SAN Cache Hits
= NAS/SAN
< 0.5 ms
SAN/NA
S
Storage Cache
Disk Reads Hits Miss
Disk Reads
= Disk Reads
~ 6ms
Hits

3. b)ACTIMEO=0 , NOAC

 Disable client side file attribute cache
 Increases NFS calls
 increases latency
 Avoid on single instance Oracle
 Metalink says it’s required on LINUX (calls it “actime”)
 Another metalink it should be taken off

=> It should be taken off

3. c)rsize/wsize

 NFS transfer buffer size
 Oracle says use 32K
 Platforms support higher values and can significantly
impact throughput
Sun rsize=32768,wsize=32768 , max is 1M
Solaris
AIX rsize=32768,wsize=32768 , max is 64K

HPUX rsize=32768,wsize=32768 , max is 1M

Linux rsize=32768,wsize=32768 , max is 1M

On full table scans using 1M has halved the response time over 32K
Db_file_multiblock_read_count has to large enough take advantage of the
size

Max rsize , wsize

 1M
 Sun Solaris Sparc
 HP/UX
 Linux
 64K
 AIX
 Patch to increase to 512
 http://www-01.ibm.com/support/docview.wss?uid=isg1IV24594
 http://www-01.ibm.com/support/docview.wss?uid=isg1IV24688

 http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=6100-08-00-1241
 http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=7100-02-00-1241

NFS Overhead Physical vs Cached IO

Storage
(SAN)
9

8

Cache 7

6

5

4 NFS + network
reads
3

Physic 2

al 1

0
phys vs NFS phys multi phys routers cached vs nfs cached vs cached vs
switch switches routers

NFS Overhead Physical vs Cached IO

100 us + 6.0 ms
100 us + 0.1 ms -> 2x slower

SAN cache is expensive -> use it for write
cache
Target cache is cheaper –> add more
Takeaway:

Move cache to client boxes
• cheaper
• Quicker

save SAN cache for writeback

Advanced
 LINUX
 Rpc queues
 Iostat.py
 Solaris
 Client max read size 32k , boost it 1M
 NFS server default threads 16
 Tcpdump

Issues: LINUX rpc queue

On LINUX, in /etc/sysctl.conf modify

sunrpc.tcp_slot_table_entries = 128

then do

sysctl -p

then check the setting using

sysctl -A | grep sunrpc

NFS partitions will have to be unmounted and remounted
Not persistent across reboot

Linux tools: iostat.py

$ ./iostat.py -1

172.16.100.143:/vdb17 mounted on /mnt/external:

op/s rpc bklog
4.20 0.00

read: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms)
0.000 0.000 0.000 0 (0.0%) 0.000 0.000
write: ops/s kB/s kB/op retrans avg RTT (ms) avg exe (ms)
0.000 0.000 0.000 0 (0.0%) 0.000 0.000

Solaris Client max read size
 nfs3_bsize
 Defaults to 32K
 Set it to 1M

mdb -kw
> nfs3_bsize/D
nfs3_bsize: 32768
> nfs3_bsize/W 100000

nfs3_bsize: 0xd = 0x1000
00
>

Issues: Solaris NFS Server threads

sharectl get -p servers nfs

sharectl set -p servers=512 nfs
svcadm refresh nfs/server

Tcpdump
Analysis
LINUX Solaris
ms ms

Oracl Oracle
e
58 47
NFS
NFS
? ? TCP
/TCP
Network ? ?
Network
TCP/NF
? ?
S
TCP

NFS NFS
.1 2
server Cache/SAN

Oracle

snoop / tcpdump TCP

Network
snoop
TCP

NFS

Wireshark : analyze TCP dumps
 yum install wireshark
 wireshark + perl
 find common NFS requests
 NFS client
 NFS server
 display times for
 NFS Client
 NFS Server
 Delta

https://github.com/khailey/tcpdump

tcpdump -s 512 host server_machine -w client.cap

Oracle

snoop / tcpdump TCP

Network
snoop
TCP

NFS

snoop -q -d aggr0 -s 512 –o server.cap client_machine

parsetcp.pl nfs_server.cap nfs_client.cap
Parsing nfs server trace: nfs_server.cap
type avg ms count
READ : 44.60, 7731

Parsing client trace: nfs_client.cap
type avg ms count
READ : 46.54, 15282

==================== MATCHED DATA ============
READ
type avg ms
server : 48.39,
client : 49.42,
diff : 1.03,
Processed 9647 packets (Matched: 5624 Missed: 4023)

parsetcp.pl server.cap client.cap
Parsing NFS server trace: server.cap
type avg ms count
READ : 1.17, 9042

Parsing client trace: client.cap
type avg ms count
READ : 1.49, 21984

==================== MATCHED DATA ============
READ
type avg ms count
server : 1.03
client : 1.49
diff : 0.46

Linux Solaris tool source
“db file sequential
read” wait (which is
Oracle 58 ms 47 ms oramon.sh basically a timing of
“pread” for 8k random
Oracle reads specifically
NFS tcpdump on
NFS tcpparse.s
TCP 1.5 45 ms LINUX snoop
Client h on Solaris

network 0.5 1 ms Delta
Network

tcpparse.s
TCP Server 1 ms 44 ms h snoop

NFS dtrace nfs:::op-read-
.1 ms 2 ms DTrace start/op-read-done
NFS Server

netperf.sh

mss: 1448
local_recv_size (beg,end): 128000 128000
local_send_size (beg,end): 49152 49152
remote_recv_size (beg,end): 87380 3920256
remote_send_size (beg,end): 16384 16384

mn_ms av_ms max_ms s_KB r_KB r_MB/s s_MB/s <100u <500u <1ms <5ms <10ms <5
.08 .12 10.91 15.69 83.92 .33 .38 .01 .01 .12 .54
.10 .16 12.25 8 48.78 99.10 .30 .82 .07 .08 .15 .57
.10 .14 5.01 8 54.78 99.04 .88 .96 .15 .60
.22 .34 63.71 128 367.11 97.50 1.57 2.42 .06 .07 .01 .35 .93
.43 .60 16.48 128 207.71 84.86 11.75 15.04 .05 .10 .90 1.42
.99 1.30 412.42 1024 767.03 .05 99.90 .03 .08 .03 1.30 2.25
1.77 2.28 15.43 1024 439.20 99.27 .64 .73 2.65 5.35

Summary
 Test
 Traceroute : routers and latency
 Netio : throughput
 NFS close to FC , if Network topology clean
 Mount
 Rsize/wsize at maximum,
 Avoid actimeo=0 and noac
 Use noactime
 Jumbo Frames
 Drawbacks
 Requires clean topology
 NFS failover can take 10s of seconds
 Oracle 11g dNFS client mount issues handled
transparently

Conclusion: Give NFS some love

NFS

gigabit switch can be anywhere from
10 to 50 times cheaper than an FC switch

http://github.com/khailey

dtrace

List the names of traceable probes:

dtrace –ln provider:module:function:name

• -l = list instead of enable probes Example
• -n = Specify probe name to trace or list dtrace –ln tcp:::send
• -v = Set verbose mode $ dtrace -lvn tcp:::receive
5473 tcp ip tcp_output send

Argument Types
args[0]: pktinfo_t *
args[1]: csinfo_t *
args[2]: ipinfo_t *
args[3]: tcpsinfo_t *
args[4]: tcpinfo_t *

http://cvs.opensolaris.org/source/

Dtrace
tcp:::send, tcp:::receive
{ delta= timestamp-walltime;
walltime=timestamp;
printf("%6d %6d %6d %8d %8s %8d %8d %8d %8d %d n",
args[3]->tcps_snxt - args[3]->tcps_suna ,
args[3]->tcps_rnxt - args[3]->tcps_rack,
delta/1000,
args[2]->ip_plength - args[4]->tcp_offset,
"",
args[3]->tcps_swnd,
args[3]->tcps_rwnd,
args[3]->tcps_cwnd,
args[3]->tcps_retransmit
);
}
tcp:::receive
{ delta=timestamp-walltime;
walltime=timestamp;
printf("%6d %6d %6d %8s / %-8d %8d %8d %8d %8d %d n",
args[3]->tcps_snxt - args[3]->tcps_suna ,
args[3]->tcps_rnxt - args[3]->tcps_rack,
delta/1000,
"",
args[2]->ip_plength - args[4]->tcp_offset,
args[3]->tcps_swnd,
args[3]->tcps_rwnd,
args[3]->tcps_cwnd,
args[3]->tcps_retransmit
);
}

Solaris

 ndd -set /dev/tcp tcp_max_buf 8388608
 ndd -set /dev/tcp tcp_recv_hiwat 4194304
 ndd -set /dev/tcp tcp_xmit_hiwat 4194304
 ndd -set /dev/tcp tcp_cwnd_max 8388608

mdb -kw
> nfs3_bsize/D
nfs3_bsize: 32768
> nfs3_bsize/W 100000

nfs3_bsize: 0xd = 0x100000
>

add it to /etc/system for use on reboot

set nfs:nfs3_bsize= 0x100000

Direct I/O Challenges

Database Cache usage over 24 hours

DB1 DB2 DB3
Europe US Asia

Collaborate nfs kyle_final

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Collaborate nfs kyle_final

Similar to Collaborate nfs kyle_final (20)

More from Kyle Hailey

More from Kyle Hailey (20)

Collaborate nfs kyle_final

Editor's Notes