1. NFS Tuning for Oracle
Kyle Hailey
http://dboptimizer.com April 2013
2. Intro
Who am I
Why NFS interests me
DAS / NAS / SAN
Throughput
Latency
NFS configuration issues*
Network topology
TCP config
NFS Mount Options
* for non-RAC, non-dNFS
5. Who is Kyle Hailey
1990 Oracle
90 Support
92 Ported v6
93 France
95 Benchmarking
98 ST Real World Performance
2000 Dot-Com
2001 Quest
2002 Oracle OEM 10g - first successful OEM design
2005 Embarcadero - DB Optimizer
Delphix
When not being a geek
- Have a little 4-year-old boy who takes up all my time
6. Typical Architecture
Production, Development, QA, UAT - each environment carries its own full stack:
Instance
Database
File system
7. Database Virtualization
Source: Production - Instance, Database, File system on Fibre Channel
Clones: Development, QA, UAT - each an Instance on a vDatabase, mounted over NFS
14. DAS vs NAS vs SAN
       Attach           Agile   Cheap   Maintenance   Speed
DAS    SCSI             no      yes     difficult     fast
NAS    NFS - Ethernet   yes     yes     easy          ??
SAN    Fibre Channel    yes     no      difficult     fast
20. Routers: traceroute
$ traceroute 172.16.100.144
1 172.16.100.144 (172.16.100.144) 0.226 ms 0.171 ms 0.123 ms
$ traceroute 101.88.123.195
1 101.88.229.181 (101.88.229.181) 0.761 ms 0.579 ms 0.493 ms
2 101.88.255.169 (101.88.255.169) 0.310 ms 0.286 ms 0.279 ms
3 101.88.218.166 (101.88.218.166) 0.347 ms 0.300 ms 0.986 ms
4 101.88.123.195 (101.88.123.195) 1.704 ms 1.972 ms 1.263 ms
Each hop is probed 3 times
The last line should show the total latency
latency ~= length of the pipe
21. Wire Speed - where is the hold up?
Light = 0.3 m/ns; a signal in wire ~= 0.2 m/ns (5 ns/m, i.e. 5 us/km)
Across a data center, 10 m = 50 ns
LA to SF = 3 ms
LA to London = 30 ms
For scale (ms / us / ns): RAM access is in the ns range, while
a physical 8K random disk read ~= 7 ms
and a physical small sequential write ~= 1 ms
=> the wire itself is rarely the hold-up; physical I/O is
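As a sanity check, the rule-of-thumb figures above can be reproduced with quick arithmetic (a sketch; the ~600 km LA-SF wire distance is an assumption):
$ echo "scale=1; 600 * 5 / 1000" | bc    # LA to SF: ~600 km * 5 us/km ~= 3 ms one way
3.0
$ echo "10 * 5" | bc                     # across a 10 m data center: 10 m * 5 ns/m = 50 ns
50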
22. 4Gb FC vs 10GbE
Why would FC be faster?
8K block transfer times:
8Gb FC = 10 us
10GbE = 8 us
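The arithmetic behind those figures (spelled out in the speaker notes at the end of this deck); treating an 8K block as roughly 10 KB on the wire is the deck's own approximation:
#  4Gb FC :  4 Gb/s / 8 = 500 MB/s  = 0.5 KB/us   -> 10 KB takes ~20 us
#  8Gb FC :  8 Gb/s / 8 = 1000 MB/s = 1.0 KB/us   -> 10 KB takes ~10 us
#  10GbE  : 10 Gb/s / 8 = 1250 MB/s = 1.25 KB/us  -> 10 KB takes ~8 us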
23. More stack, more latency
NFS (not FC) runs through the full network stack:
Oracle
NFS
TCP
NIC
Network
NIC
TCP
NFS
Filesystem
Cache/spindle
24. Oracle and Sun benchmark
200 us additional overhead for NFS
- 350 us without Jumbo Frames
- 1 ms because of network topology
8K blocks, 1GbE with Jumbo Frames, Solaris 9, Oracle 9.2
"Database Performance with NAS: Optimizing Oracle on NFS"
Revised May 2009 | TR-3322
http://media.netapp.com/documents/tr-3322.pdf
25. 8K block NFS latency overhead
8K wire transfer latency:
1GbE -> 80 us
10GbE -> 8 us
Upgrading 1GbE to 10GbE:
80 us - 8 us = 72 us faster per 8K transfer
8K NFS on 1GbE was 200 us, so on 10GbE it should be
200 - 72 = 128 us
(0.128 ms / 7 ms) * 100 ~= 2% latency increase over DAS
26. NFS why the bad reputation?
Given only ~2% overhead, why the bad reputation?
Historically slower
Bad Infrastructure
1. Network topology and load
2. NFS mount options
3. TCP configuration
Compounding issues
Oracle configuration
I/O subsystem response
27. Performance issues
1. Network
a) Topology
b) NICs and mismatches
c) Load
2. TCP
a) MTU “Jumbo Frames”
b) window
c) congestion window
3. NFS mount options
29. Hubs - no hubs allowed
Layer   Name           Level      Device
7       Application
6       Presentation
5       Session
4       Transport
3       Network        IP addr    Routers
2       Datalink       MAC addr   Switches
1       Physical       Wire       Hubs
Hubs:
- broadcast, repeaters
- risk of collisions
- bandwidth contention
30. Multiple switches - generally fast
Two types of switches:
Store and forward
1GbE: 50-70 us
10GbE: 5-35 us
Cut-through
10GbE: 0.3-0.5 us
Layer   Name
3       Network    IP addr    Routers
2       Datalink   MAC addr   Switches
1       Physical   Wire       Hubs
31. Routers - as few as possible
Routers can add 300-500 us latency each
If NFS latency is 350 us (a typical non-tuned system),
then each router multiplies latency: 2x, 3x, 4x, etc.
Layer   Name
3       Network    IP addr    Routers
2       Datalink   MAC addr   Switches
1       Physical   Wire       Hubs
32. Every router adds ~0.5 ms
6 routers = 6 * 0.5 = 3 ms
[Bar chart, avg I/O latency in ms (0-10): ~4 ms of physical I/O plus NFS and network overhead;
fast network (0.2 ms overhead) vs slow network (3 ms overhead) => ~6.0 ms avg I/O on the slow side]
33. 1. b) NICs and hardware mismatch
NICs:
10GbE
1GbE
Intel (problems?)
Broadcom (ok?)
Hardware mismatch: speeds and duplex are often negotiated.
Example on Linux - check that the values are as expected:
$ ethtool eth0
Settings for eth0:
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Half
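If the negotiated values are wrong, they can be forced with ethtool (a sketch; eth0 and the 1GbE values are assumptions, and fixing auto-negotiation at the switch is often the better cure):
# force 1GbE full duplex on eth0
# ethtool -s eth0 speed 1000 duplex full autoneg off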
34. Hardware mismatch
Full duplex - two-way simultaneous: GOOD
Half duplex - one direction at a time: BAD
35. 1. c) Busy network
Traffic can congest the network, causing:
dropped packets and retransmissions
out-of-order packets
collisions on hubs (probably not with switches)
Analysis:
netstat
netio
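Retransmissions are a quick first check; on Linux, for example:
$ netstat -s | grep -i retrans    # counts that climb steadily under load point to congestion or loss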
39. 2. a) MTU 9000: Jumbo Frames
MTU - Maximum Transmission Unit
Typically 1500
Can be set to 9000
All components in the path have to support it,
else errors and/or hangs
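The full path can be verified before cutting over with a don't-fragment ping (a sketch; 8972 = 9000 minus 28 bytes of IP + ICMP headers, Linux ping syntax, nfs_server is a placeholder):
$ ping -M do -s 8972 nfs_server    # fails if any hop in the path is stuck at MTU 1500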
40. Jumbo Frames: MTU 9000
8K block transfer

Test 1: default MTU 1500         Test 2: now with MTU 9000
delta   send   recd              delta   send   recd
        <-- 164                          <-- 164
152     132  -->                 273     8324 -->
40      1448 -->
67      1448 -->
66      1448 -->
53      1448 -->
87      1448 -->
95      952  -->
= 560 us

Change MTU:
# ifconfig eth1 mtu 9000 up

Warning: MTU 9000 can hang if any of the hardware in the connection is configured only for MTU 1500.
41. 2. b) TCP socket buffer
Set the maximums for the socket buffers and TCP windows;
if a maximum is reached, packets are dropped.
Linux socket buffer sizes:
sysctl -w net.core.wmem_max=8388608
sysctl -w net.core.rmem_max=8388608
TCP window sizes:
sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"
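sysctl -w settings are lost at reboot; a sketch of persisting the same values via /etc/sysctl.conf:
# append to /etc/sysctl.conf, then apply immediately with: sysctl -p
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 87380 8388608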
42. TCP window sizes
Max data sent/received before an acknowledgement is required
A subset of the TCP socket buffer sizes
Calculating:
window = latency * throughput
Ex: 1 ms latency, 1Gb NIC
= 1 Gb/s * 0.001 s = 1 Mb * 1 byte/8 bits = 125 KB
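The same bandwidth-delay product works for any link; a quick sketch with bc:
# window (bytes) = bandwidth (bits/s) * latency (s) / 8
$ echo "1000000000 * 0.001 / 8" | bc      # 1Gb/s at 1 ms -> 125 KB
125000
$ echo "10000000000 * 0.001 / 8" | bc     # 10Gb/s at 1 ms -> needs a ~1.25 MB window
1250000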
43. 2. c) Congestion window
delta   bytes   bytes   unack      cong      send
us      sent    recvd   bytes      window    window
31      1448            139760     144800    195200
33      1448            139760     144800    195200
29      1448            144104     146248    195200
31              0       145552     144800    195200
41      1448            145552 <   147696    195200
30              0       147000 >   144800    195200
22      1448            147000     76744     195200
Data collected with DTrace as the congestion window is drastically lowered.
Similar data could be gathered with snoop (load the capture into Wireshark).
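Without DTrace, a capture for Wireshark can be taken with tcpdump; a sketch (the interface name and the standard NFS port 2049 are assumptions about the setup):
$ tcpdump -i eth0 -w nfs.cap port 2049    # then open nfs.cap in Wireshark to graph the window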
44. 3. NFS mount options
a) forcedirectio - be careful
b) rsize, wsize - increase
c) actimeo=0, noac - don't use (unless on RAC)
Sun Solaris: rw,bg,hard,rsize=32768,wsize=32768,vers=3,[forcedirectio or llock],nointr,proto=tcp,suid
AIX:         rw,bg,hard,rsize=32768,wsize=32768,vers=3,cio,intr,timeo=600,proto=tcp
HP-UX:       rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,proto=tcp,suid,forcedirectio
Linux:       rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp,actimeo=0
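A minimal sketch of applying the Linux line above at mount time (server name, export, and mount point are placeholders):
# mount -t nfs -o rw,bg,hard,rsize=32768,wsize=32768,vers=3,nointr,timeo=600,tcp,actimeo=0 \
    nfs_server:/export/oradata /u02/oradata
# (a later slide recommends dropping actimeo=0 for single-instance Oracle)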
45. 3. a) forcedirectio
Skip the UNIX cache; read directly into the SGA.
Controlled by init.ora: filesystemio_options=SETALL (or DIRECTIO), except on HP-UX:
Sun Solaris: forcedirectio mount option forces direct I/O but is not required -
             filesystemio_options will set direct I/O without the mount option
AIX, Linux:  filesystemio_options sets direct I/O without a special mount option
HP-UX:       forcedirectio is the only way to set direct I/O;
             filesystemio_options has no effect
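A sketch of setting the parameter from the shell (the parameter is static, so the instance has to be restarted for it to take effect):
$ sqlplus / as sysdba <<EOF
alter system set filesystemio_options=SETALL scope=spfile;
EOF
# then restart the instance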
46. Direct I/O
Example query:
77,951 physical reads on the 2nd execution
(i.e. when the data should already be cached)
5 secs => no direct I/O
60 secs => direct I/O
So why use direct I/O?
47. [Diagram: Oracle data block cache (SGA) sitting on top of the Unix file system cache.
By default reads benefit from both caches; with direct I/O the Unix cache can't be used.]
48. Direct I/O
Advantages:
faster reads from disk
reduced CPU
reduced memory contention
faster access to data already in memory, in the SGA
Disadvantages:
less flexible
risk of paging / memory pressure
can't share memory between multiple databases
http://blogs.oracle.com/glennf/entry/where_do_you_cache_oracle
49. [Diagram: three SGA / Unix file system cache configurations and the example query's run time -
SGA plus Unix file system cache: 5 seconds;
direct I/O with the same small SGA (no Unix cache): 60 seconds;
direct I/O with the memory moved into a larger SGA: 2 seconds]
50. Where a read is satisfied:
SGA buffer cache
  hit -> done; miss ->
UNIX file system cache        hit = < 0.2 ms
  miss ->
NAS/SAN storage cache         hit = < 0.5 ms
  miss ->
disk read                     ~ 6 ms
51. 3. c) actimeo=0, noac
Disables the client-side file attribute cache
Increases the number of NFS calls, which increases latency
Avoid on single-instance Oracle
One MetaLink note says it's required on Linux (it calls it "actime");
another MetaLink note says it should be taken off
=> it should be taken off
52. 3. b) rsize/wsize
NFS transfer buffer size
Oracle says use 32K
Platforms support higher values, which can significantly impact throughput:
Sun Solaris: rsize=32768,wsize=32768 - max is 1M
AIX:         rsize=32768,wsize=32768 - max is 64K
HP-UX:       rsize=32768,wsize=32768 - max is 1M
Linux:       rsize=32768,wsize=32768 - max is 1M
On full table scans, using 1M has halved the response time over 32K.
db_file_multiblock_read_count has to be large enough to take advantage of the larger size.
53. Max rsize, wsize
1M: Sun Solaris SPARC, HP-UX, Linux
64K: AIX - patch available to increase it to 512K:
http://www-01.ibm.com/support/docview.wss?uid=isg1IV24594
http://www-01.ibm.com/support/docview.wss?uid=isg1IV24688
http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=6100-08-00-1241
http://www-933.ibm.com/support/fixcentral/aix/fixpackdetails?fixid=7100-02-00-1241
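Where the platform allows it, a sketch of a 1M mount on Linux (placeholder server and paths; verify the platform maximum first):
# mount -t nfs -o rw,bg,hard,rsize=1048576,wsize=1048576,vers=3,timeo=600,tcp \
    nfs_server:/export/oradata /u02/oradata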
54. NFS Overhead: Physical vs Cached I/O
[Bar chart, latency in ms (0-9): physical reads vs cached reads from SAN storage,
each compared as local vs NFS over a single switch, multiple switches, and routers.
The NFS + network overhead is barely visible next to a physical read
but dominates a cached read.]
55. NFS Overhead: Physical vs Cached I/O
Physical read: 100 us + 6.0 ms -> overhead is noise
Cached read: 100 us + 0.1 ms -> 2x slower
SAN cache is expensive -> use it for write cache
Target (client) cache is cheaper -> add more
Takeaway:
move cache to the client boxes
- cheaper
- quicker
save the SAN cache for write-back
56. Advanced
Linux:
rpc queues
iostat.py
Solaris:
client max read size is 32K - boost it to 1M
NFS server default threads: 16
tcpdump
57. Issues: Linux rpc queue
On Linux, in /etc/sysctl.conf set
sunrpc.tcp_slot_table_entries = 128
then apply it with
sysctl -p
then check the setting with
sysctl -A | grep sunrpc
NFS partitions have to be unmounted and remounted to pick it up
Not persistent across reboots
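A common workaround for the reboot problem (the sunrpc module can load after boot-time sysctl runs) is to set the value as a module option; a sketch, where the file name is a convention rather than a requirement:
# /etc/modprobe.d/sunrpc.conf - applied whenever the sunrpc module loads
options sunrpc tcp_slot_table_entries=128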
65. parsetcp.pl nfs_server.cap nfs_client.cap
Parsing nfs server trace: nfs_server.cap
type avg ms count
READ : 44.60, 7731
Parsing client trace: nfs_client.cap
type avg ms count
READ : 46.54, 15282
==================== MATCHED DATA ============
READ
type avg ms
server : 48.39,
client : 49.42,
diff : 1.03,
Processed 9647 packets (Matched: 5624 Missed: 4023)
66. parsetcp.pl server.cap client.cap
Parsing NFS server trace: server.cap
type avg ms count
READ : 1.17, 9042
Parsing client trace: client.cap
type avg ms count
READ : 1.49, 21984
==================== MATCHED DATA ============
READ
type avg ms count
server : 1.03
client : 1.49
diff : 0.46
67. Latency breakdown by layer
              Linux    Solaris   Tool          Source / notes
Oracle        58 ms    47 ms     oramon.sh     "db file sequential read" wait, which is basically
                                               a timing of pread for 8K random reads specifically
TCP client    1.5 ms   45 ms     tcpparse.sh   tcpdump on Linux, snoop on Solaris
Network       0.5 ms   1 ms      delta
TCP server    1 ms     44 ms     tcpparse.sh   snoop
NFS server    0.1 ms   2 ms      DTrace        dtrace nfs:::op-read-start / op-read-done
69. Summary
Test:
traceroute - routers and latency
netio - throughput
NFS is close to FC if the network topology is clean
Mount:
rsize/wsize at the maximum
avoid actimeo=0 and noac
use noatime
Jumbo Frames
Drawbacks:
requires a clean topology
NFS failover can take 10s of seconds
with the Oracle 11g dNFS client, mount issues are handled transparently
70. Conclusion: Give NFS some love
A gigabit switch can be anywhere from
10 to 50 times cheaper than an FC switch
http://github.com/khailey
71. dtrace
List the names of traceable probes:
dtrace -ln provider:module:function:name
-l = list instead of enable probes
-n = specify probe name to trace or list
-v = set verbose mode
Example:
$ dtrace -ln tcp:::send
5473 tcp ip tcp_output send
$ dtrace -lvn tcp:::receive
Argument Types
args[0]: pktinfo_t *
args[1]: csinfo_t *
args[2]: ipinfo_t *
args[3]: tcpsinfo_t *
args[4]: tcpinfo_t *
Speaker notes

http://www.zdnetasia.com/an-introduction-to-enterprise-data-storage-62010137.htm
A 16-port gigabit switch can be anywhere from 10 to 50 times cheaper than an FC switch and is far more familiar to a network engineer. Another benefit of iSCSI is that, because it uses TCP/IP, it can be routed over different subnets, which means it can be used over a wide area network for data mirroring and disaster recovery.

4Gb/s / 8b = 500 MB/s = 500 KB/ms = 50 KB/100us = 10 KB/20us; an 8K data block, call it 10 KB, = 20 us.
10Gb/s / 8b = 1250 MB/s = 1250 KB/ms = 125 KB/100us = 10 KB/8us; an 8K block, call it 10 KB, = 8 us.

http://www.unifiedcomputingblog.com/?p=108
1, 2, 4, and 8Gb Fibre Channel all use 8b/10b encoding: 8 bits of data get encoded into 10 bits of transmitted information, with the two extra bits used for data integrity. So if the link is 8Gb, how much do we actually get to use for data, given that 2 out of every 10 bits aren't "user" data? FC link speeds are somewhat of an anomaly in that they're actually faster than the stated link speed would suggest: original 1Gb FC is actually 1.0625 Gb/s, and each generation has kept this standard and multiplied it, so 8Gb FC is 8 x 1.0625, or an actual bandwidth of 8.5 Gb/s. 8.5 * 0.80 = 6.8 Gb of usable bandwidth on an 8Gb FC link. So 8Gb FC = 6.8Gb usable, while 10Gb Ethernet = 9.7Gb usable: FC = 850 MB/s, 10GbE = 1212 MB/s.
With FCoE, you add about 4.3% overhead over a typical FC frame: a full FC frame is 2148 bytes, and a maximum-sized FCoE frame is 2240 bytes. In other words, the overhead hit you take for FCoE is significantly less than the encoding mechanism used in 1/2/4/8Gb FC. FCoE/Ethernet headers take a 2148-byte full FC frame and encapsulate it into a 2188-byte Ethernet frame, a little less than 2% overhead. So if we take 10 Gb/s, knock it down for the 64/66b encoding (10 * 0.9696 = 9.69 Gb/s), then take off another 2% for Ethernet headers (9.69 * 0.98), we're at 9.49 Gb/s usable.

http://virtualgeek.typepad.com/virtual_geek/2009/06/why-fcoe-why-not-just-nas-and-iscsi.html
NFS and iSCSI are great, but there's no getting away from the fact that they depend on TCP retransmission mechanics (and in the case of NFS, potentially even higher in the protocol stack if you use it over UDP, though this is not supported in VMware environments today). Because of the intrinsic model of the protocol stack, the higher you go, the longer the latencies of various operations. One example (and it's only one): handling state/loss of connection always takes seconds, and normally many tens of seconds (assuming the target fails over instantly, which is not the case for most NAS devices). Doing it in shorter timeframes would be BAD, since in this case the target is an IP, and for an IP address to be non-reachable for seconds is NORMAL.