A database server health check involves analyzing the hardware, operating system, database software, and application to ensure optimal performance. Key aspects to monitor include response times, server load, capacity, and signs of potential issues. Regular load testing can help identify performance bottlenecks by simulating expected usage patterns and measuring system behavior.
3. Program of Treatment
● What is a Healthy Database?
● Know Your Application
● Load Testing
● Doing a database server checkup
● hardware
● OS & FS
● PostgreSQL
● application
● Common Ailments of the Database Server
6. What is a Healthy Database
Server?
● Response Times
● lower than required
● consistent & predictable
● Capacity for more
● CPU and I/O headroom
● low server load
7. [Chart: median and max response times (0–30) vs. number of clients (25–250), with the expected load marked]
8. What is an Unhealthy Database
Server?
● Slow response times
● Inconsistent response times
● High server load
● No capacity for growth
9. [Chart: median and max response times (0–30) vs. number of clients (25–250) for an unhealthy server, with the expected load marked]
10. A healthy database server is able to maintain consistent and acceptable response times under expected loads, with margin for error.
11. [Chart: median response time (0–30) vs. number of clients (25–250)]
21. Application database usage
Which of these does your application do?
✔ small reads
✔ large sequential reads
✔ small writes
✔ large writes
✔ long-running procedures/transactions
✔ bulk loads and/or ETL
25. What Color Is My Application?
W ● Web Application (Web)
● DB smaller than RAM
● 90% or more simple queries
O ● Online Transaction Processing (OLTP)
● DB slightly larger than RAM to 1TB
● 20-40% small data write queries
● Some long transactions and complex read queries
D ● Data Warehousing (DW)
● Large to huge databases (100GB to 100TB)
● Large complex reporting queries
● Large bulk loads of data
● Also called "Decision Support" or "Business Intelligence"
26. What Color Is My Application?
W ● Web Application (Web)
● CPU-bound
● Ailments: idle connections/transactions, too many queries
O ● Online Transaction Processing (OLTP)
● CPU or I/O bound
● Ailments: locks, database growth, idle transactions,
database bloat
D ● Data Warehousing (DW)
● I/O or RAM bound
● Ailments: database growth, longer-running queries, memory usage growth
27. Special features required?
● GIS
● heavy CPU load for GIS functions
● lots of RAM for GIS indexes
● TSearch
● lots of RAM for indexes
● slow response time on writes
● SSL
● response time lag on connections
29. [Chart: requests per second (0–80) over a 24-hour day]
30. [Chart: requests per second (0–80) over a 24-hour day, with a downtime gap]
32. What to load test
● Load should be as similar as possible to your
production traffic
● You should be able to create your target level of
traffic
● better: incremental increases
● Test the whole application as well
● the database server may not be your weak point
33. How to Load Test
1. Set up a load testing tool
you'll need test servers for this
2. Turn on PostgreSQL, HW, application
monitoring
all monitoring should start at the same time
3. Run the test for a defined time
1 hour is usually good
4. Collect and analyze data
5. Re-run at higher level of traffic
34. Test Servers
● Must be as close as reasonable to production
servers
● otherwise you don't know how production will be
different
● there is no predictable multiplier
● Double them up as your development/staging
or failover servers
● If your test server is much smaller, then you
need to do a same-load comparison
36. Production Test
1. Determine the peak load hour on the
production servers
2. Turn on lots of monitoring during
that peak load hour
3. Analyze results
Pretty much your only choice without a test
server.
37. Issues with Production Test
● Not repeatable
−load won't be exactly the same ever again
● Cannot test target load
−just whatever happens to occur during that hour
−can't test incremental increases either
● Monitoring may hurt production performance
● Cannot test experimental changes
38. The Ad-Hoc Test
● Get 10 to 50 coworkers to open several
sessions each
● Have them go crazy on using the application
39. Problems with Ad-Hoc Testing
● Not repeatable
● minor changes in response times may be due to
changes in worker activity
● Labor intensive
● each test run shuts down the office
● Can't reach target levels of load
● unless you have a lot of coworkers
40. Siege
● HTTP traffic generator
● all test interfaces must be addressable as URLs
● useless for non-web applications
● Simple to use
● create a simple load test in a few hours
● Tests the whole web application
● cannot test database separately
● http://www.joedog.org/index/siege-home
41. pgReplay
● Replays your activity logs at variable speed
● get exactly the traffic you get in production
● Good for testing just the database server
● Can take time to set up
● need database snapshot, collect activity logs
● must already have production traffic
● http://pgreplay.projects.postgresql.org/
42. tsung
● Generic load generator in erlang
● a load testing kit rather than a tool
● Generate a tsung file from your activity logs using
pgFouine and test the database
● Generate load for a web application using custom
scripts
● Can be time consuming to set up
● but highly configurable and advanced
● very scalable - cluster of load testing clients
● http://tsung.erlang-projects.org/
43. pgBench
● Simple micro-benchmark
● not like any real application
● Version 9.0 adds multi-threading, customization
● write custom pgBench scripts
● run against real database
● Fairly ad-hoc compared to other tools
● but easy to set up
● ships with PostgreSQL
44. Benchmarks
● Many “real” benchmarks available
● DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.
● Useful for testing your hardware
● not useful for testing your application
● Often time-consuming and complex
45. Platform-specific
● Web framework or platform tests
● Rails: ActionController::PerformanceTest
● J2EE: OpenDemand, Grinder, many more
– JBoss, BEA have their own tools
● Zend Framework Performance Test
● Useful for testing specific application
performance
● such as performance of specific features, modules
● Not all platforms have them
49. monitoring application during
load test
● Collect response times
● with timestamp
● with activity
● Monitor hardware and utilization
● activity
● memory & CPU usage
● Record errors & timeouts
51. Checking Hardware
● CPUs and Cores
● RAM
● I/O & disk support
● Network
52. CPUs and Cores
● Pretty simple:
● number
● type
● speed
● L1/L2 cache
● Rules of thumb:
● fewer faster CPUs is usually better than more slower ones
● core != CPU
● thread != core
● virtual core != core
53. CPU calculations
● ½ to 1 core for OS
● ½ to 1 core for software raid or ZFS
● 1 core for postmaster and bgwriter
● 1 core per:
● DW: 1 to 3 concurrent users
● OLTP: 10 to 50 concurrent users
● Web: 100 to 1000 concurrent users
55. in praise of sar
● collects data about all aspects of HW usage
● available on most OSes
● but output is slightly different
● easiest tool for collecting basic information
● often enough for server-checking purposes
● BUT: does not report all data on all platforms
56. sar
CPUs: sar -P ALL and sar -u
Memory: sar -r and sar -R
I/O: sar -b and sar -d
network: sar -n
57. sar CPU output
Linux
06:05:01 AM CPU %user %nice %system %iowait %steal %idle
06:15:01 AM all 14.26 0.09 6.01 1.32 0.00 78.32
06:15:01 AM 0 14.26 0.09 6.01 1.32 0.00 78.32
Solaris
15:08:56 %usr %sys %wio %idle
15:09:26 10 5 0 85
15:09:56 9 7 0 84
15:10:26 15 6 0 80
15:10:56 14 7 0 79
15:11:26 15 5 0 80
15:11:56 14 5 0 81
58. Memory
● Only one statistic: how much?
● Not generally an issue on its own
● low memory can cause more I/O
● low memory can cause more CPU time
59. Memory Sizing
[Diagram: RAM divided among shared buffers, filesystem cache, and work_mem/maintenance_work_mem; each data page is either in buffer, in cache, or on disk]
60. Figure out Memory Sizing
● What is the active portion of your database?
● i.e. gets queried frequently
● How large is it?
● Where does it fit into the size categories?
● How large is the inactive portion of your
database?
● how frequently does it get hit? (remember backups)
61. Memory Sizing
● Other needs for RAM – work_mem:
● sorts and aggregates: do you do a lot of big ones?
● GIN/GiST indexes: these can be huge
● hashes: for joins and aggregates
● VACUUM
62. I/O Considerations
● Throughput
● how fast can you get data off disk?
● Latency
● how long does it take to respond to requests?
● Seek Time
● how long does it take to find random disk pages?
63. I/O Considerations
● Throughput
● important for large databases
● important for bulk loads
● Latency
● huge effect on small writes & reads
● not so much on large scans
● Seek Time
● important for small writes & reads
● very important for index lookups
64. I/O Considerations
● Web
● concerned about read latency & seek time
● OLTP
● concerned about write latency & seek time
● DW/BI
● concerned about throughput & seek time
67. Hardware RAID Sanity Check
● RAID 1 / 10, not 5
● Battery-backed write cache?
● otherwise, turn write cache off
● SATA < SCSI/SAS
● about ½ real throughput
● Enough drives?
● 4-14 for OLTP application
● 8-48 for DW/BI
68. Sw RAID / ZFS Sanity Check
● Enough CPUs?
● will need one for the RAID
● Enough disks?
● same as hardware raid
● Extra configuration?
● caching
● block size
69. NAS/SAN Sanity Check
● Check latency!
● Check real throughput
● drivers often a problem
● Enough network bandwidth?
● multipath or fiber required to get HW RAID
performance
70. SSD Sanity Check
● 1 SSD = 4 Drives
● relative performance
● Check write cache configuration
● make sure data is safe
● Test real throughput, seek times
● drivers often a problem
● Research durability stats
72. Network
● Throughput
● not usually an issue, except:
– iSCSI / NAS / SAN
– ETL & Bulk Load Processes
● remember that gigabit is only 100MB/s!
● Latency
● real issue for Web / OLTP
● consider putting app ↔ database on private
network
74. Just like real HW, except ...
● Low ceiling on #cpus, RAM
● Virtual Core < Real Core
● “CPU Stealing”
● last-generation hardware
● budget for 50% more cores
75. Cloud I/O Hell
● I/O tends to be very slow, erratic
● comparable to a USB thumb drive
● horrible latency, up to ½ second
● erratic, speeds go up and down
● RAID together several volumes on EBS
● use asynchronous commit
– or at least commit_siblings
76. #1 Cloud Rule
If your database
doesn't fit in RAM,
don't host it
on a public cloud
78. OS Basics
● Use recent versions
● large performance, scaling improvements in Linux &
Solaris in last 2 years
● Check OS tuning advice for databases
● advice for Oracle is usually good for PostgreSQL
● Keep up with information about issues &
patches
● frequently specific releases have major issues
● especially check HW drivers
79. OS Basics
● Use Linux, BSD or Solaris!
● Windows has poor performance and weak
diagnostic tools
● OSX is optimized for desktop and has poor
hardware support
● AIX and HPUX require expertise just to install, and
lack tools
80. Filesystem Layout
● One array / one big pool
● Two arrays / partitions
● OS and transaction log
● Database
● Three arrays
● OS & stats file
● Transaction log
● Database
81. Linux Tuning
● XFS > Ext3 (but not that much)
● Ext3 Tuning: data=writeback,noatime,nodiratime
● XFS Tuning: noatime,nodiratime
– for transaction log: nobarrier
● “deadline” I/O scheduler
● Increase SHMMAX and SHMALL
● to ½ of RAM
● Cluster filesystems also a possibility
● OCFS, RHCFS
82. Solaris Tuning
● Use ZFS
● no advantage to UFS anymore
● mixed filesystems causes caching issues
● set recordsize
– 8K small databases
– 128K large databases
– check for throughput/latency issues
83. Solaris Tuning
● Set OS parameters via “projects”
● For all databases:
● project.max-shm-memory=(priv,12GB,deny)
● For high-connection databases:
● use libumem
● project.max-shm-ids=(priv,32768,deny)
● project.max-sem-ids=(priv,4096,deny)
● project.max-msg-ids=(priv,4096,deny)
84. FreeBSD Tuning
● ZFS: same as Solaris
● definite win for very large databases
● not so much for small databases
● Other tuning per docs
93. How much recoverability do
you need?
● None:
● fsync=off
● full_page_writes=off
● consider using ramdrive
● Some Loss OK
● synchronous_commit = off
● wal_buffers = 16MB to 32MB
● Data integrity critical
● keep everything on
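As a sketch, the "Some Loss OK" tier above maps onto a postgresql.conf fragment like this (the threshold values are illustrative, not from the slide):

```
# Some data loss on crash is acceptable: commits return before the WAL
# reaches disk, so a crash can lose the last few transactions, but the
# database itself stays consistent.
synchronous_commit = off
wal_buffers = 16MB          # 16MB to 32MB per the slide

# Keep these ON unless the data is fully disposable:
fsync = on
full_page_writes = on
```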
99. Database Unit Tests
● You need them!
● you will be changing database objects and rewriting
queries
● find bugs in testing … or in production
● Various tools
● pgTAP
● Framework-level tests
– Rails, Django, Catalyst, JBoss, etc.
103. The Funnel
Application
Middleware
PostgreSQL
OS
HW
104. Check PostgreSQL Drivers
● Does the driver version match the PostgreSQL
version?
● Have you applied all updates?
● Are you using the best driver?
● There are several Python, C++ drivers
● Don't use ODBC if you can avoid it.
● Does the driver support cached plans & binary
data?
● If so, are they being used?
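Whether your driver exposes them varies, but "cached plans" ultimately means server-side prepared statements; in plain SQL the mechanism looks like this (the `users` table is hypothetical):

```sql
-- Plan once, execute many times; a driver with plan caching
-- does the equivalent of this under the hood.
PREPARE get_user (int) AS
    SELECT * FROM users WHERE id = $1;

EXECUTE get_user(42);
EXECUTE get_user(43);   -- reuses the already-prepared plan
```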
106. Check Caching
● Does the application use data caching?
● what kind?
● could it be used more?
● what is the cache invalidation strategy?
● is there protection from “cache refresh storms”?
● Does the application use HTTP caching?
● could it be used more?
107. Check Connection Pooling
● Is the application using connection pooling?
● all web applications should, and most OLTP
● external or built into the application server?
● Is it configured correctly?
● max. efficiency: transaction / statement mode
● make sure timeouts match
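For an external pooler, pgbouncer is a common choice (my assumption; the slide names no tool). A minimal sketch of transaction-mode pooling with matching timeouts, values illustrative:

```
[pgbouncer]
pool_mode = transaction        ; max efficiency, per the slide
default_pool_size = 20         ; illustrative value
server_idle_timeout = 60       ; keep in line with app-side timeouts
query_wait_timeout = 120
```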
108. Check Query Design
● PostgreSQL does better with fewer, bigger
statements
● Check for common query mistakes
● joins in the application layer
● pulling too much data and discarding it
● huge OFFSETs
● unanchored text searches
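One of the listed mistakes, huge OFFSETs, has a standard fix: keyset ("seek") pagination. A sketch against a hypothetical `items` table:

```sql
-- Slow: the server still reads and discards 100000 rows to return 20.
SELECT * FROM items ORDER BY id OFFSET 100000 LIMIT 20;

-- Faster: remember the last id the client saw and seek past it
-- via the index.
SELECT * FROM items WHERE id > 100020 ORDER BY id LIMIT 20;
```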
109. Check Transaction
Management
● Are transactions being used for loops?
● batches of inserts or updates can be 75% faster if
wrapped in a transaction
● Are transactions aborted properly?
● on error
● on timeout
● transactions being held open while non-database
activity runs
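The batching point can be made concrete: each statement outside an explicit transaction pays for its own commit, so a loop of inserts wrapped in one transaction commits once instead of once per row. Sketch with a hypothetical `log` table:

```sql
-- One commit (and one WAL flush) for the whole batch:
BEGIN;
INSERT INTO log (msg) VALUES ('row 1');
INSERT INTO log (msg) VALUES ('row 2');
-- ... many more rows ...
COMMIT;
```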
111. Check for them,
monitor for them
● ailments could throw off your response time
targets
● database could even “hit the wall”
● check for them during health check
● and during each checkup
● add daily/continuous monitors for them
● Nagios check_postgres.pl has checks for many of
these things
112. Database Growth
● Checkup:
● check both total database size and largest table(s)
size daily or weekly
● Symptoms:
● database grows faster than expected
● some tables grow continuously and rapidly
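The checkup itself is a pair of catalog queries along these lines, run daily or weekly with the numbers recorded for trending:

```sql
-- Total size of the current database:
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Largest tables, including their indexes and TOAST data:
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```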
113. Database Growth
● Caused By:
● faster than expected increase in usage
● “append forever” tables
● Database Bloat
● Leads to:
● slower seq scans and index scans
● swapping & temp files
● slower backups
114. Database Growth
● Treatment:
● check for Bloat
● find largest tables and make them smaller
– expire data
– partitioning
● horizontal scaling (if possible)
● get better storage & more RAM, sooner
116. Database Bloat
● Caused by:
● Autovacuum not keeping up
– or not enough manual vacuum
– often on specific tables only
● FSM set wrong (before 8.4)
● Idle In Transaction
● Leads To:
● slow response times
● unpredictable response times
● heavy I/O
117. Database Bloat
● Treatment:
● make autovacuum more aggressive
– on specific tables with bloat
● fix max_fsm_relations / max_fsm_pages
● check when tables are getting vacuumed
● check for Idle In Transaction
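Making autovacuum more aggressive "on specific tables with bloat" means per-table storage parameters (8.4+); assuming a bloated table named `orders`:

```sql
-- Vacuum after 5% of rows change instead of the default 20%:
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);
-- Optionally let each vacuum round do more work before napping:
ALTER TABLE orders SET (autovacuum_vacuum_cost_limit = 1000);
```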
119. Memory Usage Growth
● Caused by:
● Database Growth or Bloat
● work_mem limit too high
● bad queries
● Leads To:
● database out of cache
– slow response times
● OOM Errors (OOM Killer)
120. Memory Usage Growth
● Treatment
● Look at ways to shrink queries, DB
– partitioning
– data expiration
● lower work_mem limit
● refactor bad queries
● Or just buy more RAM
121. Idle Connections
select datname, usename, count(*)
from pg_stat_activity
where current_query = '<IDLE>'
group by datname, usename;
datname | usename | count
---------+---------+-------
track | www | 318
122. Idle Connections
● Caused by:
● poor session management in application
● wrong connection pool settings
● Leads to:
● memory usage for connections
● slower response times
● out-of-connections at peak load
123. Idle Connections
● Treatment:
● refactor application
● reconfigure connection pool
– or add one
124. Idle In Transaction
select datname, usename, max(now() - xact_start) as max_time, count(*)
from pg_stat_activity
where current_query ~* '<IDLE> in transaction'
group by datname, usename;
datname | usename | max_time | count
---------+----------+---------------+-------
track | admin | 00:00:00.0217 | 1
track | www | 01:03:06.0709 | 7
125. Idle In Transaction
● Caused by:
● poor transaction control by application
● abandoned sessions not being terminated fast
enough
● Leads To:
● locking problems
● database bloat
● out of connections
126. Idle In Transaction
● Treatment
● refactor application
● change driver/ORM settings for transactions
● change session timeouts & keepalives on pool,
driver, database
127. Longer Running Queries
● Detection:
● log slow queries to PostgreSQL log
● do a daily or weekly report (pgFouine)
● Symptoms:
● number of long-running queries in log increasing
● slowest queries getting slower
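Logging slow queries for the weekly pgFouine report is a single setting; the threshold here is illustrative, not from the slide:

```
# Log every statement that runs longer than 250 ms:
log_min_duration_statement = 250
```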
131. Too Many Queries
● Caused By:
● joins in middleware
● not caching
● poll cycles without delays
● other application code issues
● Leads To:
● out-of-CPU
● out-of-connections
132. Too Many Queries
● Treatment:
● characterize queries using logging
● refactor application
133. Locking
● Detection:
● log_lock_waits
● scan activity log for deadlock warnings
● query pg_stat_activity and pg_locks
● Symptoms:
● deadlock error messages
● number and time of lock_waits getting larger
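Querying pg_stat_activity and pg_locks for waiters can look roughly like this (column names match this deck's PostgreSQL era, which still had `procpid` and `current_query`; recent versions renamed them and added `pg_blocking_pids()`):

```sql
-- Sessions currently waiting on a lock, and what they were running:
SELECT l.pid, l.relation::regclass, l.mode, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;
```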
135. Locking
● Treatment
● analyze locks
● refactor operations taking locks
– establish a canonical order of updates for long
transactions
– use pessimistic locks with NOWAIT
● rely on cascade for FK updates
– not on middleware code
136. Temp File Usage
● Detection:
● log_temp_files = 100kB
● scan logs for temp files weekly or daily
● Symptoms:
● temp file usage getting more frequent
● queries using temp files getting longer
137. Temp File Usage
● Caused by:
● Sorts, hashes & aggregates too big for work_mem
● Leads to:
● slow response times
● timeouts
138. Temp File Usage
● Treatment
● find swapping queries via logs
● set work_mem higher for that ROLE, or
● refactor them to need less memory, or
● buy more RAM
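Setting work_mem higher "for that ROLE" rather than globally keeps the memory risk contained to the sessions that need it; with a hypothetical `reporting` role:

```sql
-- Only the reporting role gets the bigger sort/hash budget:
ALTER ROLE reporting SET work_mem = '256MB';
-- Or scope it to one database instead:
ALTER DATABASE reports SET work_mem = '128MB';
```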