3. A very important SQL
MERGE INTO orders_table USING dual
ON (dual.dummy IS NOT NULL AND id = :1 AND p_id = :2
AND order_id = :3 AND relevance = :4 AND …
Typical elapsed time: 100 ms
*Bad* elapsed time: > 200 ms
5. SQL latency metrics

        Elapsed                 Elapsed Time
       Time (s)     Executions  per Exec (s) %Total   %CPU    %IO SQL Id
--------------- -------------- ------------- ------ ------ ------ -------------
          635.5         10,090           0.1   31.5   16.5   77.6 fskp2vz7qrza2
Module: MYmodule
merge into orders_table using dual on (dual.dummy is not null and id = :1
and p_id = :2 and order_id = :3 and relevance = :4 and …
25. Application side tracing

App:  start_exec = time()
      Exec: 4fucahsywt13m:19731969
Db:   (executes the call)
App:  Elapsed = time() – start_exec

o “True” user experience
o Precise (captures “everything”)
o (Lots of) DIY by developers
o Captures *not only* db time
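The application-side approach above can be sketched as a thin timing wrapper. This is an illustrative sketch only: `run_query` is a hypothetical stand-in for whatever database call the application actually makes.

```python
import time

def timed_exec(run_query, *args):
    """Wrap a database call and record its client-observed latency.

    run_query is a hypothetical placeholder for the app's real DB call.
    The elapsed time includes network and app-machine overhead, not only
    db time -- exactly the "true user experience" the slide describes.
    """
    start_exec = time.monotonic()
    result = run_query(*args)
    elapsed = time.monotonic() - start_exec
    return result, elapsed

# Collect per-run latencies for later percentile analysis.
latencies = []
result, elapsed = timed_exec(lambda: "row")   # stand-in for the MERGE call
latencies.append(elapsed)
```

The (lots of) DIY nature of this approach is visible here: every call site has to go through the wrapper, and the collected latencies still need to be shipped somewhere for analysis.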
26. Server side (10046) tracing

App:  Exec: 4fucahsywt13m:19731969
Db:   start_exec = time()
      Elapsed = time() – start_exec

o Precise (captures “everything”)
o Detailed: breakdown by events and SQL “stages”
o Cumbersome to process (lots of individual trace files and “events”)
27. Sampling

• v$sql.elapsed_time

Snapshot 1:
  Executions   Elapsed Time   CPU Time     IO Time       App Time
  58825        298,986,074    20,326,883   279,055,026   5,635

Snapshot 2:
  Executions   Elapsed Time   CPU Time     IO Time       App Time
  58826        299,003,156    20,327,883   279,071,108   5,635

Delta (the one run captured in between):
  Executions   Elapsed Time   CPU Time     IO Time       App Time
  1            17,082         1,000        16,082        0
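The delta between two consecutive snapshots isolates the single run that completed in between. A minimal sketch of that subtraction, using the numbers from the slide (v$sql times are in microseconds, so the captured run took about 17 ms):

```python
# Two consecutive snapshots of v$sql counters (values from the slide).
snap1 = {"executions": 58825, "elapsed": 298_986_074,
         "cpu": 20_326_883, "io": 279_055_026, "app": 5_635}
snap2 = {"executions": 58826, "elapsed": 299_003_156,
         "cpu": 20_327_883, "io": 279_071_108, "app": 5_635}

# Per-column delta: when exactly one execution completed between the
# snapshots, the delta is that individual run's time breakdown.
delta = {k: snap2[k] - snap1[k] for k in snap1}

print(delta)  # 1 execution: elapsed 17,082 us, cpu 1,000, io 16,082, app 0
```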
28. Sampling

with number_generator as (
  select level as l from dual connect by level <= 1000
), target_sqls as (
  select /*+ ordered no_merge use_nl(s) */
    …
  from number_generator i, gv$sql s
34. Sampling

with i_gen as (
  select level as l from dual
  connect by level <= &REPS
), target_sqls as (
  select /*+ ordered no_merge use_nl(s) */
    …
  from i_gen i, gv$sql s

o SQL access to data
o Simplified time breakdown
o Can capture “hours”
o Slightly imprecise (captures 90-95 % of runs)
o x$ data: “suspect”?
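The sampling approach boils down to a polling loop that repeatedly snapshots the cursor's counters and records a run whenever the execution count moves by exactly one. A sketch, where `fetch_counters` is a hypothetical stand-in for the gv$sql query above and `reps` mirrors the &REPS substitution variable:

```python
def sample_runs(fetch_counters, reps):
    """Poll v$sql-style counters and emit per-run elapsed deltas.

    fetch_counters is a hypothetical callable returning
    (executions, elapsed_time).  When two or more runs complete
    between polls, their times merge into one delta and are dropped
    here -- which is why this method is slightly imprecise and
    captures ~90-95% of runs rather than all of them.
    """
    runs = []
    prev_exec, prev_elapsed = fetch_counters()
    for _ in range(reps):
        execs, elapsed = fetch_counters()
        if execs == prev_exec + 1:          # exactly one run completed
            runs.append(elapsed - prev_elapsed)
        prev_exec, prev_elapsed = execs, elapsed
    return runs
```

Feeding it a canned sequence of snapshots shows the behavior: single-execution increments yield one latency sample each, while a jump of two or more executions between polls is skipped.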
39. Takeaways

• Percentiles are better performance metrics than averages
• Percentile calculation requires capturing (most of) individual SQL runs
• A number of ways exist to capture and measure individual SQL runs
Latency = “elapsed time”. How to monitor performance: define the goal (or SLA), choose a good metric, measure, then find and report problems.
AWR reports look at dba_hist_sqlstat.elapsed_time, which, in turn, comes from v$sql.elapsed_time. So, what can we judge from an “average”? How typical is it? What is the probability that the actual time is *much bigger*?
wikipedia: Normal distributions are … often used in the natural and social sciences for real-valued random variables whose distributions are not known.[1][2]
“Time frequency” distribution
Based on my samples, I really want to say: “Typically it’s not normal”, but to be conservative, let me just say: “it’s possible it’s not normal”
A slight adjustment for the “people feel variance, not the mean” maxim.
Percentiles: order all executions by elapsed time, then select the last N %.
Percentiles are usually defined by the lower edge
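The “order all executions, take the last N %” definition, with the percentile taken at the lower edge, can be sketched as follows (the latency values are illustrative only):

```python
import math

def percentile_lower_edge(times, pct):
    """pct-th percentile by the lower edge: the smallest value such
    that pct% of all runs are at or below it."""
    ordered = sorted(times)
    idx = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[idx]

# Illustrative latencies (ms): one slow outlier drags the average up.
runs_ms = [90, 95, 100, 100, 105, 110, 120, 150, 400, 900]
print(sum(runs_ms) / len(runs_ms))          # average: 217.0 ms
print(percentile_lower_edge(runs_ms, 50))   # p50: 105 ms
print(percentile_lower_edge(runs_ms, 90))   # p90: 400 ms
```

This also illustrates the takeaway about percentiles vs averages: the mean (217 ms) sits above what 70 % of the runs actually experienced, while p50 and p90 describe what users feel.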
Super helpful: send identifier strings along with your data: module, client_id, ECID, etc.
Server side tracing is often complementary to client side tracing: it makes it possible to confirm whether or not client side latency is *caused* by the database (as opposed to other factors: network, app machine, etc.)
Anything in v$sql/v$session can be captured, e.g. machine, current object, etc. I found that v$session.prev_sql_addr and v$session.prev_exec_id are pretty accurate.
ASH measures “events”, not overall SQL elapsed time. For short-duration events (e.g. “db file sequential read”), TIME_WAITED has no correlation to the overall SQL elapsed time (as, presumably, there can be multiple such events per execution).
Even though the probability of capturing an event grows with its wait time (reaching 100 % for waits longer than 1 second), there are typically *a lot* more short-running events than long-running ones. As a result, the long-running events are completely “swamped”, and it is not possible to conclude that a SQL run was long simply because one of its events was recorded in ASH.
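The swamping effect can be shown with a back-of-the-envelope sketch. Assuming 1-second ASH sampling, the expected number of ASH rows for an event class is roughly (event count) × (event duration in seconds); the workload numbers below are purely illustrative, not from the slides:

```python
# Assumed workload: many short waits vs a few long ones.
short_events = 100_000      # e.g. 2 ms "db file sequential read" waits
short_dur_s  = 0.002
long_events  = 10           # rare 2 s waits, each captured for certain
long_dur_s   = 2.0

# Expected ASH rows per class under 1-second sampling:
# roughly count * duration_in_seconds.
expected_short_rows = short_events * short_dur_s   # 200.0
expected_long_rows  = long_events * long_dur_s     # 20.0
```

Even though each individual 2-second wait is guaranteed to appear in ASH, the short waits still contribute ten times as many rows, so the presence of a given event in ASH says little about how long the owning SQL run took.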