1. A lesson on Statistics:
Data – Types, description and
interpretation
Dr Andrea Josephine R,
2nd year MD PG,
Department of Pediatrics,
ESIC Medical College & PGIMSR, Chennai.
2. Topics
Types of data
Measures of central tendency
Measures of dispersion
Measures of distribution
Characterizing diagnostic tests – The test of a test
3. Types of data
1. Nominal:
Qualitative data
Characteristics of a variable – Categories
Mutually exclusive, exhaustive
No implied order
E.g. Sex: Male/Female; Demographics: Urban/Suburban/Rural
4. Types of data
2. Ordinal:
Qualitative data – Categories
Rank/Order into a progression, mutually exclusive,
exhaustive
Size of the interval between ranks is not measurable and not necessarily equal
E.g. Satisfaction with treatment – Very satisfied /
Somewhat satisfied / Somewhat dissatisfied / Very
dissatisfied
5. Types of data
3. Interval:
Quantitative data
Meaningful intervals
No absolute zero
Ratio between 2 measurements not meaningful
E.g. Temperature scale: In degrees Celsius, the difference
between 2 measurements is quantifiable, but the ratio is not
meaningful; 0 °C does not imply a total absence of heat
6. Types of data
4. Ratio:
Quantitative data
Absolute zero
Meaningful ratios
E.g. Age (years), Weight (kg), Blood pressure (mmHg)
7. Types of data
1. Discrete:
Only whole numbers possible / distinct categories
E.g. Number of patients, number of syringes used, Gender,
hair colour
2. Continuous:
Any value in a continuum
E.g. Weight, Height, Serum creatinine
8. Measures of central tendency
1. Mean:
Used for interval & ratio data
Summation of all values divided by number of values in
the sample
$\bar{x} = \frac{\sum x}{n}$
9. Measures of central tendency
2. Median:
Used for ordinal data; also suits skewed interval and ratio data
Half of the values lie above it, half below it
If n is odd, arrange in order: Middle value = median
If n is even, arrange and take mean of middle 2 values
10. Measures of central tendency
3. Mode:
Used for nominal data
Most frequently appearing category
If 2 categories appear equally, bimodal
Can be multimodal
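As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The numeric values are the ones used in this lesson's mean deviation example; the blood groups are an assumed nominal sample added only for illustration.

```python
import statistics

# Ratio-level sample (values from the lesson's mean deviation example)
values = [3, 6, 6, 7, 8, 11, 15, 16]

mean_value = statistics.mean(values)      # sum of all values / number of values
median_value = statistics.median(values)  # n is even: mean of the middle two (7 and 8)

# Nominal sample (assumed categories, for illustration only)
blood_groups = ["A", "B", "A", "B", "O"]
modes = statistics.multimode(blood_groups)  # all categories tied for most frequent

print(mean_value, median_value, modes)
```

Here "A" and "B" each appear twice, so the nominal sample is bimodal.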
11. Measures of dispersion
1. Range:
Difference between highest and lowest values
E.g. A set of values 102, 105, 109, 111 and 120. Range is
not 102-120. Range = 120-102 = 18.
12. Measures of dispersion
2. Interquartile range:
Range of the middle 50% of the data
Difference between the upper and lower quartile
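The range and interquartile range can be sketched as follows, reusing the five values from the range example. Quartile conventions differ between textbooks and software, so the `method="inclusive"` setting below is an assumption, not the lesson's stated rule.

```python
import statistics

values = [102, 105, 109, 111, 120]  # values from the range example

value_range = max(values) - min(values)  # highest minus lowest

# method="inclusive" is one common quartile convention (an assumption here);
# other conventions can give slightly different quartiles for small samples.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the data

print(value_range, q1, q3, iqr)
```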
13. Measures of dispersion
3. Mean deviation:
Average of the absolute deviations from mean.
Mean deviation = $\frac{\sum |x - \bar{x}|}{n}$
14. Example
Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16
Step 1: Find the mean: (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16)/8
= 72/8 = 9
Step 2: Find the distance of each value from that mean: 6, 3, 3, 2, 1, 2, 6, 7
15. Example (Contd.)
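Step 3: Find the mean of those distances:
Mean Deviation = (6 + 3 + 3 + 2 + 1 + 2 + 6 + 7)/8 = 30/8 = 3.75
So, the mean = 9, and the mean deviation = 3.75
On average, the values lie 3.75 away from the mean
Why take absolute value? Without it, the positive and negative
deviations cancel out and always sum to zero.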
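The worked example can be checked with a few lines of Python. The sketch also sums the signed deviations, which shows what happens if the absolute value is skipped.

```python
values = [3, 6, 6, 7, 8, 11, 15, 16]  # values from the worked example

mean_value = sum(values) / len(values)

signed_deviations = [x - mean_value for x in values]
absolute_deviations = [abs(d) for d in signed_deviations]

sum_of_signed = sum(signed_deviations)  # positive and negative deviations cancel
mean_deviation = sum(absolute_deviations) / len(values)

print(mean_value, sum_of_signed, mean_deviation)
```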
16. Measures of dispersion
4. Variance and Standard deviation:
Variance ($s^2$) = Mean of the squares of the deviations
$s^2 = \frac{\sum (x - \bar{x})^2}{n}$
Standard deviation (SD) = $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
The smaller the SD, the closer the values cluster around
the mean
If a constant is added to all values, mean changes; Variance
and SD remain the same.
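A short sketch of both points, using the values from the mean deviation example. `statistics.pvariance` and `statistics.pstdev` divide by n, which matches the formula on this slide; many packages default to dividing by n - 1 instead. The constant 10 is an arbitrary choice for illustration.

```python
import math
import statistics

values = [3, 6, 6, 7, 8, 11, 15, 16]

# Population formulas (divide by n), as on the slide
variance = statistics.pvariance(values)
sd = statistics.pstdev(values)

# Add an arbitrary constant to every value
shifted = [x + 10 for x in values]
shifted_mean = statistics.mean(shifted)
shifted_variance = statistics.pvariance(shifted)
shifted_sd = statistics.pstdev(shifted)

print(variance, round(sd, 3))
print(shifted_mean, shifted_variance, round(shifted_sd, 3))
```

The mean moves from 9 to 19, while the variance and SD are unchanged.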
17. Measures of dispersion
5. Coefficient of variation(CV):
CV = SD/Mean
The units of SD and mean are the same, so they cancel: CV is a
unit-free value.
If both SD and mean are multiplied by a constant, CV
remains the same (Useful in ratio measurements).
Not useful for interval-level data: adding a constant to each value
changes the mean but not the SD, so the CV changes (it decreases
for a positive constant).
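The behaviour of CV under scaling and shifting can be sketched as below. The helper name `coefficient_of_variation`, the factor 2, and the shift of 10 are illustrative assumptions.

```python
import statistics

def coefficient_of_variation(data):
    """CV = SD / mean, using the population SD (divide by n)."""
    return statistics.pstdev(data) / statistics.mean(data)

values = [3, 6, 6, 7, 8, 11, 15, 16]

cv_original = coefficient_of_variation(values)
cv_scaled = coefficient_of_variation([2 * x for x in values])    # change of unit
cv_shifted = coefficient_of_variation([x + 10 for x in values])  # change of zero point

print(round(cv_original, 3), round(cv_scaled, 3), round(cv_shifted, 3))
```

Multiplying every value by a constant leaves the CV unchanged, while adding a positive constant lowers it.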
18. Skewness
Describes the asymmetry of the frequency-distribution
curve.
Value of 0 – Unskewed, Positive value – skewed to the
right, Negative value – skewed to the left.
Refers to the side of the longer tail, NOT that of the bulk
of the data.
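The sign convention can be illustrated with a small sketch. The lesson does not say which skewness coefficient it has in mind; the moment coefficient (third central moment divided by the 1.5 power of the second) is assumed here, and the three samples are illustrative.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

right_tailed = [3, 6, 6, 7, 8, 11, 15, 16]  # longer tail towards high values
left_tailed = [-x for x in right_tailed]    # mirror image of the sample above
symmetric = [1, 2, 3, 4, 5]

skew_right = skewness(right_tailed)
skew_left = skewness(left_tailed)
skew_symmetric = skewness(symmetric)

print(round(skew_right, 3), round(skew_left, 3), skew_symmetric)
```

The sample with the longer right tail gives a positive value, its mirror image a negative one, and the symmetric sample gives 0.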
19. Kurtosis
Refers to the peak of the frequency-distribution curve.
Mesokurtosis – Normal distribution curve
Leptokurtosis – Peaked; Platykurtosis – Flat
20. Sensitivity
Ability of a test to correctly identify patients with disease
Sensitivity = $\frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$
[Pie chart: Sensitivity. Patients picked up, 80; undiagnosed diseased population, 20]
21. Specificity
Ability of the test to correctly identify patients who are
disease-free/healthy
Specificity = $\frac{\text{True negatives}}{\text{True negatives} + \text{False positives}}$
[Pie chart: Specificity. Healthy, 80; healthy mislabelled diseased, 20]
22. Positive predictive value
Proportion of patients with positive test results who truly have
disease.
PPV = $\frac{\text{True positives}}{\text{True positives} + \text{False positives}}$
Answers the question: “I have tested
positive. Am I really diseased?”
[Pie chart: PPV. Truly diseased, 80%; healthy mislabelled diseased, 20%]
23. Negative predictive value
Proportion of patients with negative test results who are
truly disease-free
NPV = $\frac{\text{True negatives}}{\text{True negatives} + \text{False negatives}}$
Answers the question: “I have tested
negative. Am I really disease-free?”
[Pie chart: NPV. Truly healthy, 80%; diseased mislabelled healthy, 20%]
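The four definitions can be computed from one 2x2 table. The counts below are assumed for illustration (100 diseased and 100 healthy patients, chosen to echo the 80/20 split in the charts); they are not study data.

```python
# Illustrative 2x2 counts (assumed): 100 diseased, 100 healthy
true_positives = 80   # diseased, test positive
false_negatives = 20  # diseased, test negative
true_negatives = 80   # healthy, test negative
false_positives = 20  # healthy, test positive

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
ppv = true_positives / (true_positives + false_positives)
npv = true_negatives / (true_negatives + false_negatives)

print(sensitivity, specificity, ppv, npv)
```

Sensitivity and specificity are calculated down the disease columns (diseased only, healthy only), while PPV and NPV are calculated across the test-result rows (positives only, negatives only).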
24. PPV and NPV
Highly dependent on the prevalence of a disease in a
given population.
Less reliable in rare diseases.
Less transferable from one population to another.
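The dependence on prevalence can be sketched by holding the test fixed and varying only how common the disease is. The function name `predictive_values`, the 80% sensitivity and specificity, and the three prevalence values are assumptions for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Same test (assumed 80% sensitive, 80% specific), three assumed prevalences
results = {}
for prevalence in (0.50, 0.10, 0.01):
    results[prevalence] = predictive_values(0.80, 0.80, prevalence)
    ppv, npv = results[prevalence]
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

With these assumed figures the PPV falls from 0.8 at 50% prevalence to under 0.05 at 1% prevalence, even though the test itself has not changed.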
25. Likelihood ratio
Combines sensitivity and specificity
Positive likelihood ratio defines the extent to which a
positive test result increases the likelihood of having
disease.
LR+ = $\frac{\text{Sensitivity}}{1 - \text{Specificity}}$
If the LR+ of a test is 1.36, a positive result is 1.36 times
as likely in a patient with the disease as in a patient
without it.
26. LR – Interpretation:
LR above 5 to 10: Significantly increases the likelihood of the
disease
LR between 0.2 and 5 (especially if close to 1): Changes the
likelihood of the disease little
LR below 0.1 to 0.2: Significantly decreases the likelihood
of the disease
27. Likelihood ratio
Negative likelihood ratio defines the extent to which a
negative test result decreases the likelihood of having
disease.
LR- = $\frac{1 - \text{Sensitivity}}{\text{Specificity}}$
A useful test has an LR- below 1. If the LR- of a test is 0.5,
a negative result is half as likely in a patient with the
disease as in a patient without it.
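Both ratios follow directly from sensitivity and specificity. The sketch below assumes a test that is 80% sensitive and 80% specific; the function name `likelihood_ratios` is illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from a test's sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)  # P(test+ | disease) / P(test+ | no disease)
    lr_negative = (1 - sensitivity) / specificity  # P(test- | disease) / P(test- | no disease)
    return lr_positive, lr_negative

# Assumed test characteristics, for illustration only
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.80, specificity=0.80)

print(round(lr_pos, 2), round(lr_neg, 2))
```

For this assumed test, LR+ is about 4 and LR- is about 0.25, so neither result shifts the likelihood of disease strongly.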
28. LR – Points:
Independent of disease prevalence
Specific to the test being used
Can be applied to the individual patient to evaluate how
worthwhile it is to perform a given test
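One way to apply a likelihood ratio to an individual patient is to convert the pre-test probability to odds, multiply by the LR, and convert back. This odds form of Bayes' theorem is not spelled out in the lesson; the 20% pre-test probability and the LR values of 4 and 0.25 are assumptions for illustration.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a probability of disease using a likelihood ratio (odds form of Bayes' theorem)."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

pre_test = 0.20  # assumed clinical estimate before testing

after_positive = post_test_probability(pre_test, 4.0)   # assumed LR+ of 4
after_negative = post_test_probability(pre_test, 0.25)  # assumed LR- of 0.25

print(round(after_positive, 3), round(after_negative, 3))
```

If neither result would move the probability across a treatment or investigation threshold, the test adds little for that patient.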
29. Receiver Operating Characteristic (ROC) Curve
If the cut-off for a test is raised, both true and
false positive rate would decrease.
True positive rate = Sensitivity
False positive rate = 1 – Specificity.
A graph between the 2 is ROC curve.
30. ROC curve – Area under curve
Area under curve – Used to assess overall accuracy of a
test
Value of 1 – Perfect test, 100% sensitivity and specificity
Value of 0.5 – Zero diagnostic capability, Line
of zero discrimination, no better than tossing
a coin.
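A minimal sketch of how an ROC curve and its area can be built by sweeping the cut-off. It assumes that a higher test value points towards disease and that a patient tests positive when the value is at or above the cut-off; the scores, labels, and function names are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, strictest cut-off first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nobody tests positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up test values; label 1 = diseased, 0 = healthy
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)

# A test that separates the two groups completely
perfect_auc = area_under_curve(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))

print(curve)
print(auc, perfect_auc)
```

Lowering the cut-off moves along the curve from (0, 0) towards (1, 1): both the true and the false positive rate rise together.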
31. Using ROC curve and AUC to choose
between tests
[Figure: ROC curves of the tests being compared]
32. References
Norman GR, Streiner DL. Biostatistics: The Bare Essentials. 3rd ed.
Shi L. Health Services Research Methods. 2nd ed.
Bewick V, Cheek L, Ball J. Statistics review 13: Receiver operating characteristic curves. Crit Care. 2004;8(6):508-512.
Lalkhen AG, McCluskey A. Clinical tests: sensitivity and specificity. Contin Educ Anaesth Crit Care Pain. 2008;8(6):221-223.