3Measurements of health and disease_MCTD.pdf
1. Data summarization
Measures of central tendency and
Dispersion [MCTD]
1/2/2023 Data summarization 1
Emiru Merdassa (MSc, Assistant Professor)
2. Learning Objectives
By the end of this session, students will be able to compute and interpret:
o Mean
o Median
o Mode
o Range (R)
o Variance and Standard deviation.
o Coefficient of variation (C.V)
o Interquartile Range
3. Introduction
• Compiling and presenting data in tabular or graphical form does
not give complete information about the data collected.
• We need to “summarize” the entire data set in a single figure
that gives an overall idea of the data.
• Summary measures describe the data in terms of where the values
concentrate and how much they vary.
• We use these summary figures to draw conclusions about the
reference population from which the sample was drawn.
4. I. Arithmetic Mean
• It is the average of the data: the sum of all observations divided
by the number of observations,
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
• Example: for a random sample of 10 ages,
$\bar{x} = \dfrac{42 + 28 + 28 + 61 + 31 + 23 + 50 + 38 + 32 + 37}{10} = \dfrac{370}{10} = 37$
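The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the names `ages` and `x_bar` are illustrative and do not come from the slides.

```python
from statistics import mean

# The ten ages from the example above
ages = [42, 28, 28, 61, 31, 23, 50, 38, 32, 37]

# Arithmetic mean: sum of the observations divided by how many there are
x_bar = sum(ages) / len(ages)
print(sum(ages), len(ages), x_bar)

# The standard library computes the same quantity
print(mean(ages))
```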
5. Properties of the Mean
o Uniqueness: For a given set of data there is one and only one mean.
o Simplicity: It is easy to understand and to compute.
o Affected by extreme values: since all values enter into the
computation.
Example:
The values 115, 110, 119, 117, 121 and 126 have a mean of 118. The
values 75, 75, 80, 80 and 280 also have a mean of 118, but here it is
not representative of the data set as a whole, because the single
extreme value 280 pulls the mean upward.
6. Median
• It is the middle value when all data values are ordered
from smallest to largest.
• For the same random sample, the ordered observations are 23,
28, 28, 31, 32, 37, 38, 42, 50, 61.
• Since n = 10, the median is the (n + 1)/2 = 5.5th observation, i.e. the
average of the 5th and 6th values: (32 + 37)/2 = 34.5
7. …Median
Properties of the Median:
• Uniqueness: For a given set of data there is one and
only one median.
• Simplicity: It is easy to calculate.
• It is not affected by extreme values as is the mean.
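The contrast between the mean and the median can be illustrated with the two data sets from the Properties of the Mean slide. The sketch below is one possible way to code the "order the data, take the middle" rule; the function name `median_of` and the variable names are illustrative only.

```python
from statistics import mean, median

def median_of(values):
    """Middle value of the ordered data; the average of the two
    middle values when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

typical = [115, 110, 119, 117, 121, 126]  # no extreme values
skewed = [75, 75, 80, 80, 280]            # one extreme value (280)

for data in (typical, skewed):
    # Both sets have the same mean, but very different medians
    print("mean:", mean(data), "median:", median_of(data))
```

The extreme value 280 pulls the mean of the second set up to 118, while its median stays at 80.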
8. Mode
• It is the value which occurs most frequently.
• If all values are different there is no mode.
• Sometimes there is more than one mode.
Example:
• For the same random sample, the value 28 occurs twice and every
other value occurs once, so 28 is the mode.
Properties of the Mode
• Sometimes, it is not unique.
• It may be used for describing qualitative data.
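A short sketch of how the mode could be found by counting frequencies. The helper `modes` is a hypothetical name, and the blood-group list is made-up data used only to show that the mode also works for qualitative values.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list when every
    value occurs exactly once (no mode). Assumes a non-empty input."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return [value for value, count in counts.items() if count == highest]

ages = [42, 28, 28, 61, 31, 23, 50, 38, 32, 37]
print(modes(ages))        # one mode
print(modes([1, 2, 3]))   # no repeated value, so no mode

# Hypothetical qualitative data with two modes
blood_groups = ["A", "B", "A", "B", "O"]
print(modes(blood_groups))
```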
9. Exercises
Using the data below, calculate:
1) Arithmetic Mean
2) Median
3) Mode
4) Range
5) IQR
6) Standard Deviation
Ages of Women in Clinic
23 31 55 43 55 19 17 44 43 37
11. Measures of Spread…
• Measures of spread are:
o Range (R)
o Variance and standard deviation
o Coefficient of variation (C.V)
o Interquartile range
• Measures of relative position (quantiles and percentiles)
12. Introduction
• Knowledge of central tendency alone is not sufficient for
complete understanding of distribution.
• Measures of spread tell us how far or how close together
the data points are in a sample.
• Measures of variability are measures of spread that tell us
how varied our data points are from the average of the
sample.
13. Range (R)
Range = Largest value - Smallest value
Note:
o The range depends on only two values, the largest and the smallest
o It is highly sensitive to outliers
o Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50
o Range = 66 − 38 = 28
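The range calculation and its sensitivity to outliers can be sketched as follows. The extra value 120 is invented purely for illustration and is not part of the slide data.

```python
data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

# Range = largest value - smallest value
value_range = max(data) - min(data)
print(value_range)

# A single extreme observation changes the range sharply
with_outlier = data + [120]  # 120 is a made-up outlier
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)
```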
14. Variance
• It measures dispersion as the scatter of the values
about their mean.
a) Sample variance ($S^2$):
$S^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean
• Find the sample variance of the ages (the ten values from the range
example), for which $\bar{x} = 56$.
Solution:
$S^2 = \dfrac{(43 - 56)^2 + (66 - 56)^2 + \cdots + (50 - 56)^2}{10 - 1} = \dfrac{810}{9} = 90$
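The sample variance can be worked through step by step in Python, using the ten values from the range example. This is a sketch with illustrative variable names, not a prescribed method.

```python
from statistics import mean, variance

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
n = len(data)
x_bar = mean(data)

# Squared deviation of each observation from the sample mean
squared_deviations = [(x - x_bar) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

# Sample variance divides by n - 1, not n
s2 = sum_of_squares / (n - 1)
print(x_bar, sum_of_squares, s2)

# statistics.variance implements the same n - 1 formula
print(variance(data))
```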
15. Standard Deviation
• It is the square root of the variance.
a) Sample standard deviation (SD):
$S = \sqrt{S^2}$
b) Population standard deviation ($\sigma$):
$\sigma = \sqrt{\sigma^2}$
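The difference between the two standard deviations can be sketched with the standard library, again using the ten values from the range example. Treating those values as a whole population is an assumption made here only to show the contrast: the population formula divides the sum of squares by n rather than n − 1.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

# Sample SD: square root of the sample variance (divisor n - 1)
sample_sd = stdev(data)

# Population SD: square root of the population variance (divisor n)
population_sd = pstdev(data)

print(round(sample_sd, 2), round(population_sd, 2))
```

For the same data the sample SD is always slightly larger than the population SD, because it divides by the smaller number n − 1.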
17. Measures of Dispersion…
Consider the following two sets of data:
A: 177 193 195 209 226 Mean = 200
B: 192 197 200 202 209 Mean = 200
Two or more data sets may have the same mean and/or median yet
differ considerably in how widely their values are spread.
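A brief sketch comparing sets A and B: the means agree, while the range and standard deviation show how differently the values are spread. The variable names are illustrative.

```python
from statistics import mean, median, stdev

A = [177, 193, 195, 209, 226]
B = [192, 197, 200, 202, 209]

for name, values in (("A", A), ("B", B)):
    spread = max(values) - min(values)  # range of the set
    print(name, "mean:", mean(values), "median:", median(values),
          "range:", spread, "SD:", round(stdev(values), 2))
```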
18. Measures of Dispersion…
A measure of dispersion conveys information about the
amount of variability present in a set of data.
Note:
1. If all the values are the same, there is no dispersion.
2. If the values differ, there is dispersion.
3. If the values are close to each other, the dispersion is small.
4. If the values are widely scattered, the dispersion is greater.
19. Standard deviation
• Caution must be exercised when using standard deviation as a
comparative index of dispersion
Weights of newborn elephants (Kg)
929 853
878 939
895 972
937 841
801 826
n = 10, $\bar{X}$ = 887.1, SD = 56.50
Weights of newborn mice (Kg)
0.72 0.42
0.53 0.31
0.59 0.38
0.79 0.96
1.06 0.89
n = 10, $\bar{X}$ = 0.68, SD = 0.255
• It is incorrect to say that elephants show greater variation in
birth weight than mice merely because their standard deviation is higher.
20. The Coefficient of Variation (C.V)
• It is a measure used to compare the dispersion in two
sets of data; it is independent of the unit of
measurement.
$CV = \dfrac{S}{\bar{X}} \times 100$
where
S: sample standard deviation (SD)
$\bar{X}$: sample mean
21. Coefficient of Variation
• The coefficient of variation expresses the standard deviation
relative to its mean
Weights of newborn elephants (Kg)
929 853
878 939
895 972
937 841
801 826
n = 10, $\bar{X}$ = 887.1, SD = 56.50, CV = 0.0637 (6.37%)
Weights of newborn mice (Kg)
0.72 0.42
0.53 0.31
0.59 0.38
0.79 0.96
1.06 0.89
n = 10, $\bar{X}$ = 0.68, SD = 0.255, CV = 0.375 (37.5%)
Note:
Relative to their mean, mice show greater birth weight variation.
22. Example:
• Suppose two samples of human males yield the following data:
                     Sample 1        Sample 2
Age                  25-year-olds    11-year-olds
Mean weight          145 pounds      80 pounds
Standard deviation   10 pounds       10 pounds
We wish to know which is more variable.
Solution:
C.V (Sample 1) = (10/145) × 100 = 6.9%
C.V (Sample 2) = (10/80) × 100 = 12.5%
• The weights of the 11-year-olds (Sample 2) are therefore more
variable relative to their mean.
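The coefficient of variation is simple enough to sketch as a one-line function. The name `coefficient_of_variation` is illustrative; the inputs are the summary figures quoted on the slides (the two male samples, and the elephant and mouse means and SDs), not values recomputed from raw data.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation relative to the mean.
    Assumes the mean is not zero."""
    return sd / mean * 100

# Two samples of human males (same SD, different means)
cv_sample_1 = coefficient_of_variation(10, 145)  # 25-year-olds
cv_sample_2 = coefficient_of_variation(10, 80)   # 11-year-olds
print(round(cv_sample_1, 1), round(cv_sample_2, 1))

# Newborn elephants and mice, using the quoted mean and SD
cv_elephants = coefficient_of_variation(56.50, 887.1)
cv_mice = coefficient_of_variation(0.255, 0.68)
print(round(cv_elephants, 2), round(cv_mice, 1))
```

Because the units cancel in the ratio, the same function applies whatever unit the measurements are in.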
23. When to use the coefficient of variation
o When the comparison groups have very different means
o When different units of measurement are involved, e.g. group 1
is measured in mm and group 2 in gm (the CV is suitable for
comparison because it is unit-free)
o In such cases, the SD should not be used for comparison
24. Measures of Relative Position
Measures of relative position locate an observation in relation to
the other observations.
Percentiles divide the data set into 100 equal groups.
Suppose a data set is arranged in ascending (or descending)
order. The pth percentile is a number such that p% of the
observations fall below it and (100 − p)% of the
observations fall above it. For example:
45% of observations are below the 45th percentile
55% of observations are above the 45th percentile
26. Percentile
Data: 13, 11, 10, 13, 11, 10, 8, 12, 9, 9, 8, 9
What is the percentile rank of 12?
Solution: First, arrange the values from smallest to largest.
The ordered set is: 8, 8, 9, 9, 9, 10, 10, 11, 11, 12, 13, 13
The number of values below 12 is 9 and the total number of values in
the data set is 12. Thus, using the formula,
$\text{percentile rank} = \dfrac{(\text{number of values below } x) + 0.5}{n} \times 100 = \dfrac{9 + 0.5}{12} \times 100 \approx 79$
That is, the value 12 corresponds to approximately the
79th percentile
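The percentile-rank calculation can be sketched in Python. The formula used here, (number of values below x + 0.5) / n × 100, is an assumption: it is a common definition and it reproduces the 79th percentile quoted above, but other conventions exist. The function name `percentile_rank` is illustrative.

```python
def percentile_rank(values, x):
    """Percentile rank of x, assuming the convention
    (count of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

data = [13, 11, 10, 13, 11, 10, 8, 12, 9, 9, 8, 9]

below_12 = sum(1 for v in data if v < 12)  # values strictly below 12
rank = percentile_rank(data, 12)
print(below_12, len(data), round(rank))
```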
27. Measures of Relative Position…
1. Tertiles:
• Two points that divide and order a sample variable into three
categories, each containing a third of the population
(e.g., high, medium, low).
28. 2. Quartiles:
• Three points that divide and order a sample variable into four
categories, each containing a fourth of the population.
• The 25th, 50th, and 75th percentiles of a variable are used to
categorize it into quartiles.
3. Quintiles:
• Four points that divide and order a sample variable into five
categories, each containing a fifth of the population.
• The 20th, 40th, 60th, and 80th percentiles of a variable are used to
categorize it into quintiles.
29. Measures of Relative Position…
4. Deciles:
• Nine points that divide and order a sample variable
into ten categories, each containing a tenth of the
population.
• The 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and
90th percentiles of a variable are used to categorize
it into deciles
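The cut points for tertiles, quartiles, quintiles and deciles can all be obtained from one function in Python's standard library. The sketch below uses made-up data (the integers 1 to 99), chosen only because the cut points are easy to read; `statistics.quantiles` uses its default "exclusive" method, which is one of several conventions for locating quantiles.

```python
from statistics import quantiles

# Hypothetical data: the integers 1..99
values = list(range(1, 100))

tertiles = quantiles(values, n=3)    # 2 cut points, 3 groups
quartiles = quantiles(values, n=4)   # 3 cut points, 4 groups
quintiles = quantiles(values, n=5)   # 4 cut points, 5 groups
deciles = quantiles(values, n=10)    # 9 cut points, 10 groups

for name, cuts in (("tertiles", tertiles), ("quartiles", quartiles),
                   ("quintiles", quintiles), ("deciles", deciles)):
    print(name, len(cuts), cuts)
```

In each case k groups need k − 1 cut points, which matches the "two points", "three points", "four points" and "nine points" in the definitions above.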
30. Quartile
• Quartiles are the values that divide a list of numbers into quarters:
• Put the list of numbers in order
• Then cut the list into four equal parts
• Example: 5, 7, 4, 4, 6, 2, 8
• Put them in order: 2, 4, 4, 5, 6, 7, 8
• Cut the list into quarters:
– Quartile 1 (Q1) = 4
– Quartile 2 (Q2), which is also the Median, = 5
– Quartile 3 (Q3) = 7
31. Interquartile Range
IQR = 3rd quartile − 1st quartile
= 75th percentile − 25th percentile
Q1 is the (n+1)/4th and Q3 the 3(n+1)/4th ordered observation
Robust to outliers
Covers the middle 50% of observations
For the quartile example (2, 4, 4, 5, 6, 7, 8), the Interquartile Range is:
IQR = Q3 − Q1 = 7 − 4 = 3
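The quartile positions and the IQR for the seven-value example can be sketched as follows. The direct indexing below only works because (n + 1)/4 and 3(n + 1)/4 are whole numbers for n = 7; for other sample sizes some interpolation rule is needed, which is what `statistics.quantiles` applies. Variable names are illustrative.

```python
from statistics import quantiles

data = [5, 7, 4, 4, 6, 2, 8]
ordered = sorted(data)
n = len(ordered)

# Positions of Q1 and Q3 among the ordered observations (1-based)
q1_position = (n + 1) // 4       # 2nd observation when n = 7
q3_position = 3 * (n + 1) // 4   # 6th observation when n = 7

q1 = ordered[q1_position - 1]
q3 = ordered[q3_position - 1]
iqr = q3 - q1
print(q1, q3, iqr)

# Cross-check with the standard library (default "exclusive" method)
lib_q1, lib_q2, lib_q3 = quantiles(data, n=4)
print(lib_q1, lib_q2, lib_q3)
```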
32. Exercise
The incubation period of smallpox in 9 patients was
found to be 14, 13, 11, 15, 10, 7, 9, 12 and 10.
Find:
1. Mean, Median & Mode
2. Recommend the best measure of central tendency (MCT)
3. Range & IQR
4. $S^2$ & SD
5. C.V