Lecture-2 (discriptive statistics).ppt
1. NURSING Dream ● Discover ● Deliver
Lemma Derseh (BSc., MPH)
University of Gondar
College of medicine and health science
Department of Epidemiology and
Biostatistics
Descriptive statistics
Statistical Methods (branches of statistics)
Biostatistics
Descriptive Statistics: collection, organizing, summarizing and presenting of data
Inferential Statistics: making inferences, hypothesis testing, determining relationships and making predictions
Lemma Derseh, Department of Epidemiology and Biostatistics, University of Gondar
Descriptive Statistics
1. Involves
– Collecting Data
– Presenting Data
– Characterizing Data
2. Purpose
– Describe Data
x̄ = 74.5, S² = 213
[Figure: bar chart of class size (vertical axis, 0 to 100) by batch (1st to 4th) for one department]
Descriptive statistics cont…
Types of descriptive statistics
Tables/charts/graphs (pictorial measures)
Measures of central tendency (numerical summary measures)
Measures of variability (numerical summary measures)
Tables/charts/graphs
Tables are used in categorical variables or
categorized numerical data
Tables:
Frequency (for nominal and ordinal data)
Relative frequency (for nominal and ordinal data)
Cumulative frequencies (for ordinal data)
The methods of describing data differ depending on the
type of the data itself (i.e. Numerical or Categorical).
Describing categorical variables … cont
Frequency is the number of observations in each category
The relative frequency of a class is the proportion or
percentage of the data that falls in that class
E.g. 1: The blood types of 30 patients are given as follows:
A AB B B A O O AB AB B O A A B B A AB A O AB
B AB AB O A AB AB O A O
Construct a table for it
Type Frequency Relative frequency
A 8 0.267
B 6 0.20
AB 9 0.30
O 7 0.233
Total 30 1.00
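The table above can be reproduced with a short script. This is a minimal sketch, not part of the original lecture; the function name `frequency_table` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Blood types of the 30 patients, as listed in the example above.
blood_types = (
    "A AB B B A O O AB AB B O A A B B A AB A O AB "
    "B AB AB O A AB AB O A O"
).split()

def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    n = len(values)
    return {category: (f, f / n) for category, f in counts.items()}

table = frequency_table(blood_types)
for category in ["A", "B", "AB", "O"]:
    f, rf = table[category]
    print(f"{category:<4}{f:>3}{rf:>8.3f}")
```

The relative frequencies sum to 1 because each count is divided by the same total n = 30.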
Distribution of birth weight of newborns between 1976-1996 at TAH.
BWT Freq. Rel.Freq(%) Cum. Freq Cum.rel.freq.(%)
Very low 43 0.4 43 0.4
Low 793 8.0 836 8.4
Normal 8870 88.9 9706 97.3
Big 268 2.7 9974 100
Total 9974 100
Cumulative relative frequency is relevant for ordinal data
Consider for example, the variable birth weight with levels
‘Very low ’, ‘Low’, ‘Normal’ and ‘Big’.
The cumulative frequency of a class is the sum of the
frequency for that class and all the previous classes.
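As a rough illustration of the definition, the cumulative columns of the birth weight table can be built by a running sum over the ordered categories. The function name `cumulative_table` is an assumption made for this sketch.

```python
from itertools import accumulate

# Ordered categories and frequencies from the birth weight table.
levels = ["Very low", "Low", "Normal", "Big"]
freqs = [43, 793, 8870, 268]

def cumulative_table(freqs):
    """Return rows of (freq, rel. freq %, cum. freq, cum. rel. freq %)."""
    total = sum(freqs)
    cum = list(accumulate(freqs))  # running sum over the ordered classes
    return [(f, 100 * f / total, c, 100 * c / total) for f, c in zip(freqs, cum)]

rows = cumulative_table(freqs)
for level, (f, rf, cf, crf) in zip(levels, rows):
    print(f"{level:<10}{f:>6}{rf:>7.1f}{cf:>7}{crf:>7.1f}")
```

The running sum is only meaningful because the categories have a natural order, which is why cumulative frequencies are reserved for ordinal (or grouped numerical) data.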
Charts
Charts are used only for categorical variables
Bar charts
The successive bars are separated (not continuous)
Pie charts
Each sector of a circle indicates a category of data
Charts cont…
Bar Chart
Bar charts display the frequency distribution for
nominal or ordinal data.
The various categories into which the observations fall
are represented along the horizontal axis, and the height of each
bar represents the frequency of the corresponding category.
Fig. 1 Bar chart for blood type of 30 patients
Pie chart
Pie chart displays the frequency of nominal or ordinal
variables.
The various categories of the variable will be represented
by the sector of the circle.
The area of each sector is proportional to the frequency
of the corresponding category of the variable
Fig. 3. Pie chart showing the frequency distribution of the
variable blood group
Categorizing Numeric data
In order to present and organize numeric type of data using tables or
graphs, we need to group the dataset as follows:
Number of class: the number of categories the table will have
Class limit: The range for each class
Lower class limit
Upper class limit
Class boundary: Continuous range of the class limit. It is obtained by
subtracting 0.5 from the lower class limit and adding 0.5 to the upper class limit
(for whole-number data; use 0.05 for data recorded to one decimal place)
Lower class boundary
Upper class boundary
Class mark: The average of lower and upper class limit.
Sturges' rule
Select a set of continuous, non-overlapping intervals such
that each value in the set of observations can be placed in
one, and only one, of the intervals.
$$K = 1 + 3.322\,\log_{10} n$$
$$W = \frac{L - S}{K}$$
– Where K = number of class intervals
– n = number of observations
– W = width of the class interval
– L = the largest value
– S = the smallest value
Sturges' rule cont…
For datasets with integer values, subtract 0.5 from the lower class
limits and add 0.5 to the upper class limits to find the class boundaries
The answer obtained by applying Sturges' rule should not be
regarded as final, but should be considered as a guide only.
The number of class intervals specified by the rule should be
increased or decreased for convenience and clear presentation
Example 1
The blood lead levels, measured in μg/dl, of a sample of 88
individuals living in a region are given as follows (numbers
in blue are for females and those in black are for males)
20,21, 22,22,23,23,23,24,24,24,24,25,25,25,25,25,26,26,26,26,26,27,
27,27,27,27,27,28,28,28,28,28,28,28,28,29,29,29,29,29,30,30,30,30,
30,30,30,30,30,31,31,31,31,31,31,31,32,32,32,32,32,33,33,33,33,33,
33,33,34,34,34,34,35,35,35,35,36,36,36,36,36,37,37,37,37,38,38,39
Construct frequency distribution for the data.
Solution:
$$K = 1 + 3.322\,\log_{10} n = 1 + 3.322\,\log_{10}(88) = 7.46 \approx 7$$
$$W = \frac{L - S}{K} = \frac{39 - 20}{7} = \frac{19}{7} = 2.7 \approx 3$$
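The same calculation can be sketched in code. The rounding choices here (K to the nearest integer, W rounded up) are assumptions chosen to match the worked solution; the rule itself is only a guide, and the function name `sturges_classes` is hypothetical.

```python
import math

def sturges_classes(n, largest, smallest):
    """Return (raw K, rounded K, raw W, rounded-up W) from Sturges' rule."""
    k_raw = 1 + 3.322 * math.log10(n)
    k = round(k_raw)              # assumption: round K to the nearest integer
    w_raw = (largest - smallest) / k
    w = math.ceil(w_raw)          # assumption: round W up to a convenient integer
    return k_raw, k, w_raw, w

# Blood lead level example: n = 88, L = 39, S = 20.
k_raw, k, w_raw, w = sturges_classes(n=88, largest=39, smallest=20)
print(f"K = {k_raw:.2f} -> {k}, W = {w_raw:.1f} -> {w}")
```

Rounding W up rather than down helps ensure that K classes of width W cover the whole range from S to L.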
Solution
Blood lead level
Class limit   Class boundaries   Mi   Frequency   RF      CF   RCF
20-22         19.5-22.5          21   4           4/88    4    4/88
23-25         22.5-25.5          24   12          12/88   16   16/88
26-28         25.5-28.5          27   19          19/88   35   35/88
29-31         28.5-31.5          30   21          21/88   56   56/88
32-34         31.5-34.5          33   16          16/88   72   72/88
35-37         34.5-37.5          36   13          13/88   85   85/88
38-40         37.5-40.5          39   3           3/88    88   88/88
Where:
RF = relative frequency
Mi = class mark
CF = cumulative frequency
RCF = relative cumulative frequency
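A minimal sketch of how such a table could be built from the raw values follows. The dictionary `value_counts` is a compact re-encoding of the 88 listed observations, and `grouped_table` is a hypothetical helper that assumes integer data with class limits starting at 20 and a width of 3.

```python
# value -> number of times it occurs among the 88 listed observations
value_counts = {20: 1, 21: 1, 22: 2, 23: 3, 24: 4, 25: 5, 26: 5, 27: 6,
                28: 8, 29: 5, 30: 9, 31: 7, 32: 5, 33: 7, 34: 4, 35: 4,
                36: 5, 37: 4, 38: 2, 39: 1}
data = [v for v, c in value_counts.items() for _ in range(c)]

def grouped_table(values, start, width, k):
    """Build k classes of integer width `width`, beginning at `start`."""
    n = len(values)
    rows, cum = [], 0
    for i in range(k):
        lo = start + i * width          # lower class limit
        hi = lo + width - 1             # upper class limit
        f = sum(lo <= x <= hi for x in values)
        cum += f
        rows.append({
            "limits": (lo, hi),
            "boundaries": (lo - 0.5, hi + 0.5),  # integer data: +/- 0.5
            "mark": (lo + hi) / 2,
            "f": f, "rf": f / n, "cf": cum, "rcf": cum / n,
        })
    return rows

rows = grouped_table(data, start=20, width=3, k=7)
for r in rows:
    print(r["limits"], r["boundaries"], r["mark"], r["f"], r["cf"])
```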
Graphs
Some examples are:
Histogram,
Frequency polygon,
Cumulative Relative Frequency Curve etc
Histograms
Histograms are frequency distributions with continuous class
intervals that have been turned into graphs.
The area of each column is proportional to the number of
observations in that interval
Example
The distribution of the blood lead level of 88 individuals
Blood LL No. of Individuals
19.5-22.5 4
22.5-25.5 12
25.5-28.5 19
28.5-31.5 21
31.5-34.5 16
34.5-37.5 13
37.5-40.5 3
[Figure: histogram of the blood lead level; horizontal axis: blood lead level, marked at the class boundaries 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5]
Frequency polygons
Instead of drawing bars for each class interval, sometimes
a single point is drawn at the midpoint of each class
interval and consecutive points are joined by straight lines.
Graphs drawn in this way are called frequency polygons
(line graphs).
Frequency polygons cont…
Frequency polygon for the blood lead level of study
participants
Frequency polygon of blood lead level for
males and females
Frequency polygons are superior to histograms for
comparing two or more sets of data.
Cumulative frequency curve (ogive)
The horizontal axis displays the different categories/intervals
The vertical axis displays cumulative (relative) frequency.
A point is placed at the true upper limit of each interval; the
height represents the cumulative relative frequency
associated with that interval. The points are then connected
by straight lines.
Like frequency polygons, cumulative frequency curve may be
used to compare sets of data.
Cumulative frequency curve can also be used to obtain
percentiles of a set of data.
Cumulative frequency curve cont…
Cumulative relative frequency curve for the blood lead
level of study participants
[Figure: cumulative relative frequency curve; vertical axis: cumulative frequency (proportion of individuals)]
The graph begins at the lower boundary of the first class.
The graph ends at the upper boundary of the last class.
Box plots
A visual picture called a box (box-and-whisker) plot can be
used to convey a fair amount of information about the
distribution of a set of data.
It is used as an exploratory data analysis tool
The box shows the distance between the first and the
third quartiles,
The median is marked as a line within the box and
The end lines show the minimum and maximum values
respectively
Box plots cont…
A box plot displays the five-number summary:
The minimum entry
Q1
Q2 (median)
Q3
The maximum entry
The quartiles are the values which divide the ordered distribution
into four parts such that there are an equal number of
observations in each part.
Q1 = the [(n+1)/4]th ordered observation
Q2 = the [2(n+1)/4]th ordered observation
Q3 = the [3(n+1)/4]th ordered observation
Example: Use the following age data of 15 patients to draw
a box-and-whisker plot.
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
With n = 15: Min = 35, Q1 = 4th value = 37, Q2 = 8th value = 43, Q3 = 12th value = 48, Max = 55
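The five-number summary for the age data can be sketched using the position rule Qk = [k(n+1)/4]th ordered observation. Linear interpolation for fractional positions is an assumption added here (it is not needed for n = 15), and both function names are hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; fractional positions are interpolated
    linearly between neighbours (an assumption, not stated in the slides)."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the k(n+1)/4 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("the position rule needs at least 3 observations")
    q1, q2, q3 = (value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))
    return ordered[0], q1, q2, q3, ordered[-1]

ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]
print(five_number_summary(ages))  # (35, 37, 43, 48, 55)
```

Statistical packages use several slightly different quartile conventions, so results for other data sets may differ a little from this rule.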
Illustration of Box-plot using the age of 15 patients
Notice the distribution of data in each quarter (the distance between quartiles)
A box-plot indicating the distribution of blood
lead level of individuals by sex
Measures of central tendency
It is often useful to summarize, in a single number or statistic,
the general location of the data or the point at which the data
tend to cluster.
Such statistics are called measures of location or measures of
central tendency.
We describe three of them: the mean, the median and the mode.
Arithmetic mean
The arithmetic mean, usually abbreviated to ‘mean’ is the sum of
the observations divided by the number of observations.
If x1, x2, ..., xn are n observed values of a sample, then
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
a) Ungrouped mean
Population mean, if the x's are population observations:
$$\mu = \frac{\sum x}{N}$$
Example: Blood lead level for 88 sample individuals
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{88} x_i}{88} = \frac{20 + 21 + 22 + \dots + 39}{88} = 29.92$$
Arithmetic Mean cont…
b) Grouped data
In calculating the mean from grouped data, we assume that
all values falling into a particular class interval are located
at the mid-point of the interval. It is calculated as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
k = the number of class intervals
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
Arithmetic Mean cont…
Example: Arithmetic mean for grouped data of blood lead level
Blood lead level (CB)   Class mark (Mi)   Frequency
19.5-22.5               21                4
22.5-25.5               24                12
25.5-28.5               27                19
28.5-31.5               30                21
31.5-34.5               33                16
34.5-37.5               36                13
37.5-40.5               39                3
$$\bar{x} = \frac{\sum_{i=1}^{7} m_i f_i}{\sum_{i=1}^{7} f_i} = \frac{(21 \times 4) + (24 \times 12) + \dots + (39 \times 3)}{4 + 12 + \dots + 3} = 29.86$$
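The grouped mean can be checked with a few lines of code. This is only a sketch; `grouped_mean` is a hypothetical helper that treats every observation in a class as sitting at the class mark, as assumed in the text.

```python
# Class marks and frequencies from the blood lead level table.
marks = [21, 24, 27, 30, 33, 36, 39]
freqs = [4, 12, 19, 21, 16, 13, 3]

def grouped_mean(marks, freqs):
    """Weighted mean of class marks, weighted by class frequency."""
    return sum(m * f for m, f in zip(marks, freqs)) / sum(freqs)

print(round(grouped_mean(marks, freqs), 2))  # 29.86
```

The grouped value (29.86) differs slightly from the ungrouped mean (29.92) because the class marks only approximate the individual observations.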
Properties of the arithmetic mean
The mean can be used as a summary measure for both discrete
and continuous data; in general, however, it is not appropriate
for either nominal or ordinal data.
For a given set of data there is one and only one arithmetic
mean.
Algebraic sum of the deviations of the given values from their
arithmetic mean is always zero.
The arithmetic mean is greatly affected by the extreme values.
In grouped data if any class interval is open, arithmetic mean
cannot be calculated.
Median
With the observations arranged in an increasing or decreasing order,
the median is defined as the middle observation.
Ungrouped data
If the number of observations is odd, the median is defined as the
[(n+1)/2]th observation.
If the number of observations is even, the median is the average of
the two middle values, i.e. the (n/2)th and [(n/2)+1]th values.
Example, where n is even: 19, 20, 20, 21, 22, 24, 27, 27, 27, 34
Then, the median = (22 + 24)/2 = 23
The ungrouped median for the blood lead level data is the average
of the 44th & 45th observation; which is (30+30)/2 =30
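The odd and even cases can be sketched as follows; the function name `median` is chosen here for illustration (Python's standard library also provides `statistics.median`).

```python
def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the [(n+1)/2]th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)th and [(n/2)+1]th

even_example = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]
odd_example = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]
print(median(even_example))  # 23.0
print(median(odd_example))   # 43
```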
Median Cont…
Grouped data
In calculating the median from grouped data, we assume that
the values within a class-interval are evenly distributed
through the interval.
– The first step is to locate the class interval in which the
median lies.
– Find n/2 and take the first class interval whose
cumulative frequency is at least n/2.
(Note: every later class interval also has a cumulative frequency ≥ n/2;
the median class is the first one that does.)
Median for Grouped data …cont
To find a unique median value, use the following interpolation formula.
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
where,
Lm = lower true class boundary of the interval containing the median
Fc = cumulative frequency of the interval just below the median class
interval
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations
Median for grouped data cont…
Example
Using the data on the blood lead level of 88 individuals, the
grouped median is:
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W = 28.5 + \left(\frac{44 - 35}{21}\right) 3 = 29.79$$
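The interpolation can be sketched in code as below. The function name `grouped_median` is hypothetical; it locates the first class whose cumulative frequency reaches n/2 and then interpolates within that class, assuming values are spread evenly across it.

```python
# True class boundaries and frequencies for the blood lead level data.
boundaries = [(19.5, 22.5), (22.5, 25.5), (25.5, 28.5), (28.5, 31.5),
              (31.5, 34.5), (34.5, 37.5), (37.5, 40.5)]
freqs = [4, 12, 19, 21, 16, 13, 3]

def grouped_median(boundaries, freqs):
    """Median by linear interpolation within the median class."""
    n = sum(freqs)
    half = n / 2
    cum_before = 0  # Fc: cumulative frequency below the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if cum_before + f >= half:  # first class reaching n/2
            return lower + (half - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("frequencies must be positive")

print(round(grouped_median(boundaries, freqs), 2))  # 29.79
```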
Properties of median
The median can be used as a summary measure for
ordinal, discrete and continuous data; in general,
however, it is not appropriate for nominal data.
There is only one median for a given set of data
Median is a positional average and hence it is not
drastically affected by extreme values (It is robust or
resistant to extreme values)
Median can be calculated even in the case of open end
intervals
It is not a good representative of data if the number of
items is small
Mode
Any observation of a variable at which the distribution reaches a
peak is called a mode.
Most distributions encountered in practice have one peak and
are described as uni-modal.
E.g. Consider the example of ten numbers
19 21 20 20 34 22 24 27 27 27
In the above data set, the mode is 27
The mode of grouped data, usually refers to the modal class,
(the class interval with the highest frequency)
If a single value for the mode of grouped data must be
specified, it is taken as the mid point of the modal class interval
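A small sketch of both cases follows: the mode of raw values and the modal class of grouped data. The helper names are hypothetical. Note that `modes` returns every value that shares the highest count, so it returns several values for multi-modal data (and all values when every count is equal, the case in which a mode is usually said not to exist).

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def modal_class_midpoint(boundaries, freqs):
    """Midpoint of the class interval with the highest frequency."""
    lower, upper = boundaries[freqs.index(max(freqs))]
    return (lower + upper) / 2

ten_numbers = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
boundaries = [(19.5, 22.5), (22.5, 25.5), (25.5, 28.5), (28.5, 31.5),
              (31.5, 34.5), (34.5, 37.5), (37.5, 40.5)]
freqs = [4, 12, 19, 21, 16, 13, 3]

print(modes(ten_numbers))                       # [27]
print(modal_class_midpoint(boundaries, freqs))  # 30.0
```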
Properties of mode
The mode can be used as a summary measure for
nominal, ordinal, discrete and continuous data; in general,
however, it is more appropriate for nominal and ordinal
data.
It is not affected by extreme values
It can be calculated for distributions with open end classes
Sometimes its value is not unique
The main drawback of mode is that it may not exist
Measures of variability (Dispersion)
In order to fully understand the nature of the distribution of data set,
both measures of location and dispersion are important
Some measures of variability are: range, inter-quartile range,
variance, standard deviation and the coefficient of variation.
Range:
The range is the difference between the largest and the smallest
observations in the data set.
Being determined by only the two extreme observations, use of the
range is limited because it tells us nothing about how the data
between the extremes are spread.
Example 1: We use the data set of 10 numbers:
19, 21, 20, 20, 34, 22, 24, 27, 27, 27
The range = 34 – 19 = 15
Quartiles and Inter-quartile Range, Percentiles
• The inter-quartile range (IQR) is the difference between the
third and the first quartiles.
IQR = Q3 – Q1
• Example: Consider the age data of 15 patients to find IQR
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
Q1 = 37, Q2 = 43, Q3 = 48
• IQR = 48 – 37 = 11
Quartiles and Inter-quartile Range, Percentiles
Percentiles divide the ordered data into 100 parts with an equal number of
observations in each part.
It follows that the 25th percentile is the first quartile, the 50th
percentile is the median and the 75th percentile is the third
quartile.
Variance
A good measure of dispersion should make use of all the data.
Intuitively, a good measure could be derived by combining, in
some way, the deviations of each observation from the mean.
The variance achieves this by averaging the squares
of the deviations from the mean.
Variance cont…
The population variance of a population data set of N entries is
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
The sample variance of the set x1, x2, ..., xn of n
observations with mean x̄ is
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Note: The sum of the deviations from the mean is zero, so it
is more useful to square the deviations, add them and then
average them (to get the variance).
Standard Deviation
Being the square of the deviations, the variance is limited as
a descriptive statistic because it is not in the same units as
in the observations.
By taking the square root of the variance, we obtain a
measure of dispersion in the original units.
It is usually denoted by s.d. or simply S, and the formula is
given by:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Examples
Example 1: Let us use the age data of 15 individuals
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
$$S^2 = \frac{\sum_{i=1}^{15} (x_i - \bar{x})^2}{15 - 1} = 38.12, \quad \text{where } \bar{x} = 42.47$$
Example 2: Consider the blood lead level of the 88
individuals given before. Find its variance.
Solution
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{88} x_i}{88} = \frac{20 + 21 + 22 + \dots + 39}{88} = 29.92$$
$$S^2 = \frac{\sum_{i=1}^{88} (x_i - \bar{x})^2}{88 - 1} = 19.94$$
Computed instead from the grouped data (class marks and frequencies), the mean is 29.86 and the variance is 20.46.
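Both examples can be checked numerically. This is a sketch under stated assumptions: `sample_variance` follows the n − 1 formula given earlier, while `grouped_sample_variance` (squared deviations of class marks weighted by class frequency) is a standard extension that the slides do not spell out.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def grouped_sample_variance(marks, freqs):
    """Same idea for grouped data: class marks weighted by frequency."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(marks, freqs)) / n
    return sum(f * (m - mean) ** 2 for m, f in zip(marks, freqs)) / (n - 1)

ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]
marks = [21, 24, 27, 30, 33, 36, 39]   # blood lead level class marks
freqs = [4, 12, 19, 21, 16, 13, 3]

age_var = sample_variance(ages)
lead_var_grouped = grouped_sample_variance(marks, freqs)
print(round(age_var, 2), round(math.sqrt(age_var), 2))  # variance, std. deviation
print(round(lead_var_grouped, 2))
```

Taking the square root of the variance returns the standard deviation in the original units (years for the age data).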
Coefficient of variation
When we want to compare the variability in two sets of data, the
standard deviation, which measures absolute variation, may
mislead us, especially if the two data sets:
have different units of measurement, or
have widely different means
The coefficient of variation (CV) gives relative variation & is the
best measure used to compare the variability in two sets of data.
CV is the ratio of the standard deviation to the mean, often
presented multiplied by 100%.
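As a rough illustration, the CV can be computed as the sample standard deviation divided by the mean, times 100. The example below re-expresses the age data in months (a made-up change of units, used only for illustration) to show that the CV does not depend on the unit of measurement, while the standard deviation does.

```python
import statistics

def coefficient_of_variation(values):
    """CV (%) = sample standard deviation / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

ages_years = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]
ages_months = [12 * a for a in ages_years]  # same data, different unit

cv_years = coefficient_of_variation(ages_years)
cv_months = coefficient_of_variation(ages_months)
print(f"SD: {statistics.stdev(ages_years):.2f} years, "
      f"{statistics.stdev(ages_months):.2f} months")
print(f"CV: {cv_years:.1f}% vs {cv_months:.1f}%")
```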
Mean, standard deviation and the
normal distribution
For unimodal, moderately symmetrical, sets of data
approximately:
68% of observations lie within 1 standard deviation of
the mean.
95% of observations lie within 2 standard deviations of
the mean.
i.e. Normally Distributed Data
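The rule of thumb can be checked against the blood lead level data. This is only a sketch; `value_counts` is a compact re-encoding of the 88 listed observations and `share_within` is a hypothetical helper.

```python
import statistics

# value -> number of times it occurs among the 88 blood lead observations
value_counts = {20: 1, 21: 1, 22: 2, 23: 3, 24: 4, 25: 5, 26: 5, 27: 6,
                28: 8, 29: 5, 30: 9, 31: 7, 32: 5, 33: 7, 34: 4, 35: 4,
                36: 5, 37: 4, 38: 2, 39: 1}
data = [v for v, c in value_counts.items() for _ in range(c)]

def share_within(values, k):
    """Proportion of values lying within k sample SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(mean - k * sd <= x <= mean + k * sd for x in values)
    return inside / len(values)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(data, k):.1%}")
```

For these data the shares come out at roughly 64%, 98% and 100%, reasonably close to the approximate 68%, 95% and 99.7% expected for unimodal, roughly symmetric data.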
The Empirical Rule
[Figure: bell-shaped curve with the horizontal axis marked at x̄ - 3s, x̄ - 2s, x̄ - s, x̄, x̄ + s, x̄ + 2s, x̄ + 3s]
68% of data are within 1 standard deviation of the mean (34% on each side)
95% of data are within 2 standard deviations of the mean (a further 13.5% on each side)
99.7% of data are within 3 standard deviations of the mean (a further 2.4% on each side, leaving 0.1% in each tail)
Choosing Appropriate measures
If data are symmetric, with no serious outliers, use mean
and standard deviation.
If data are skewed, and/or have serious outliers, use IQR
and median.
If comparing variation across two variables, use coefficient
of variation if the variables are in different units and/or
scales or the means are significantly different.
If the scales/units and mean are roughly the same direct
comparison of the standard deviation is fine.
Fig. 2(a). Symmetric distribution: Mean = Median = Mode
Fig. 2(b). Distribution skewed to the right: Mean > Median > Mode
Fig. 2(c). Distribution skewed to the left: Mean < Median < Mode
Editor's Notes
page 79 of text
Some students have difficulty understanding the idea of ‘within one standard deviation of the mean’. Emphasize that this means the interval from one standard deviation below the mean to one standard deviation above the mean.