Matthews Lazaro
MSc Biostatistics
DESCRIPTIVE STATISTICS
KAMUZU COLLEGE OF NURSING
Basic Definitions
 Statistics is the science that deals with the
collection, classification, analysis, interpretation
and presentation of numerical facts or data.
 Data Collection
Sources of data are many, the clinical area is one
where measurements from patients could be a data
source.
There are variables that could be measured such
as length of stay in the ward for patients, age of
patients, types of diseases or conditions, distance
travelled to the health facility etc.
Example
 The following data could be collected from under-
five children ward on the length of stay by
patients
; 2 days, Brown; 7 days Black, 0.5 days
Sample and Population Symbols
As we progress in this course there
will be different symbols that
represent the same thing. The only
difference is that one comes from a
sample and one comes from a
population.
Symbols under this topic
Sample mean: $\bar{x}$
Sample variance: $s^2$
Sample standard deviation: $s$
Population mean: $\mu$
Population variance: $\sigma^2$
Population standard deviation: $\sigma$
Classification
 Normally when data is collected, it is raw i.e. it is
not processed.
 For example the data collected on length of stay
in the under-five ward is raw data.
 One can present this data in groups called
classes e.g. 0 - 5 days, 6-10 days, 11-15 days etc
 Each class will have corresponding frequencies
 Data presented in classes and corresponding
frequencies is called frequency distribution.
Example
No. of days in ward (Class)    No. of patients (Frequency)
0-5                            1
6-10                           8
11-15                          15
16-20                          9
21-25                          5
26-30                          2
Total                          40
 This data needs to be analyzed and presented in
a form that could easily be understood by most
people who may not know the intricacies of data
analysis
Interpretation
 Data analysis and interpretation is the process of
assigning meaning to the collected information
and determining the conclusions, significance,
and implications of the findings.
 In a situation where there has been an
intervention, the purpose of the data analysis and
interpretation phase is to transform the data
collected into credible evidence about the
performance of say an intervention.
• For the frequency distribution above, analysis and
interpretation would involve measures of central
tendency such as the mean and measures of spread
such as the standard deviation.
Presenting data in diagrams
and charts
 Quantitative data is usually presented in figures
and tables
(a) Bar Chart
 Used for discrete data. The categories on the x-
axis are not linked. Table 1 shows hypothetical
colours of eyes for patients in a hospital.
Table 1: Frequency distribution of eye colour
Colour of eyes    No. of patients
Black             11
White             3
Red               14
Brown             25
Blue              5
Figure 1
(b) Pie Chart
 A pie chart (or a circle chart) is a
circular statistical graphic, which is divided
into sectors to illustrate numerical proportion.
 The Pie Chart may be used for both continuous
as well as discrete data.
Figure 2
(c) Histogram
 A Histogram is a graphical display of data using
bars of different heights. It is similar to Bar Chart
only that a Histogram is used to display
continuous data and hence the bars touch each
other.
• A histogram is a very important chart and is used
in many situations in statistics, so the details of its
construction are discussed in later sections.
Basically, a histogram looks like Figure 3.
Figure 3
Types of data
 Data refers to the information that has been
collected from an experiment or a
survey/research, or some historical record.
 Collected statistical data falls into one of two
categories, discrete data or continuous data
 Discrete data is a set of data values which
occupies only whole number values, often a
count or score
Example;
 number of patients admitted in a ward etc
• Continuous data is data that can take any value
within a range, often a measurement.
• Continuous data can therefore have whole number
as well as fractional parts.
 Examples of continuous data include;
height of a person (e.g. 1.72m; 1 is the whole
number part while 0.72 is the fractional part), baby
birth weight, distance covered in a race etc.
 Data that is collected may be presented raw or
grouped
 As an example, 100 birth weights for babies
born at a clinic in Chiradzulu were presented raw
as follows;
3.1 3.3 1.3 2.9 2.2 3.4 4.1 5.1 4.9 4.0 5.2
1.8 2.1 3.2 2.2 3.3 2.4 3.4 2.5 3.1 2.6 3.2
2.7 4.0 3.3 2.8 4.1 1.1 2.9 3.5 4.2 1.9 3.6
3.0 2.1 2.2 3.8 2.3 3.4 4.6 4.7 3.4 3.5 3.7
3.8 2.7 2.9 2.8 3.1 3.3 3.4 2.6 3.5 4.8 4.6
4.3 2.6 3.2 2.7 4.0 3.3 2.8 4.1 1.1 2.9 3.5
4.2 1.9 3.6 3.0 2.1 2.2 3.1 3.3 1.3 2.9 2.2
3.4 4.1 5.1 4.9 4.0 5.2 1.8 2.1 3.2 2.2 3.3
• We can organize this data into five classes as
shown in the table below;
Class      Frequency
1.1-2.0    9
2.1-3.0    33
3.1-4.0    38
4.1-5.0    16
5.1-6.0    4
Total      100
 Although the baby weights are presented to one
place of decimal, it is possible that some of the
weights were accurate to two places of decimal
 Suppose a baby’s weight were 3.06kg in which
class would we place that weight?
 It would not be in the class 2.1 – 3.0 because
3.06 is larger than 3.0. It would also not be in the
class 3.1 – 4.0 because 3.06 is less than 3.1
• This means that the classes above have gaps
between them, and babies whose weights fall in a
gap would go unrecorded.
• Class end-points that leave such gaps are called class limits.
• In order to eliminate the gaps between the
classes we introduce what are called Class
Boundaries.
• We first identify the gap between the classes in
the class limits.
• In the case above, the gaps are 0.1 each, i.e. from
3.0 in the second class to 3.1 in the third class, the
difference is 0.1.
• If you divide this gap by 2 and use the result to stretch
each class at both ends, you end up with class boundaries.
• For example
0.1/2 = 0.05
Then the class 2.1-3.0 will be stretched by 0.05 at each end,
resulting in
2.05-3.05
The next class will be 3.05-4.05 and so on.
Table
Class Boundaries    Frequency
1.05-2.05           9
2.05-3.05           33
3.05-4.05           38
4.05-5.05           16
5.05-6.05           4
Total               100
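As a rough illustration of the stretching procedure, the following Python sketch converts class limits into class boundaries. The function name `class_boundaries` is an illustrative choice, not something from the slides, and the sketch assumes that every gap between neighbouring classes has the same size.

```python
def class_boundaries(limits):
    """Turn (lower, upper) class limits into gap-free class boundaries."""
    # Gap between the upper limit of the first class and the lower limit of the second.
    gap = limits[1][0] - limits[0][1]
    half = gap / 2
    # Stretch each class by half the gap at both ends; round away float noise.
    return [(round(lo - half, 4), round(hi + half, 4)) for lo, hi in limits]


limits = [(1.1, 2.0), (2.1, 3.0), (3.1, 4.0), (4.1, 5.0), (5.1, 6.0)]
boundaries = class_boundaries(limits)
for (lo, hi), (b_lo, b_hi) in zip(limits, boundaries):
    print(f"{lo}-{hi}  ->  {b_lo}-{b_hi}")
```

With boundaries in place, a weight such as 3.06 kg falls inside the class 3.05-4.05 instead of in a gap.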
• The value that is at the centre of the Class
Boundaries is called the Class Mid-point such
that;
$$\text{Class Mid-point} = \frac{\text{Upper Class Boundary} + \text{Lower Class Boundary}}{2}$$
Descriptive Statistics
 Descriptive statistics are numbers or data that are
used to summarize and describe data.
• Descriptive statistics summarize a sample
in order to give an idea about the population.
• The main features of the sample are taken as
estimates of the main features of the population.
Measures of Central Tendency
 A measure of central tendency is a value used to
represent the typical or “average” value in a data
set
• There are 3 values that are considered measures
of the center.
1. Mean
2. Median
3. Mode
Measures of Central Tendency
for raw data
• Suppose you are weighing babies born at your
clinic somewhere in Malawi, and the baby weights
(in kg) of the first 10 babies were as follows:
2.7, 3.4, 3.0, 4.1, 5.2, 1.9, 2.3, 3.0, 3.3, 3.0
What single figure could represent the baby
weights at this clinic?
Let's see how different measures of central
tendency are computed.
The mode
• The mode is the data value (datum) which
appears the largest number of times in the set,
i.e. the most frequently occurring figure in the
set.
• If no data value is repeated, we say there is no
mode.
Using the following data set;
2.7kg, 3.4kg, 3.0kg, 4.1kg, 5.2kg, 1.9kg, 2.3kg,
3.0kg, 3.3kg, 3.0kg.
The mode is 3.0kg (highest frequency)
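A minimal Python sketch of finding the mode by counting, using the ten weights above. The variable names are illustrative, and the "no mode when nothing repeats" rule follows the convention stated on the slide.

```python
from collections import Counter

weights_kg = [2.7, 3.4, 3.0, 4.1, 5.2, 1.9, 2.3, 3.0, 3.3, 3.0]

counts = Counter(weights_kg)          # value -> number of occurrences
highest = max(counts.values())
# Slide convention: if no value is repeated, there is no mode.
modes = [value for value, c in counts.items() if c == highest] if highest > 1 else []
print(modes)
```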
The Median
 The median is defined as the middle figure after
the data set is ranked or placed in order of
magnitude.
Example
22, 29, 35, 24, 26, 15, 28, 36, 45, 21, 33, 5, 46, 21,
19, 41, 5, 84, 58, 63, 5, 23
Find the median.
Solution
Rank the data in ascending order
5, 5, 5, 15, 19, 21, 21, 22, 23, 24, 26, 28, 29, 33,
35, 36, 41, 45, 46, 58, 63, 84
• Then pick the two middle numbers (because the
number of values, 22, is even).
• The two middle figures are 26 and 28. The
average of these two figures is the median, i.e.
(26+28)/2 = 27 is the median.
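The ranking-and-middle-value procedure can be sketched in Python as follows. The helper name `median_of` is an assumption made for illustration; Python's standard `statistics.median` does the same job.

```python
def median_of(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


data = [22, 29, 35, 24, 26, 15, 28, 36, 45, 21, 33, 5, 46, 21,
        19, 41, 5, 84, 58, 63, 5, 23]
result = median_of(data)
print(result)
```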
The Arithmetic Mean
 The Arithmetic Mean is the sum of all data values
divided by the number of values in the data set
• The mean of a sample data set is denoted by $\bar{x}$.
• The mean of a population data set is denoted by $\mu$.
• Mean is given by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Where n is the number of observations and i runs from 1 to n
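As a small sketch of this formula in Python, using the ten weights from the example that follows (the variable names are illustrative only):

```python
weights_kg = [1.65, 3.3, 4.1, 3.0, 3.1, 2.9, 2.8, 3.2, 3.0, 3.0]

n = len(weights_kg)                 # number of observations
sample_mean = sum(weights_kg) / n   # sum of all values divided by n
print(round(sample_mean, 3))
```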
Example
Use the following data set to compute a sample mean
1.65kg, 3.3kg, 4.1kg, 3.0kg, 3.1kg, 2.9kg, 2.8kg, 3.2kg, 3.0kg, 3.0kg
$$\bar{x} = \frac{1.65 + 3.3 + 4.1 + 3 + 3.1 + 2.9 + 2.8 + 3.2 + 3 + 3}{10} = 3.005\ \text{kg}$$
Measures of Central Tendency
for grouped data
The Mode
 When data is presented in a frequency
distribution, the mode is not found by inspection.
The mode for grouped data may be found by
using two methods:
(a) Graphically
(b) Analytically (use of a formula)
Finding the Mode graphically
 Consider the weights of the 100 babies born at
Mbulumbuzi Health Centre.
Worked Example
Class Limits Class Boundaries Frequency (f)
1.10-1.50 1.05-1.55 1
1.60-2.00 1.55-2.05 10
2.10-2.50 2.05-2.55 14
2.60-3.00 2.55-3.05 21
3.10-3.50 3.05-3.55 30
3.60-4.00 3.55-4.05 13
4.10-4.50 4.05-4.55 6
4.60-5.00 4.55-5.05 3
5.10-5.50 5.05-5.55 2
Table Frequency distribution with Class Boundaries.
The class boundaries are plotted on the x – axis while on the y – axis
the class frequencies are plotted.
Figure…… weights of the babies born at Mbulumbuzi Health
Centre.
 How to determine the mode.
1st step ; identify the modal class (3.05-3.55)
2nd step; identify the frequency of the class before
and after the modal class on the chart (2.55-
3.05 and 3.55 – 4.05)
These should be identified on the chart as shown in
the subsequent figures
Figure 1.1 Figure 1.2
• In Figure 1.1, the frequency for the class before
the modal class is represented by the point A
(corner), the frequency for the modal class is
represented by the positions B and C, and the
frequency for the class after the modal class is
represented by the point D.
• Note that if the frequency of the class before the
modal class is higher than that of the class after
the modal class, the mode lies closer to the
lower class boundary of the modal class, as is
the case in Figure 1.2.
Finding the Mode analytically
$$\text{Mode} = L + \left(\frac{D_1}{D_1 + D_2}\right) \times C$$
Where;
L : is the lower class boundary of the modal class,
D1: is the frequency of the modal class minus the frequency of the class before it,
D2: is the frequency of the modal class minus the frequency of the class after it and
C : is the class width of the modal class.
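A hedged Python sketch of the analytical method, applied to the Mbulumbuzi frequency table. It takes D1 and D2 as the differences between the modal-class frequency and the frequencies of its neighbouring classes (the usual textbook form, which also agrees with the graphical note that the mode leans towards the neighbour with the higher frequency). The function name `grouped_mode` is hypothetical, and the sketch assumes a single modal class.

```python
def grouped_mode(boundaries, freqs):
    """Mode = L + D1 / (D1 + D2) * C for the class with the highest frequency."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])   # index of the modal class
    lower, upper = boundaries[m]
    before = freqs[m - 1] if m > 0 else 0
    after = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = freqs[m] - before      # modal frequency minus frequency of the class before
    d2 = freqs[m] - after       # modal frequency minus frequency of the class after
    return lower + d1 / (d1 + d2) * (upper - lower)


boundaries = [(1.05, 1.55), (1.55, 2.05), (2.05, 2.55), (2.55, 3.05), (3.05, 3.55),
              (3.55, 4.05), (4.05, 4.55), (4.55, 5.05), (5.05, 5.55)]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]
mode = grouped_mode(boundaries, freqs)
print(round(mode, 2))
```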
The Median
 Definition – the median is the value which
separates the largest 50% of data values from the
lowest 50% or the middle value after the data is
ranked.
 Just like the mode, the median may be found
using two main methods; i.e.
a. Graphically
b. Analytical (use of a formula)
 Table ……
Class Limits   Class Boundaries   Frequency (f)   "Or less" cumulative frequency   "Or more" cumulative frequency
1.10-1.50      1.05-1.55          1               0                                100
1.60-2.00      1.55-2.05          10              1                                99
2.10-2.50      2.05-2.55          14              11                               89
2.60-3.00      2.55-3.05          21              25                               75
3.10-3.50      3.05-3.55          30              46                               54
3.60-4.00      3.55-4.05          13              76                               24
4.10-4.50      4.05-4.55          6               89                               11
4.60-5.00      4.55-5.05          3               95                               5
5.10-5.50      5.05-5.55          2               98                               2
               >5.55                              100                              0
Finding the Median graphically
• We shall first look at a new frequency
distribution called the cumulative frequency
distribution. This is where the class
frequencies are cumulated from 0 to the total
frequency (∑f), or from ∑f down to 0.
How to compute the cumulative
frequencies
The "or less" cumulative frequency
• 1st step: By asking questions about the lower
class boundary as follows;
How many people had a value of 1.05 or less?
The answer is zero (0)
• 2nd step: By asking questions about the upper
class boundary as follows;
How many people had a value of 1.55 or less?
The answer is one (1) which is the frequency
for the class 1.05 – 1.55
3rd step: Next is how many people had values of
2.05 or less?
Answer is 11 which is the 10 in the class 1.55 –
2.05 and the 1 in the class 1.05 – 1.55.
Continue in the same way for the remaining class boundaries.
The “or more” cumulative frequency distribution is
found in a similar manner.
The cumulative frequency distribution is used to
plot a chart called the Ogive or the Cumulative
Frequency Curve.
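The running totals described above can be sketched in Python with `itertools.accumulate`. The list names are illustrative; each cumulative value is paired with a class boundary, starting from the lowest boundary 1.05.

```python
from itertools import accumulate

boundary_points = [1.05, 1.55, 2.05, 2.55, 3.05, 3.55, 4.05, 4.55, 5.05, 5.55]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

# "Or less": how many values lie at or below each boundary (0 at the lowest boundary).
or_less = [0] + list(accumulate(freqs))
total = or_less[-1]
# "Or more": how many values lie at or above each boundary.
or_more = [total - c for c in or_less]

for point, less, more in zip(boundary_points, or_less, or_more):
    print(f"{point:.2f}  or less: {less:3d}  or more: {more:3d}")
```

Plotting `boundary_points` against `or_less` gives the points of the "or less" Ogive.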
Figure ……
 In this case there were 100 babies, so the value
of the 50th baby can be read on the x-axis which
is the Median.
Finding the Median analytically
• The Median is found by;
$$\text{Median} = L + \left(\frac{\frac{N}{2} - Cf_b}{f}\right) \times C$$
Where;
L : is the lower class boundary of the median class,
N : is the total frequency,
f : is the frequency of the median class
Cfb : is the cumulative frequency of the class before the median class and
C : is the class width of the median class.
…
• The Median class is the class in which the
median will be found.
• It is the class in which the half-way member
(the N/2-th value) lies.
• It can be found by using the cumulative
frequencies to identify where the half-way
member is.
The arithmetic mean
 For grouped data, the arithmetic mean has to
take into consideration the frequencies as well as
the class size.
 For each class, the value that represents the
class is the class midpoint.
 This value will be the one which now will have the
stated frequency
Table ……
Class Limits   Class Boundaries   Midpoint (x)   Frequency (f)   fx
1.10-1.50      1.05-1.55          1.3            1               1.3
1.60-2.00      1.55-2.05          1.8            10              18
2.10-2.50      2.05-2.55          2.3            14              32.2
2.60-3.00      2.55-3.05          2.8            21              58.8
3.10-3.50      3.05-3.55          3.3            30              99
3.60-4.00      3.55-4.05          3.8            13              49.4
4.10-4.50      4.05-4.55          4.3            6               25.8
4.60-5.00      4.55-5.05          4.8            3               14.4
5.10-5.50      5.05-5.55          5.3            2               10.6
                                                 Σf = 100        Σfx = 309.5
• Class midpoint (x) = (Upper class boundary +
Lower class boundary)/2
• The total (sum of values) is obtained by adding up
the fx column.
• The mean for grouped data is obtained by
dividing this total by the sum of frequencies.
• Arithmetic mean for grouped data is given by
$$\bar{x} = \frac{\sum fx}{\sum f}$$
• For the data above,
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{309.5}{100} = 3.095$$
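The same calculation as a short Python sketch, building the midpoints from the class boundaries (variable names are illustrative):

```python
boundaries = [(1.05, 1.55), (1.55, 2.05), (2.05, 2.55), (2.55, 3.05), (3.05, 3.55),
              (3.55, 4.05), (4.05, 4.55), (4.55, 5.05), (5.05, 5.55)]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

midpoints = [(lower + upper) / 2 for lower, upper in boundaries]
sum_f = sum(freqs)                                         # total frequency
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))      # total of the fx column
grouped_mean = sum_fx / sum_f
print(round(sum_fx, 1), round(grouped_mean, 3))
```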
Measures of Dispersion
Dispersion
The measure of the spread or
variability
No Variability – No Dispersion
Measures of Variation
There are 2 values used to
measure the amount of
dispersion or variation. (The
spread of the group)
1. Range
2. Standard Deviation
Why is it Important?
You want to choose the best brand of medicine for your
patients. You are interested in how long the drugs take to cure
a disease. The choices are narrowed down to 2 different
drugs. The results are shown in the chart. Which drug would
you choose?
The chart indicates the number of days a drug takes
to cure a particular disease.
         Drug A    Drug B
         10        35
         60        45
         50        30
         30        35
         40        40
         20        25
Total    210       210
Does the Average Help?
Drug A: Avg = 210/6 = 35 days
Drug B: Avg = 210/6 = 35 days
Both drugs take an average of 35 days to cure the
disease, so the average is no help in deciding
which to buy.
Consider the Spread
Drug A: Spread = 60 – 10 = 50
days
Drug B: Spread = 45 – 25 = 20
days
Drug B has a smaller variability
which means that it performs more
consistently. Choose drug B.
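The comparison can be reproduced with a few lines of Python (a sketch only; the dictionary layout is an illustrative choice):

```python
from statistics import mean

days_to_cure = {
    "Drug A": [10, 60, 50, 30, 40, 20],
    "Drug B": [35, 45, 30, 35, 40, 25],
}

summary = {}
for drug, days in days_to_cure.items():
    # Same average for both drugs, but very different spread (range).
    summary[drug] = {"mean": mean(days), "range": max(days) - min(days)}
    print(drug, summary[drug])
```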
Range
The range is the difference
between the lowest value in
the set and the highest value
in the set.
Range = High # - Low #
Example
Find the range of the data set.
40, 30, 15, 2, 100, 37, 24, 99
Range = 100 – 2 = 98
Deviation from the Mean
• A deviation from the mean, $x - \bar{x}$, is
the difference between the value of x and
the mean $\bar{x}$.
We base our formulas for variance and
standard deviation on the amount by which
the data values deviate from the mean.
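Formulae for sample and population variances
Sample variance
Definition/Computation formula:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Machine formula:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
Population variance
Definition/Computation formula:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Machine formula:
$$\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}$$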
Standard Deviation
The standard deviation is the
square root of the variance.
$$s = \sqrt{s^2}$$
Example – Using Formula
Find the variance of the
following dataset 6, 3, 8, 5, 3
(in hours)
x          x²
6          36
3          9
8          64
5          25
3          9
Σx = 25    Σx² = 143
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{143 - \frac{25^2}{5}}{4} = \frac{143 - 125}{4} = \frac{18}{4} = 4.5$$
Find the standard deviation
The standard deviation is the
square root of the variance.
$$s = \sqrt{4.5} \approx 2.12$$
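As a cross-check, this Python sketch computes the sample variance of the same five values with both the definition formula and the machine formula, then takes the square root. Variable names are illustrative.

```python
from math import sqrt

hours = [6, 3, 8, 5, 3]
n = len(hours)
mean = sum(hours) / n

# Definition formula: squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in hours) / (n - 1)

# Machine formula: uses only the sum of x and the sum of x squared.
sum_x = sum(hours)
sum_x2 = sum(x * x for x in hours)
var_machine = (sum_x2 - sum_x ** 2 / n) / (n - 1)

std_dev = sqrt(var_machine)
print(var_definition, var_machine, round(std_dev, 2))
```

Both formulas give the same variance; the machine formula avoids computing each deviation separately.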
Standard deviation for grouped
data
• For grouped data, the standard deviation has to
take into account the class frequencies, the class
width as well as the value of the mean.
• The mean, $\bar{x}$, is calculated as stated earlier;
$$\bar{x} = \frac{\sum fx}{\sum f}$$
Worked Example
Compute the standard deviation of the grouped data presented in the table below.
Class Limits    Frequency (f)
1.10-1.50       1
1.60-2.00       10
2.10-2.50       14
2.60-3.00       21
3.10-3.50       30
3.60-4.00       13
4.10-4.50       6
4.60-5.00       3
5.10-5.50       2
Total           100
Table: Frequency distribution.
Worked Example
Class Limits   Class Boundaries   Class Midpoint (x)   Frequency (f)
1.10-1.50      1.05-1.55          1.3                  1
1.60-2.00      1.55-2.05          1.8                  10
2.10-2.50      2.05-2.55          2.3                  14
2.60-3.00      2.55-3.05          2.8                  21
3.10-3.50      3.05-3.55          3.3                  30
3.60-4.00      3.55-4.05          3.8                  13
4.10-4.50      4.05-4.55          4.3                  6
4.60-5.00      4.55-5.05          4.8                  3
5.10-5.50      5.05-5.55          5.3                  2
                                                       Σf = 100
Table Frequency distribution with Class Boundaries.
For each class, the value that represents the class is the class midpoint.
Worked Example
Class Limits   Class Boundaries   Class Midpoint (x)   Frequency (f)   Deviation (x − x̄)
1.10-1.50      1.05-1.55          1.3                  1               -1.795
1.60-2.00      1.55-2.05          1.8                  10              -1.295
2.10-2.50      2.05-2.55          2.3                  14              -0.795
2.60-3.00      2.55-3.05          2.8                  21              -0.295
3.10-3.50      3.05-3.55          3.3                  30              0.205
3.60-4.00      3.55-4.05          3.8                  13              0.705
4.10-4.50      4.05-4.55          4.3                  6               1.205
4.60-5.00      4.55-5.05          4.8                  3               1.705
5.10-5.50      5.05-5.55          5.3                  2               2.205
                                                       Σf = 100
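The midpoints and frequencies in this table are all that is needed for the grouped variance and standard deviation worked out in the slides that follow. As a hedged sketch for checking the hand arithmetic, the Python below computes the variance in two ways: from the frequency-weighted squared deviations, and from the shortcut Σfx²/Σf − x̄². It divides by Σf, as this worked example does; the variable names are illustrative.

```python
from math import sqrt

midpoints = [1.3, 1.8, 2.3, 2.8, 3.3, 3.8, 4.3, 4.8, 5.3]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

sum_f = sum(freqs)
mean = sum(f * x for x, f in zip(midpoints, freqs)) / sum_f

# Frequency-weighted squared deviations from the mean, divided by the total frequency.
sum_sq_dev = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
var_definition = sum_sq_dev / sum_f

# Shortcut form: mean of the squares minus the square of the mean.
var_shortcut = sum(f * x ** 2 for x, f in zip(midpoints, freqs)) / sum_f - mean ** 2

std_dev = sqrt(var_definition)
print(round(sum_sq_dev, 4), round(var_definition, 6), round(std_dev, 4))
```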
From the table above,
there is a need to get
$$\sum f(x - \bar{x})^2$$
Then the variance can be computed as below
$$\sigma^2 = \frac{\sum f(x - \bar{x})^2}{\sum f} = \frac{65.5475}{100} = 0.655475$$
The standard deviation can be computed as below
$$\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}} = \sqrt{0.655475} \approx 0.8096$$
The better formula for computation is;
$$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$$
Interquartile Range
• The interquartile range tells you the
spread of the middle half of your
distribution.
• Quartiles segment any distribution
that’s ordered from low to high into
four equal parts.
• The interquartile range (IQR) contains
the second and third quartiles, or the
middle half of your data set.
Remember the range gives you the
spread of the whole data set, the
interquartile range gives you the range
of the middle half of a data set
Calculation of IQR
The interquartile range is found by
subtracting the Q1 value from the Q3
value.
Formula
IQR = Q3 − Q1
Explanation
IQR = interquartile range
Q3 = 3rd quartile or 75th percentile
Q1 = 1st quartile or 25th percentile
 Q1 is the value below which 25 percent of the
distribution lies, while Q3 is the value below which
75 percent of the distribution lies.
 You can think of Q1 as the median of the first half
and Q3 as the median of the second half of the
distribution.
Methods for finding the
interquartile range
 Although there’s only one formula, there are
various different methods for identifying the
quartiles. You’ll get a different value for the
interquartile range depending on the method you use.
 Here, we will discuss two of the most commonly
used methods. These methods differ based on how
they use the median.
Exclusive method vs inclusive
method
 The exclusive method excludes the median
when identifying Q1 and Q3,
 the inclusive method includes the median in
identifying the quartiles.
Remember!
 The procedure for finding the median is
different depending on whether your data set
is odd- or even-numbered.
When you have an odd number of data
points, the median is the value in the middle
of your data set. You can choose between the
inclusive and exclusive method.
With an even number of data points, there
are two values in the middle, so the median is
their mean. It’s more common to use the
exclusive method in this case.
There is little consensus on the best
method for finding the interquartile
range. The exclusive interquartile range
is typically larger than the inclusive
interquartile range.
The exclusive interquartile range may
be more appropriate for large samples,
while for small samples, the inclusive
interquartile range may be more
representative because it's a narrower
range.
Steps for the exclusive method
 Even-numbered data set (n=10)
Step 1: Order your values from
low to high.
Step 2: Locate the median, and
then separate the values below it
from the values above it.
Step 3: Find Q1 and Q3.
Q1 is the median of the first half and
Q3 is the median of the second half.
Since each of these halves has an
odd number of values, there is only
one value in the middle of each half.
Step 4: Calculate the
interquartile range.
Odd-numbered data set (n=11)
Step 1: Order your values
from low to high.
Step 2: Locate the median, and
then separate the values below it
from the values above it.
Step 3: Find Q1 and Q3.
Step 4: Calculate the
interquartile range.
Steps for the inclusive method
Almost all of the steps for the
inclusive and exclusive methods are
identical. The difference is in how
the data set is separated into two
halves.
The inclusive method is sometimes
preferred for odd-numbered data
sets because it doesn't ignore the
median.
Odd-numbered data set (n=11)
Step 1: Order your values from
low to high.
Step 2: Find the median.
Step 3: Separate the list into two
halves, and include the median in
both halves.
Step 4: Find Q1 and Q3.
Step 5: Calculate the
interquartile range.
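The exclusive and inclusive steps can be sketched in Python as below. The function names and the two small data sets are hypothetical (the slide's own example values are not reproduced in this text). For an even-numbered data set the median is not one of the data values, so this sketch splits the list the same way under both methods.

```python
from statistics import median


def quartiles(values, method="exclusive"):
    """Q1 and Q3 as the medians of the lower and upper halves of the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    if n % 2 == 0:
        lower, upper = ranked[:half], ranked[half:]
    elif method == "inclusive":
        # Odd n, inclusive: the median value belongs to both halves.
        lower, upper = ranked[:half + 1], ranked[half:]
    else:
        # Odd n, exclusive: the median value is left out of both halves.
        lower, upper = ranked[:half], ranked[half + 1:]
    return median(lower), median(upper)


def iqr(values, method="exclusive"):
    q1, q3 = quartiles(values, method)
    return q3 - q1


odd_data = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 19]     # n = 11 (hypothetical)
even_data = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]        # n = 10 (hypothetical)
print(iqr(even_data), iqr(odd_data, "exclusive"), iqr(odd_data, "inclusive"))
```

On this odd-numbered data set the exclusive range comes out wider than the inclusive one, which matches the remark that the inclusive method gives a narrower range.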
When is the interquartile range
useful?
 The interquartile range is an especially useful
measure of variability for skewed distributions.
 For these distributions, the median is the best
measure of central tendency because it’s the
value exactly in the middle when all values
are ordered from low to high.
 The IQR is also useful for datasets
with outliers. Because it’s based on the middle
half of the distribution, it’s less influenced by
extreme values.
Visualize the interquartile
range in boxplots
A boxplot, or a box-and-whisker
plot, summarizes a data set visually
using a five-number summary.
Every distribution can be organized
using these five numbers:
Lowest value
Q1: 25th percentile
Median
Q3: 75th percentile
Highest value (Q4)
The vertical lines in the box show
Q1, the median, and Q3, while the
whiskers at the ends show the
highest and lowest values.
In a boxplot, the width of the box
shows you the interquartile range. A
smaller width means you have less
dispersion, while a larger width
means you have more dispersion
An inclusive interquartile range will
have a smaller width than an
exclusive interquartile range.
Boxplots are especially useful for
showing the central tendency and
dispersion of skewed distributions.
The placement of the box tells you
the direction of the skew.
A box that’s much closer to the right
side means you have a negatively
skewed distribution.
A box closer to the left side tells you
that you have a positively skewed
distribution.
PROBABILITY DISTRIBUTIONS
Introduction to Probability
 A Probability Experiment is a process which
leads to well-defined results called outcomes.
 For example, the toss of a coin is a probability
experiment because it leads to results called
outcomes such as “Heads” and “Tails”.
• There are many such probability experiments, such
as the toss of two coins, the roll of a die etc
 The set of all possible outcomes from these
probability experiments and others is called a
Sample Space
For example
 If a coin is tossed, the sample space is {H,T}
 If flipping two coins, the sample space is {HH, HT,
TH, TT}
Event
 is one or more outcomes of a probability
experiment
 Getting a “Head” in a toss of a coin is an event.
 Getting “Heads” on both tosses of two coins is an
event
 Probability is defined as the likelihood of an
event happening.
• The probability of an event E, denoted P(E), is a
measure of how likely that event is to happen.
• This measure is usually numerical.
 The value of the probability of any event is always
between zero and one inclusive
Two main approaches to
probability
1. The Classical Approach
2. Empirical Approach
Classical approach/definition to Probability
 The Classical definition of the probability of
the event E is defined as the number of ways
or times the event E occurs divided by the
number of all possible outcomes including the
event E.
Mathematically, this can be expressed as follows:
$$P(E) = \frac{\text{The number of times or ways in which the event E occurs}}{\text{The total number of all possible outcomes including the event E}}$$
Example
• If a doctor sees 10 patients with malaria, 5
patients with diarrhoea, 15 patients with respiratory
problems and 20 patients with skin diseases, he
will have seen 50 patients on the day. If he needs
to interview, at random, one of the patients seen
on the day to give him an indication of how his
service was, what is the probability that the
patient to be interviewed will have skin diseases?
$$P(\text{Skin Disease}) = \frac{\text{The number of patients with skin disease}}{\text{The total number of all patients}} = \frac{20}{50} = 0.4$$
Empirical approach
 Empirical probability is based on past
observations.
 The empirical probability of an event is the
relative frequency of a frequency distribution
based upon past observations.
 The definition of the empirical probability of any
event E is the number of times the event E
occurred in the past divided by the total number
of times the experiment was carried out
• Mathematically,
$$P(E) = \frac{\text{The number of times or ways in which the event E occurred}}{\text{The total number of times the experiment was carried out}}$$
Limiting values of probability
 When the probability of an event is zero (0), the
event is said to be an absolute impossibility i.e.
there is absolutely no way the event can happen
 When the probability of an event is one (1), the
event is said to be an absolute certainty
$$0 \le P(E) \le 1$$
 Class to suggest events in life whose probability
is zero
 Class to suggest events in life whose probability
is one.
Counting Rules
 1 Factorials
Definition: n! = n x (n − 1) x ... x 2 x 1. For example,
4! = 4 x 3 x 2 x 1 = 24 and 7! = 7 x 6 x 5 x 4 x 3 x 2 x 1 = 5040
 2. PERMUTATION RULES
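Definition:
$$^nP_r = \frac{n!}{(n - r)!}$$
Example
$$^5P_3 = \frac{5!}{(5 - 3)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{2 \times 1} = 5 \times 4 \times 3 = 60$$
$$^8P_5 = \frac{8!}{(8 - 5)!} = \frac{8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 8 \times 7 \times 6 \times 5 \times 4 = 6720$$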
Combination
• Definition:
$$^nC_r = \frac{n!}{(n - r)!\, r!}$$
Example
$$^5C_2 = \frac{5!}{(5 - 2)!\, 2!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3!\, 2!} = 10$$
$$^{10}C_6 = \frac{10!}{(10 - 6)!\, 6!} = \frac{10 \times 9 \times 8 \times 7 \times 6!}{4!\, 6!} = 210$$
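The counting rules can be checked with Python's standard `math` module (`math.perm` and `math.comb` require Python 3.8 or later). The helper functions `n_p_r` and `n_c_r` are illustrative names that spell out the factorial definitions.

```python
from math import comb, factorial, perm


def n_p_r(n, r):
    """Permutations from the definition n! / (n - r)!."""
    return factorial(n) // factorial(n - r)


def n_c_r(n, r):
    """Combinations from the definition n! / ((n - r)! r!)."""
    return factorial(n) // (factorial(n - r) * factorial(r))


print(factorial(4), factorial(7))
print(n_p_r(5, 3), n_p_r(8, 5))
print(n_c_r(5, 2), n_c_r(10, 6))
```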
Probability Laws
 Consider a bag containing coloured marbles; 10
black, 5 red, 5 blue and 3 yellow, then the
probability of picking a green marble from this bag
is 0 because there are no green marbles in the
bag. What is the probability of picking?
A black marble?
A yellow marble?
A marble that is not black?
Let's compute the probabilities
$$P(\text{black}) = \frac{\text{Number of times or ways a black marble appears}}{\text{Total number of marbles}} = \frac{10}{23}$$
$$P(\text{Not black}) = \frac{\text{Number of times or ways a non-black marble appears}}{\text{Total number of marbles}} = \frac{13}{23}$$
• The above results indicate that P(black)=10/23
and P(not black)=13/23 are complementary: they
add up to 1, i.e. P(Black) + P(not Black) = 1. This
illustrates that the sum of all probabilities in the
sample space is 1, and it gives the basic rule
of probability that the probability of an
event occurring plus the probability of the event
not occurring is equal to 1.
P(E) + P(not E) = 1
The Addition law of probabilities
• A pack of cards has 52 cards (excluding the
Jokers). The cards are in two basic colours, black
and red. Of the 52 cards, half (26 cards) are red
while the other half are black. The picture above
shows the 52 cards.
• Flowers (Clubs) (13) and Spades (13) are black as shown
above while Hearts (13) and Diamonds (13) are
red. Each suit has an Ace (the cards on
the far left).
• The probability of picking a Heart is
$$P(\text{heart}) = \frac{13}{52} = P(\text{diamond}) = P(\text{spade}) = P(\text{flower})$$
$$P(\text{Red Card}) = \frac{\text{Number of red cards}}{\text{Total number of cards in a pack}} = \frac{26}{52}$$
Note that the event "Red Card" is a compound event i.e. it contains other
events. The event "Red Card" is actually the event "Hearts" or "Diamonds"
i.e.
$$P(\text{Red Card}) = P(\text{Hearts or Diamonds}) = \frac{26}{52} = \frac{13}{52} + \frac{13}{52}$$
 This observation is actually true for any two
events which are mutually exclusive.
 Events are said to be mutually exclusive if they
both cannot happen at the same time.
 If two events are mutually exclusive, then the
probability of either event occurring is the sum of
the probabilities of each occurring.
 This is called the Addition Law of probabilities
for mutually exclusive events.
• In general therefore, if two events A and B are
mutually exclusive, then the probability of event A
or B happening is the sum of the individual
probabilities i.e.
$$P(A \text{ or } B) = P(A) + P(B)$$
Example 2.
What is the probability of picking a Spade or a Heart from a pack of cards?
Example 3
What is the probability of picking a Spade or an Ace from a pack of
cards?
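Examples 2 and 3 can be checked by enumerating a 52-card pack, as in the Python sketch below. The deck representation and helper names are assumptions made for illustration, and the suit the slides call "Flowers" is labelled "Clubs" here. Exact fractions are used so the results can be compared with the addition law, including its general form P(A or B) = P(A) + P(B) − P(A and B) for events that can happen together.

```python
from fractions import Fraction
from itertools import product

suits = ["Hearts", "Diamonds", "Spades", "Clubs"]   # "Clubs" = "Flowers" in the slides
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = list(product(ranks, suits))                  # 52 (rank, suit) pairs


def prob(event):
    """Classical probability: favourable cards divided by all cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_spade(card):
    return card[1] == "Spades"


def is_heart(card):
    return card[1] == "Hearts"


def is_ace(card):
    return card[0] == "A"


# Example 2: mutually exclusive events, so the probabilities simply add.
p_spade_or_heart = prob(lambda c: is_spade(c) or is_heart(c))
# Example 3: not mutually exclusive, the Ace of Spades must not be counted twice.
p_spade_or_ace = prob(lambda c: is_spade(c) or is_ace(c))
p_spade_and_ace = prob(lambda c: is_spade(c) and is_ace(c))

print(p_spade_or_heart, p_spade_or_ace)
```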
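• Note that the two events "Spade" and "Ace" are
not mutually exclusive because they both can
happen at the same time, i.e. there is a card that
is both an Ace and a Spade. The card is the Ace
of Spades. For events that are not mutually exclusive,
the probability of both happening is subtracted so
that it is not counted twice:
$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$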
The Multiplication law of
probabilities
 Consider the toss of a coin. The probability of
getting a “Head” when a coin is tossed is 0.5.
 Suppose one wants to have two tosses. Is there a
difference in outcomes if one person tosses twice
compared to two people tossing once? Why?
 The discussion will have shown that tossing a
coin twice by the same person is the same as two
people tossing a coin once each.
 The reason is that, as far as outcomes are
concerned, the result of the first toss is
independent of the result of the second toss
when one person tosses a coin twice.
 In general, events are said to be independent if
the occurrence of one event does not affect the
occurrence of the other in any way or two events
are independent if the occurrence of one does not
change the probability of the other occurring.
 Consider the toss of two coins; what is the
probability of getting “Heads” on both tosses?
COIN A COIN B
H H
H T
T H
T T
 There are 4 possible outcomes when two coins
are tossed (HH, HT, TH and TT).
• Out of the four possible outcomes, only one has
Heads (H) on both Coin A and Coin B.
• The probability of getting "Heads" on both coins
when two coins are tossed is
• P(Head and Head) = 1/4, but 1/4 = 1/2 x 1/2
• P(Head and Head) = 1/2 x 1/2
• In general, if events A and B are independent, the
probability of event A and event B happening is
given by;
• P(A and B) = P(A) x P(B)
There are 20 marbles in the bag, of which 6 are red, 2 are blue and
the rest are white. What is the probability of picking a white marble from
the bag?
$$P(E) = \frac{\text{The number of times white marbles occur in the bag}}{\text{The total number of all possible outcomes including the white marbles}} = \frac{12}{20}$$
• If two marbles were to be picked from the bag
with replacement, what is the probability that
both marbles would be white?
• The answer to this question is from the
multiplication law of probabilities, i.e. P(A and B) =
P(A) x P(B), which gives 12/20 x 12/20.
• However, if the marbles are picked without
replacement the situation would be different.
• The probability of picking a white marble the first
time would remain 12/20 because the number of
white marbles is 12 and the total number of
marbles in the bag is 20.
• When the first marble is picked and then not put
back in the bag, the total number of marbles in
the bag reduces to 19.
• The probability of picking a white marble the second
time depends on the colour of the marble
that has been taken out of the bag.
• If the marble taken out of the bag from the first
pick is white, then the probability of a white
marble the second time around is 11/19.
• If the marble taken out of the bag from the first
pick is not white, then the probability of a white
marble the second time around is 12/19.
• This therefore means that there are two possible
values for the probability of a white marble on the
second pick without replacement, and only the first
leads to two white marbles;
• P(White and White) = 12/20 x 11/19, since the second
pick has probability 11/19 when the marble in the
first pick was white.
• If the marble in the first pick was not white, the
second pick is white with probability 12/19, so
P(not White and White) = 8/20 x 12/19.
 In other words, the probability of picking a white
marble the second time is dependent on the
result of the first pick.
 We say that the probability of picking a white
marble the second time is conditional on the
result of the first experiment.
 In general, if two events A and B are not
independent, i.e. the occurrence of one event
does affect the probability of the other occurring
(the events are dependent) the probability of both
events happening is given by;
 P(A and B)=P(A) x P(B|A)
 The probability of event B occurring given that
event A has already occurred is read "the
probability of B given A" and is written P(B|A). This
is the conditional probability of B given that
event A has already occurred and we have the
result of that experiment.
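The two multiplication laws can be checked on the marble example with a short Python sketch. The variable names are illustrative assumptions; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

WHITE, TOTAL = 12, 20  # counts taken from the marble example

# With replacement the picks are independent: P(A and B) = P(A) x P(B).
p_with = Fraction(WHITE, TOTAL) * Fraction(WHITE, TOTAL)

# Without replacement the second pick is conditional on the first:
# P(A and B) = P(A) x P(B|A), with one white marble already removed.
p_without = Fraction(WHITE, TOTAL) * Fraction(WHITE - 1, TOTAL - 1)

print(p_with, float(p_with))        # 9/25, i.e. 0.36
print(p_without, float(p_without))  # 33/95, about 0.347
```

Removing one white marble lowers the probability of two whites from 0.36 to about 0.347.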
Probability tree diagrams
 Calculating probabilities can sometimes be
confusing. It may not be easy to tell when to use
the addition law, the multiplication law or a
combination of these.
 The probability tree diagram is a tool that can be
used to simplify otherwise complex looking
probability problems.
 A tree diagram is simply a way of representing a
sequence of events which are a set of
combinations of all possible outcomes from a
situation.
 A tree diagram helps us to see all possible
outcomes of an event at a glance and simplifies
the calculation of their probabilities.
Example
 A hospital procurement department advertised for
three contracts for the supply of gloves worth
hundreds of thousands (Contract A), laboratory
equipment worth millions (Contract B) and a
dialysis machine worth tens of millions (Contract
C). A supply company bids for the three contracts.
The probability of getting contract A is 0.85. The
probability of getting contract B depends on
whether they get contract A or not. The probability
of getting contract B if they get A is 0.9 but only
0.2 if they fail to get contract A. The probability of
getting contract C depends on whether they get
contract B.
It is 0.95 if they get B but only 0.1 if they fail to get
contract B after getting contract A. If they fail to get
A and get B, the probability of getting contract C is
0.6. If they fail to get A and fail to get B, they are not
allowed to bid for contract C.
 Draw a tree diagram to illustrate the probabilities
of the outcomes. What is the probability of:
 Getting all three contracts
 Getting two contracts only
 Getting only one contract
 Getting no more than two contracts
 Getting at least two contracts
 Getting at most two contracts
 Getting no contract at all
 Getting contract B but not contract C
 Getting contract C but not contract
 Getting contract A but not the other contracts
Probability Distributions
 The right way is to start by introducing probability
density functions and that will lead to aspects of
calculus such as integrals which would put off
many.
 I will introduce probability distributions as the
distribution or break up of the total probability of 1
into several possible events or outcomes.
 As an example, the tree diagram has several
branches which are events or outcomes. The total
of probabilities from all branches is 1 but is
distributed into several events
 The total probability=
0.72675+0.03825+0.0085+0.0765+0.018+0.012+
0.12 = 1
 The total probability of 1 is distributed into seven
different events or outcomes. The seven events
are;
(i) get A, get B and get C
(ii) get A, get B and fail to get C
(iii) get A, fail to get B, get C
(iv) get A, fail to get B, fail to get C
(v) fail to get A, get B and get C
(vi) fail to get A, get B and fail to get C
(vii) fail to get A and fail to get B
 The probabilities of each of the seven events are
presented on the ends of each branch of the tree
diagram above.
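As a rough check on the seven branch probabilities, the tree for the contracts example can be enumerated in Python. This is a sketch under the probabilities stated in the example; the dictionary layout and names such as `branches` and `contracts_won` are illustrative assumptions.

```python
from fractions import Fraction as F

p_a = F(85, 100)                                 # P(get A)
p_b_given = {True: F(9, 10), False: F(2, 10)}    # P(get B | result for A)
p_c_given = {                                    # P(get C | results for A and B)
    (True, True): F(95, 100),
    (True, False): F(1, 10),
    (False, True): F(6, 10),
}  # no entry for (False, False): the company may not bid for C

branches = {}
for a in (True, False):
    pa = p_a if a else 1 - p_a
    for b in (True, False):
        pb = p_b_given[a] if b else 1 - p_b_given[a]
        if (a, b) not in p_c_given:
            branches[(a, b, None)] = pa * pb     # branch ends here
            continue
        for c in (True, False):
            pc = p_c_given[(a, b)] if c else 1 - p_c_given[(a, b)]
            branches[(a, b, c)] = pa * pb * pc

def contracts_won(branch):
    """Number of contracts obtained along one branch."""
    return sum(1 for got in branch if got)

total = sum(branches.values())
p_all_three = branches[(True, True, True)]
p_exactly_two = sum(p for k, p in branches.items() if contracts_won(k) == 2)
p_none = branches[(False, False, None)]

print(len(branches), float(total))   # 7 branches, total probability 1.0
print(float(p_all_three), float(p_exactly_two), float(p_none))
```

Each question in the example then becomes a sum over the branches that satisfy the stated condition.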
 A listing of all the values a random variable can
assume with their corresponding probabilities
make a probability distribution. For example, the
toss of a coin:
Expected outcome (X)    Head    Tail    Total
Probability P(X)        1/2     1/2     1
The total probability is 1.
 In many other situations, the total probability will
have to be distributed into several events or
outcomes (leading to fractions which will
eventually have to add up to 1).
 This is basically the whole concept of probability
distributions.
 A random variable does not mean that the
values can be anything (a random number)
 Random variables have a well defined set of
outcomes and well defined probabilities for the
occurrence of each outcome.
 For example, if you toss a coin, the known
outcomes are Heads and Tails and the probability
of each is 0.5. Before the coin is tossed, however,
the outcome is not known; it can be either one,
hence the term random.
 Similarly, when a die is rolled, the known
outcomes are 1, 2, 3, 4, 5 and 6; the probabilities
of each event are also known to be 1/6 but when
the die is being rolled, any outcome can appear.
 The random refers to the fact that the outcomes
happen by chance -- that is, you don't know which
outcome will occur next.
 Here's an example of a probability distribution
that results from the rolling of a single fair die.
X 1 2 3 4 5 6 sum
P(x) 1/6 1/6 1/6 1/6 1/6 1/6 6/6=1
The Binomial Probability
Distribution
 The binomial distribution is one of the discrete
probability distributions. It is discrete because the
outcomes of binomial experiments are counts, which
are whole numbers rather than fractions.
 Binomial experiments give probabilities for whole
numbers of items, not fractional ones.
Binomial Experiment
 A binomial experiment has the following;
1. A fixed number of trials
2. Each trial is independent of the others
3. There are only two outcomes
4. The probability of each outcome remains
constant from trial to trial.
 These can be summarized as: An
experiment with a fixed number of
independent trials, each of which can only
have two possible outcomes.
Examples of Binomial
Experiments
 Tossing a coin 6 times to see how many tails
occur.
There is a fixed number of tosses, i.e. 6.
Each toss has two possible outcomes. Each toss is
independent of the other and results of each toss
do not affect the results of the other tosses. The
probability of getting a Head or Tail is the same
throughout the 6 tosses.
 Asking 20 people if they watch Television Malawi
(TVM).
You ask a fixed number of people, i.e. 20. There are
two possible outcomes: either they watch or they
do not.
 Rolling a die 5 times to see if a 5 appears.
 The outcomes from tossing of coins can be
deliberately arranged in a triangular pattern.
 The pattern gives us a clue as to how we can
have outcomes for 5 coins and more. Observe
the coefficients of the outcomes. We shall isolate
them and present them as follows, up to the
tossing of 4 coins:
No. of coins    Coefficients
1               1 1
2               1 2 1
3               1 3 3 1
4               1 4 6 4 1
 You will observe that each coefficient is the sum
of two coefficients above it! Such that for 5 coins,
we can come up with the coefficients as follows;
1 5 10 10 5 1
For 6 coins, the coefficients will be;
1 6 15 20 15 6 1 etc.
 The outcomes start with all successes on the left,
reduce by one every step and end with all failures
on the right.
 The other observation is that the number of coins
is the second coefficient.
 The other thing to note is that the coefficients are
symmetrical, whatever is on the left is the same
on the right.
 This triangle is called Pascal’s triangle in honour
of Blaise Pascal, the French mathematician who
studied it in detail; the pattern itself was known
to earlier mathematicians.
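The rule that each coefficient is the sum of the two above it can be sketched in Python as follows. The function name `pascal_row` is an illustrative assumption.

```python
def pascal_row(n):
    """Coefficients for n coins, built by adding adjacent entries of the row above."""
    row = [1]
    for _ in range(n):
        # Interior entries are sums of neighbouring pairs; the ends are always 1.
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row

for coins in range(1, 7):
    print(coins, pascal_row(coins))
```

The rows for 5 and 6 coins reproduce the coefficients listed above, and each row is symmetrical with the number of coins as its second entry.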
 If the probability of success is denoted p and the
probability of failure q then the outcomes may be
presented in terms of probabilities as follows;
No. of coins    Probabilities
1               p, q
2               p^2, 2pq, q^2
3               p^3, 3p^2q, 3pq^2, q^3
4               p^4, 4p^3q, 6p^2q^2, 4pq^3, q^4
 For each experiment (number of coins), the total probability is
always equal to 1, i.e.
p + q = 1
p^2 + 2pq + q^2 = 1
p^3 + 3p^2q + 3pq^2 + q^3 = 1
p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4 = 1, etc.
 From your mathematics in secondary school, you
will recall the expansion of binomials such as
(x+y)^2, (a+b)^3 etc.
 If you expand (a+b), (a+b)^2, (a+b)^3, (a+b)^4, ..., the
coefficients of the terms are exactly the same as
the ones in Pascal’s triangle such that we can use
this property for probabilities i.e., for any n
binomial trials or experiments whereby the
probability of success is p and the probability of
failure is q, the probability distribution of the n
experiments is given by:
$$(p+q)^n = p^n + n p^{n-1} q + \frac{n(n-1)}{2!} p^{n-2} q^2 + \frac{n(n-1)(n-2)}{3!} p^{n-3} q^3 + \dots + q^n$$
Example:
What is the probability of rolling exactly two sixes in 6 rolls of a die?
There are five basic things you need to do to work a binomial problem like
this one.
1. Firstly define Success. Success in this case
must be for a single trial.
Success = "Rolling a 6 on a single die"
2. Define the probability of success p: p = 1/6
3. Find the probability of failure which is 1 - p: q
= 5/6
4. Define the number of trials: n = 6
5. Define the number of successes out of those
trials: x = 2
$$(p+q)^6 = p^6 + 6 p^{6-1} q + \frac{6(6-1)}{2!} p^{6-2} q^2 + \frac{6(6-1)(6-2)}{3!} p^{6-3} q^3 + \dots$$
We need the term containing p^2, which is the probability of two successes.
The term is:
$$\frac{6(6-1)(6-2)(6-3)}{4!}\, p^{6-4} q^4 = \frac{6(6-1)(6-2)(6-3)}{4!} \left(\frac{1}{6}\right)^{6-4} \left(\frac{5}{6}\right)^4$$
$$= \frac{360}{24} \left(\frac{1}{6}\right)^{6-4} \left(\frac{5}{6}\right)^4 = 15 \times 0.02777 \times 0.48225 \approx 0.2$$
Apart from using knowledge of Pascal’s Triangle,
we can use the knowledge of counting rules.
 Example:
What is the probability of rolling exactly two
sixes in 6 rolls of a die?
1.Firstly define Success. Success in this case
must be for a single trial.
Success = "Rolling a 6 on a single die"
2. Define the probability of success p: p = 1/6
3. Find the probability of failure which is 1 - p: q
= 5/6
4. Define the number of trials: n = 6
5. Define the number of successes out of those
trials: x = 2
$$P(X = x) = P(x) = {}^{n}C_{x}\, p^{x} q^{n-x}$$
$$P(x) = \frac{6!}{2!\,(6-2)!} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^4$$
$$P(x) = 15 \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^4 = 0.20093 \approx 0.2$$
Example:
 A coin is tossed 10 times. What is the probability
that exactly 6 heads will occur.
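The counting-rule form of the binomial probability, P(x) = nCx p^x q^(n-x), can be evaluated directly in Python. This is a minimal sketch; the function name `binomial_pmf` is an assumption, and `math.comb` supplies the nCx coefficient.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

two_sixes = binomial_pmf(2, 6, 1 / 6)   # two sixes in 6 rolls, about 0.2009
six_heads = binomial_pmf(6, 10, 0.5)    # six heads in 10 tosses, 210/1024

print(round(two_sixes, 4), round(six_heads, 4))
```

For the coin example the coefficient is 10C6 = 210, so the probability is 210/1024, or about 0.205.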
Mean, Variance and Standard
Deviation
Example:
 Find the mean, variance, and standard deviation
for the number of sixes that appear when rolling
30 dice.
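For a binomial variable the standard results are mean = np, variance = npq and standard deviation = the square root of npq. A short Python sketch applies them to the 30 dice example; the variable names are assumptions.

```python
from math import sqrt

n = 30        # number of dice (trials)
p = 1 / 6     # probability of a six on one die
q = 1 - p     # probability of not rolling a six

mean = n * p            # expected number of sixes: 5
variance = n * p * q    # 25/6, about 4.17
std_dev = sqrt(variance)  # about 2.04

print(round(mean, 4), round(variance, 4), round(std_dev, 4))
```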
Normal Distribution
• Bell shaped.
• Also called the “Gaussian curve” after
the mathematician Karl
Friedrich Gauss.
• Normal distributions are symmetric around their
mean.
• The mean, median, and mode of a normal
distribution are equal and located at the peak.
• The area under the normal curve is equal to 1.0.
• Normal distributions are denser in the center and
less dense in the tails.
Properties of a Normal
Distribution
This is to say that the normal
probability distribution is asymptotic
- the curve gets closer and closer to
the x-axis but never actually
touches.
Normal distributions are defined
by two parameters, the mean (μ)
and the standard deviation (σ).
Properties of a Normal
Distribution
Approximately 68% of the area of a normal
distribution is within one standard
deviation of the mean.
Approximately 95% of the area of a
normal distribution is within two
standard deviations of the mean.
Properties of a Normal
Distribution
The density of the normal distribution
(the height for a given value on the x-
axis) is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
The parameters μ and σ are the mean and standard deviation, respectively, and
define the normal distribution. The symbol e is the base of the natural
logarithm and π is the constant pi.
Empirical Rule
• Approximately 68% of
the data lies in the interval
(μ − σ, μ + σ)
Figure 1. Empirical Rule
Empirical Rule
Example 1: Figure 2 shows a
normal distribution of age of
patients with a mean of 50 years
and a standard deviation of 10 years.
The shaded area is between
40 years and 60 years. What
proportion of the distribution does
the area contain?
Figure 2. Normal distribution of age of
patients
Empirical Rule
Example 2: A normal distribution of
concentration of glycogen in the blood has
a mean of 75mg and a standard deviation
of 10. The shaded area on the normal
distribution graph extends from 55.4mg to
94.6mg.
a. How many standard deviations are within the
shaded area?
b. Using Empirical rule, approximate the proportion
of the shaded area under the curve.
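The Empirical Rule percentages can be compared with exact areas using the error function, since the area within k standard deviations of the mean of any normal distribution is erf(k/√2). The following Python sketch uses that standard identity; the function name `area_within` is an assumption.

```python
from math import erf, sqrt

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))

one_sd = area_within(1)   # about 0.68
two_sd = area_within(2)   # about 0.95

# Example 2: how many standard deviations is 94.6 mg above the mean of 75 mg?
k_example_2 = (94.6 - 75) / 10
area_example_2 = area_within(k_example_2)

print(round(one_sd, 4), round(two_sd, 4))
print(round(k_example_2, 2), round(area_example_2, 4))
```

In Example 1 the shaded region from 40 to 60 years is one standard deviation either side of the mean, so it corresponds to `area_within(1)`.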
Standard Normal Distribution
The standard score and the standardized variable
For a population, the standard score (also called the
normal deviate, or z score or z value) is defined as:
$$z_i = \frac{x_i - \mu}{\sigma}$$
and for a sample it is defined as:
$$z_i = \frac{x_i - \bar{x}}{s}$$
Standard Normal Distribution
The standard score (z) shows how far any given data
value x_i is from the mean of the distribution in standard
deviation units, that is, how many standard deviations the value is
from the mean.
When, for any variable X, each measurement value in a
sample or population is transformed into a z value, this
process is known as standardizing (or normalizing) the
variable, and the resulting variable Z is called a
standardized variable.
Standard Normal Distribution
Example 3: Assuming the following sample
follows a normal distribution, first calculate
the sample mean x̄ and s, and then standardize the sample to
have a standard normal distribution: 3, 5, 7,
9, and 11.
Standard Normal Distribution
Solution:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{35}{5} = 7$$
$$s = \sqrt{\frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}} = \sqrt{\frac{5(285) - (35)^2}{5(4)}} = 3.16228$$
Standard Normal Distribution
Having determined s and x̄, we can proceed and
compute the z score for each observation:
$$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{3 - 7}{3.16228} = -1.2649$$
$$z_2 = \frac{x_2 - \bar{x}}{s} = \frac{5 - 7}{3.16228} = -0.6325$$
$$z_3 = \frac{x_3 - \bar{x}}{s} = \frac{7 - 7}{3.16228} = 0$$
$$z_4 = \frac{x_4 - \bar{x}}{s} = \frac{9 - 7}{3.16228} = 0.6325$$
$$z_5 = \frac{x_5 - \bar{x}}{s} = \frac{11 - 7}{3.16228} = 1.2649$$
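The standardization in Example 3 can be reproduced with Python's standard `statistics` module. This is a sketch; `stdev` uses the sample (n − 1) formula, which matches the value of s in the solution, and the variable names are assumptions.

```python
from statistics import mean, stdev

sample = [3, 5, 7, 9, 11]

x_bar = mean(sample)   # sample mean: 7
s = stdev(sample)      # sample standard deviation (n - 1 in the denominator)

# z score of each observation: distance from the mean in units of s
z_scores = [(x - x_bar) / s for x in sample]

print(x_bar, round(s, 5))
print([round(z, 4) for z in z_scores])
```

The z scores are symmetric about 0 because the sample values are symmetric about their mean.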
Finding Areas under the
Standard Normal Distribution
curve
Standard Normal Cumulative Probability
Table provides the cumulative
distribution function for values of z
rounded to the nearest hundredth.
This table provides the area under
the standard normal curve for
values of z less than those
identified in the table. This is
illustrated in the figure on the right
with the shaded region, labelled
probability.
Figure: Area under the curve
 The steps below demonstrate how to
use the table to find the area under the
standard normal curve that lies to the
left of a given Z value.
 Let's suppose Z = 1.46. Notice that the
value 1.46 = 1.4 + 0.06.
 The value 1.4 is found by scrolling down
the first column of the table and the
value .06 is found by moving right
across the top row.
 The intersection within the table of the row of
1.4 and the column of .06 is the value .9279.
This is the area under the normal curve to the
left of Z = 1.46.
Table 1. Standard Normal
Cumulative Probability Table
Often times, we are interested in
finding the Z-score that corresponds
to a given area under the
standard normal curve. The process
involves searching the array of area
values and working backwards to
find the Z-score
Example 4: Using the tables, find
the Z-score that corresponds to an
area of 0.9050 under the standard
normal curve to the left of the Z-
score.
When searching the array of values,
the closest one we see is .9049.
This value is in the row of 1.3 and
the column of .01. Thus, the Z-score is 1.31.
Table 2. Standard Normal
Cumulative Probability Table
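Both directions of table lookup (from a Z value to an area, and from an area back to a Z value) can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch, with the names `area_left` and `z_for_area` as assumptions.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Forward lookup: area to the left of Z = 1.46 (table value .9279)
area_left = Z.cdf(1.46)

# Reverse lookup: Z-score whose left-hand area is 0.9050 (Example 4)
z_for_area = Z.inv_cdf(0.9050)

print(round(area_left, 4), round(z_for_area, 2))
```

Because the table is rounded to two decimal places of z, the reverse lookup from the table gives the nearest tabulated value, whereas `inv_cdf` returns the unrounded z.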
Exercises 1.
Use Tables to find the following areas under the
standard normal curve.
1. The area that lies to the left of Z = -0.58.
2. The area that lies between Z = -1.16 and Z =
2.71.
3. The area that lies to the right of Z = 0.31.
Exercises 2.
1. Find the Z-score so that the area to the left of the
Z-score is 0.10.
2. Find the Z-score so that the area to the right of
the Z-score is 0.0735.
 We are often interested in finding the Z-score that
has a specified area to the right. For this reason,
we have special notation to represent this
situation
 The notation z_α,
pronounced "z sub alpha", is the Z-score such
that the area under the standard normal curve to
the right of z_α is α.
 Find the value of z_0.05.
 This means that the area under the curve to the right
of the z score is 0.05 and we need to find the
corresponding z value. Since our tables indicate areas
to the left of z scores, let’s find the area under the
curve to the left of the z score, i.e. 1 − 0.05 = 0.95.
 Now let’s find the z score corresponding to
0.95. In the tables, 0.95 falls midway between the
areas for z = 1.64 (0.9495) and z = 1.65 (0.9505), so
the corresponding z value is about 1.645, commonly
rounded to 1.65.
as a probability distribution
curve
 Recall that the area under the standard normal
distribution can be interpreted as either a
probability or as the proportion of the population
with the given characteristic. When interpreting
the area under the standard normal curve as a
probability, we use the following notation
 Notation for the Probability of a Standard Normal
Random Variable
 P(a < Z < b) represents the probability that a
standard normal random variable is between a
and b
P(Z > a) represents the probability that a standard
normal random variable is greater than a.
P(Z < a) represents the probability that a standard
normal random variable is less than a.
 Example 5: Let Z denote a sample of glucose
amount in the blood of patients which follows a
normal distribution with a mean of 0 and standard
deviation of 1.
a. Find P (Z > 2).
b. Find P (Z ≤ 1.73).
 Solution:
a. Since μ=0 and σ=1, the value of 2 is z=2
standard deviations above the mean. Proceed
down the first (z) column in the standard normal tables
and read the area opposite z=2.0. This area,
denoted by the symbol P(z), is P(2.0) = 0.9772. But
this is the probability to the left of the z score.
So P(Z > 2) = 1 − 0.9772 = 0.0228.
Therefore P(Z > 2) = 0.0228.
b. For z = 1.73, the tabulated area is P(1.73) = 0.9582.
Therefore P(Z ≤ 1.73) = 0.9582.
 Example 6: The achievement scores for a
college entrance examination are normally
distributed with mean 75 and standard deviation
10. What fraction of the scores lies between 80
and 90?
 Solution
The desired fraction of the population is given by
the area between
$$z_1 = \frac{80 - 75}{10} = 0.5 \quad \text{and} \quad z_2 = \frac{90 - 75}{10} = 1.5$$
 P(0.5 < Z < 1.5) = P(Z > 0.5) − P(Z > 1.5) = 0.3085 −
0.0668 = 0.2417 (equivalently, from the cumulative
table, 0.9332 − 0.6915 = 0.2417)
 Therefore the fraction of the scores lying between
80 and 90 is 0.2417
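Examples 5 and 6 can be verified without tables using `statistics.NormalDist`. This is a sketch; the variable names are assumptions, and the distribution for Example 6 is built directly from the stated mean and standard deviation so that no manual standardization is needed.

```python
from statistics import NormalDist

Z = NormalDist()               # standard normal, as in Example 5
p_greater_2 = 1 - Z.cdf(2)     # Example 5a: P(Z > 2)
p_at_most_173 = Z.cdf(1.73)    # Example 5b: P(Z <= 1.73)

scores = NormalDist(mu=75, sigma=10)           # Example 6
fraction = scores.cdf(90) - scores.cdf(80)     # area between 80 and 90

print(round(p_greater_2, 4), round(p_at_most_173, 4), round(fraction, 4))
```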
 Exercises 3.
 Let Z denote a normal random variable with
mean 0 and standard deviation 1.
 Find P(−2 ≤ Z ≤ 2).
 The grade point averages (GPAs) of a large
population of Public Health College students are
approximately normally distributed with mean 2.4
and standard deviation 0.8. If students
possessing a GPA less than 1.9 are dropped from
college, what percentage of the students will be
dropped?
 The weekly amount of money spent on cleaning
the city was observed, over a long period of time,
to be approximately normally distributed with
mean $400 and standard deviation $20. How
much should be budgeted for weekly cleaning to
provide that the probability the budgeted amount
will be exceeded in a given week is only 0.1?
Suppose a clinically accepted value
for mean systolic blood pressure in
males aged 20 to 24 years is 120
mmHg and the standard deviation is
20 mmHg.
a). If a 22 year old male is selected
at random from the population,
what is the probability that his
systolic blood pressure is equal to
INFERENTIAL STATISTICS
(Sampling and Estimation)
 Statistical inference is the estimation of
the population parameters such as the population
mean, the population proportion etc. derived from
the analysis of a sample drawn from that
population.
 A sample is a small part of the population which is
analysed as an example of the character,
features or qualities of the population.
 Sampling is the process of selecting a sample of
people or products from a population which is to
be used as a representative of the population of
interest.
 An estimate is an approximate calculation of
something and estimation is the process of
coming up with an estimate of a population
parameter.
 There are several sampling methods which are
important for you to know in order to appreciate
the process of sampling and estimation. The
methods are briefly described below and you
should take time to read around them from other
sources.
Sampling Methods
Probability sampling
Non probability sampling
In the probability sample every member of the
wider population has an equal chance of being
included in the sample; inclusion or exclusion
from the sample is a matter of chance and
nothing else.
In the non-probability sample some members of
the wider population definitely will be excluded
and others definitely included (i.e. every member
of the wider population does not have an equal
chance of being included in the sample)
Types of Probability Sample
1. Simple random sampling
Each member of the population under study
has an equal chance of being selected and
the probability of a member of the population
being selected is unaffected by the selection
of other members of the population.
One problem associated with this particular
sampling method is that a complete list of the
population is needed and this is not always
readily available
2. Systematic Sampling
It involves selecting subjects from a population list
in a systematic rather than a random fashion.
For example, if from a population of, say, 2,000,
a sample of 100 is required, then every twentieth
person can be selected. The starting point for the
selection is chosen at random.
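Simple random and systematic sampling can be contrasted with a short Python sketch. The population of 2,000 numbered members and the fixed seed are hypothetical, chosen only so that the illustration is repeatable.

```python
import random

population = list(range(1, 2001))   # hypothetical list of 2,000 member IDs
rng = random.Random(42)             # fixed seed (assumption) for repeatability

# Simple random sampling: every member has the same chance of selection.
simple_random = rng.sample(population, 100)

# Systematic sampling: random start, then every 20th member on the list.
step = len(population) // 100       # sampling interval of 20
start = rng.randrange(step)         # random starting point in the first interval
systematic = population[start::step]

print(len(simple_random), len(systematic), step)
```

Both approaches need a complete list of the population; the systematic sample uses a single random draw for the starting point.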
3. Stratified random sample
 Stratified sampling involves dividing the
population into homogenous groups, each group
containing subjects with similar characteristics.
 A stratified random sample is, therefore, a useful
blend of randomization and categorization,
thereby enabling both a quantitative and
qualitative piece of research to be undertaken.
4. Cluster sampling
 It involves the sampling of successively smaller
units
 Conditions for doing cluster sampling
1. The sampling frame cannot be identified
2. Direct contact needs to be made with the
sample units, but these are scattered around a
wide geographical area
 Cluster sampling is an example of 'two-stage
sampling' or 'multistage sampling': in the first
stage a sample of areas is chosen; in the second
stage a sample of respondents within those areas
is selected.
 Multistage sampling
Multistage sampling is a complex form of cluster
sampling in which two or more levels of units are
embedded one in the other.
The first stage consists of constructing the clusters
that will be used to sample from. In the second
stage, a sample of primary units is randomly
selected from each cluster (rather than using all
units contained in all selected clusters). In following
stages, in each of those selected clusters,
additional samples of units are selected, and so on.
 All ultimate units (individuals, for instance)
selected at the last step of this procedure are
then surveyed. This technique, thus, is essentially
the process of taking random samples of
preceding random samples.
Non probability samples
1. Convenience (Accidental/Opportunity) Sampling
It involves choosing the nearest individuals to serve as
respondents and continuing that process until the
required sample size has been obtained
The researcher simply chooses the sample from those
to whom she has easy access. As it does not represent
any group apart from itself, it does not seek to
generalize about the wider population
2. Quota Sampling
A quota sample strives to represent significant
characteristics (strata) of the wider population and it
sets out to represent these in the proportions in
which they can be found in the wider population.
For example, suppose that the wider population
(however defined) were composed of 55% females
and 45% males, then the sample would have to
contain 55% females and 45% males
3. Purposive Sampling
In purposive sampling, researchers handpick the
cases to be included in the sample on the basis
of their judgement of their typicality. In this way,
they build up a sample that is satisfactory to their
specific needs
Assumptions for one to use purposive sampling:
1. They possess the necessary knowledge
2. They have relevant experience
3. They are part of the social structure or process
on which the research is intended to focus
4. Snowball Sampling
Researchers identify a small number of
individuals who have the characteristics in
which they are interested. These people are
then used as informants to identify, or put the
researchers in touch with, others who qualify
for inclusion and these, in turn, identify yet
others
This method is useful for sampling a
population where access is difficult, maybe
because it is a sensitive topic or where
communication networks are undeveloped
“What sample size do I need?”
The answer to this question is influenced by a number
of factors, including:
 the purpose of the study, population size, the risk of
selecting a “bad” sample and the allowable sampling
error.
 Data analysis plan, e.g. the number of cells one will have
in cross tabulation
 Most of all whether undertaking a qualitative or
quantitative study
Sample size determination in
qualitative study
 Probability sampling not appropriate as sample
not intended to be statistically representative
 But, sample should have ability to represent
salient characteristics in population.
 Sample size taken until point of theoretical
saturation
…….
 Sample size is usually small to allow in-depth
exploration and understanding of phenomena under
investigation
 Ultimately a matter of judgement and expertise in
evaluating the quality of information against final use,
research methodology , sampling strategy and results is
necessary.
 In practice, qualitative sampling usually requires a
flexible, pragmatic approach.
…..
 The researcher actively selects the most productive
sample to answer the research question.
 This can involve developing a framework of the
variables that might influence an individual's
contribution and will be based on the researcher's
practical knowledge of the research area, the available
literature and evidence from the study itself.
• This is a more intellectual strategy than the simple
demographic stratification of epidemiological studies,
though age, gender and social class might be important
variables.
…….
 If the subjects are known to the researcher, they may
be stratified according to known public attitudes or
beliefs
 It may be advantageous to study a broad range of
subjects :
• (maximum variation sample)
• outliers (deviant sample)
• subjects who have specific experiences (critical case
sample)
• subjects with special expertise (key informant sample).
…….
 The iterative process of qualitative study design means
that samples are usually theory driven ( theoretical
sampling) to a greater or lesser extent
Some suggestions of sample size
in qualitative studies
 The smallest number of participants should be 15
 Should lie under 50
 6-8 participants for FGDs AND at least 2 FGDs per
population group
IMPORTANT
 Attainment of saturation
 Justification of choice of number
Sample size determination in
quantitative study
Several criteria will need to be
specified to determine the appropriate
sample size:
Level of precision,
Level of confidence or risk,
Degree of variability in the attributes
being measured ( prevalence)
External validity
…….
 The Level of Precision, sometimes called
sampling error
 The range in which the true value of the population is
estimated to be.
 This range is often expressed in percentage points
(e.g., ±5 percent).
 The Confidence Level
 Based on ideas encompassed under the Central
Limit Theorem.
 E.g. if a 95% confidence level is selected, about 95 out of
100 samples will have the true population value
within the range of precision.
…….
Degree of Variability
refers to the distribution of attributes in
the population.
The more heterogeneous a population,
the larger the sample size required to
obtain a given level of precision.
The less variable (more homogeneous) a
population, the smaller the sample size.
……
 A proportion of 50 % indicates a greater level of
variability than either 20% or 80%. This is because
20% and 80% indicate that a large majority do not or
do, respectively, have the attribute of interest.
 Because a proportion of 0.5 indicates the maximum
variability in a population, it is often used in determining
a more conservative sample size, that is, the sample
size may be larger than if the true variability of the
population attribute were used.
……
 Sample size affects accuracy of representation;
Larger sample means less chance of error
 Minimum suggested sample is 30 and upper limit is
1,000
External validity – how well sample generalizes to the
population, a representative sample is required (not
the same thing as variety in a sample)
Strategies for Determining Sample
Size
There are several approaches to determining the
sample size.
 Using a census for small populations
 Imitating a sample size of similar studies
 Using published tables
 Applying formulas to calculate a sample size
Using a Census for Small Populations
….
 One approach is to use the entire population as
the sample.
 Although cost considerations make this impossible
for large populations.
 Attractive for small populations (e.g., 200 or less).
 Eliminates sampling error and provides data on all
the individuals in the population.
 Some costs such as questionnaire design and
developing the sampling frame are “fixed,” that is,
they will be the same for samples of 50 or 200.
 Finally, virtually the entire population would have
to be sampled in small populations to achieve a
desirable level of precision
Using a Sample Size of a Similar
Study
 Use the same sample size as those of studies
similar to the one you plan( Cite reference).
 Without reviewing the procedures employed in these
studies you may run the risk of repeating errors that
were made in determining the sample size for
another study.
 However, a review of the literature in your discipline
can provide guidance about “typical” sample sizes
that are used.
Using Published Tables
 Published tables provide the sample size for a
given set of criteria.
 Necessary for given combinations of precision,
confidence levels and variability.
 The sample sizes presume that the attributes
being measured are distributed normally or
nearly so.
 Although tables can provide a useful guide for
determining the sample size, you may need to
calculate the necessary sample size for a
different combination of levels of precision,
confidence, and variability.
Sample Size for ±5%, ±7% and ±10% Precision Levels
where Confidence Level Is 95% and P = .5.
Size of Population    Sample Size (n) for Precision (e) of:
                      ±5%     ±7%     ±10%
100                   81      67      51
125                   96      78      56
150                   110     86      61
175                   122     94      64
200                   134     101     67
225                   144     107     70
250                   154     112     72
275                   163     117     74
300                   172     121     76
325                   180     125     77
350                   187     129     78
375                   194     132     80
400                   201     135     81
425                   207     138     82
450                   212     140     82
Using Formulas to Calculate a Sample
Size
 Sample size can be determined by the application of
one of several mathematical formulae.
 Formula mostly used for calculating a sample for
proportions.
For example:
 For populations that are large, the Cochran
(1963:75) equation yields a representative sample
for proportions.
 Fisher equation, Mugenda etc
Cochran equation
$$n_0 = \frac{Z_{\alpha/2}^2\, p q}{e^2}$$
Where n0 is the sample size,
Z is the abscissa of the normal curve that cuts off an
area α at the tails
((1 – α) equals the desired confidence level, e.g., 95%),
e is the desired level of precision,
p is the estimated proportion of an attribute that is
present in the population, and q is 1 − p.
The value for Z is found in statistical tables which
contain the area under the normal curve, e.g. Z = 1.96
for a 95% level of confidence.
…..
A Simplified Formula For Proportions
 Yamane (1967:886) provides a simplified formula
to calculate sample sizes.
 ASSUMPTION:
 95% confidence level
 P = .5 ;
$$n = \frac{N}{1 + N e^2}$$
Where n is the sample size,
N is the population size,
e is the level of precision.
Finite population correction for
proportions
 With finite populations, correction for
proportions is necessary
 If the population is small then the sample
size can be reduced slightly.
 This is because a given sample size
provides proportionately more information
for a small population than for a large
population.
 The sample size (n0) can thus be adjusted
using the corrected formulae
…..
Where n is the sample size,
N is the population size,
n0 is the sample size calculated
for an infinite population.
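The corrected formula is missing from the slide. A widely used form of the finite population correction for proportions is n = n0 / (1 + (n0 − 1) / N), and the sketch below assumes that form. The function name and the example population sizes are hypothetical.

```python
def finite_population_correction(n0: float, N: int) -> float:
    """Adjust an infinite-population sample size n0 for a population of size N.

    Assumed form: n = n0 / (1 + (n0 - 1) / N).
    """
    return n0 / (1.0 + (n0 - 1.0) / N)


n0 = 385  # e.g. a Cochran sample size for p = 0.5, e = 0.05, 95% confidence
for N in (500, 1000, 10_000, 1_000_000):
    print(N, round(finite_population_correction(n0, N), 1))
```

The adjusted size shrinks noticeably for small populations and approaches n0 as N grows, which is the behaviour described in the bullets above.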
Note
 The sample size formulae provide the number of
responses that need to be obtained. Many
researchers commonly add 10 % to the sample
size to compensate for persons that the
researcher is unable to contact.
 The sample size is also often increased by 30%
to compensate for non-response (e.g. with
self-administered questionnaires).
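The two allowances in this note amount to simple arithmetic. In the sketch below, the starting figure of 385 responses and the function name `inflate_for_losses` are assumptions for illustration, and the allowance is applied as a plain percentage increase, as the note describes.

```python
import math


def inflate_for_losses(n: float, rate: float) -> int:
    """Increase a required number of responses n by a fractional allowance."""
    return math.ceil(n * (1.0 + rate))


required = 385  # hypothetical number of responses needed

# 10% allowance for people the researcher cannot contact
print(inflate_for_losses(required, 0.10))
# 30% allowance for non-response
print(inflate_for_losses(required, 0.30))
```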
Use of software in sample size
determination
Depending on the type of study and the specific software,
some of the following information will be required:
 Population size, population standard
deviation, sampling error, confidence
level, z-value, power of the study etc.
 80% power in a clinical trial means that the study
has an 80% chance of ending up with a p value of
less than 5% (0.05) in a statistical test (i.e. a statistically
significant treatment effect) if there really is an
important difference (e.g. 10% versus 5%
mortality) between treatments.
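To show how these inputs are used, here is a rough sketch of the kind of calculation such software performs for the 10% versus 5% mortality example. The formula is a standard normal approximation for comparing two independent proportions, not one taken from these slides, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group to detect p1 versus p2.

    n = (z_{alpha/2} + z_{beta})^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)               # desired power
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2
    return math.ceil(n)


n_80 = n_per_group_two_proportions(0.10, 0.05, power=0.80)
n_90 = n_per_group_two_proportions(0.10, 0.05, power=0.90)
print(n_80, n_90)
```

Asking for more power, or for the detection of a smaller difference, raises the required sample size.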
Further considerations
 The above approaches to determining sample size
have assumed that a simple random sample is the
sampling design.
 With more complex designs, e.g. case control studies,
one must take into account the variances of
subpopulations, strata, or clusters before an estimate of
the variability in the population as a whole can be
made.
Estimation
 Inferential statistics is the estimation of the
population parameters from the sample statistics.
 The sample statistics are calculated from the
sample data and the population parameters are
inferred (or estimated) from the sample statistics.
 In estimation, we are concerned with unknown
population parameters, such as a population
mean which is unknown but is required.
 Such situations force us to take samples, find
sample statistics and use them to infer upon the
unknown population parameters.
 We can estimate an unknown population
parameter in two main ways;
(i) By calculating a point estimate from the samples.
A point estimate is a single value from the sample
such as the sample mean used to estimate an
unknown population parameter such as the
population mean µ.
(ii) You can also calculate an interval estimate
which is a range within which the unknown
population parameter is expected to fall.
 Whether we find a point estimate or an interval
estimate, in both cases, we are trying to find or
estimate the value of an unknown population
parameter. The estimator so found must satisfy
three conditions:
(i) It must be unbiased: The expected value of
the estimator must be equal to the population
parameter,
(ii) Consistent: The value of the estimator
approaches the value of the parameter as the
sample size increases,
(iii) Relatively Efficient: The estimator has the
smallest variance of all estimators which
could be used.
Estimating a population mean
 Consider a population whose mean µ is unknown,
as illustrated by the figure below.
Note that the large area is the population while the
smaller areas are samples taken from the
population.
In order to estimate µ, we will need to take samples
from the population and calculate the sample
means.
Each sample mean, $\bar{x}_i$, is trying to estimate µ
individually.
However, note that the best estimate of µ is the
mean of the sample means, called the mean of the
sampling distribution of means, $\mu_{\bar{x}}$:
$$\mu_{\bar{x}} = \frac{\bar{x}_1 + \bar{x}_2 + \bar{x}_3 + \bar{x}_4 + \dots + \bar{x}_n}{n}$$
For large n, $\mu_{\bar{x}} = \mu$
(the mean of the sampling distribution of means is equal to the
population mean).
$\bar{x}$ is a point estimate of the population mean μ. It is
called a consistent estimator because its value gets
closer to the population mean μ as the sample size
n increases.
Irrespective of the number of samples under
consideration, a point estimate is likely to be
different from its corresponding population
parameter.
It is for this reason that interval estimates are
preferred to point estimates.
 An interval estimate is a range within which an
unknown value is expected to fall.
 You may be asked to estimate the day when the
first rains will fall this year.
 A point estimate would be to say the first rains will
fall on 12th October.
 We are saying this estimate is unlikely to be
correct. However, giving a range within which
the date of the first rains falls would be a better
estimate of the date when the first rains will fall.
 The wider the range, the more the confidence
that indeed the first rains will fall within that
period.
 I can say, for example that the first rains will fall
between 1st October and 31st January. From
experience, first rains always fall after 1st October
and way before 31st January.
 I can therefore say that I am 100% confident that
the first rains will fall between 1st October and 31st
January.
 You will also agree with this level of confidence
because you know very well the rains don’t fall
until way after 1st October and way before 31st
January. If we are 100% confident (probability = 1)
then we can represent this on a normal curve.
Figure showing confidence level and Limits
 The level of confidence is called the Confidence
Level (100%) while the dates 1st October and 31st
January are called the Confidence Limits.
 If I shift the confidence limits and ask what your
level of confidence is that the first rains will come
between 1st November and 31st December, you
may not be 100% confident.
 The level of confidence drops because you know
there are many years when the first rains have
come in October.
 You also know that first rains have sometimes
come as late as after Christmas.
 Your level of confidence may therefore be say
96%. The 4% is the likelihood that you are wrong,
that the first rains could come before 1st
November and after 31st December.
Figure showing 96% confidence level, confidence
limits and confidence interval for the day the first
rains will fall in Malawi.
 The above approach is also true for any unknown
population parameter such as the population
mean μ.
 A confidence interval is an interval estimate with a
specific level of confidence.
 A level of confidence is the probability that the
interval estimate will contain the parameter. In
other words, it is the percentage of interval estimates,
calculated in this way from repeated samples, that
will contain the true mean.
 The confidence interval is therefore a range
within which an unknown population parameter is
expected to fall.
 The confidence limits are the end points of the
confidence interval, i.e. the values between which
the level of confidence is declared.
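The reading of a confidence level as a long-run percentage can be checked by simulation. In the sketch below, the population (normal, mean 50, standard deviation 10), the sample size, and the function name `coverage` are all assumptions chosen for illustration. Each interval is the sample mean plus or minus z standard errors, with the population standard deviation treated as known.

```python
import random
from statistics import NormalDist, fmean


def coverage(mu: float, sigma: float, n: int, confidence: float,
             repeats: int, seed: int = 1) -> float:
    """Fraction of repeated-sample intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / n ** 0.5  # z standard errors either side
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / repeats


rate = coverage(mu=50.0, sigma=10.0, n=30, confidence=0.95, repeats=4000)
print(f"{rate:.1%} of the intervals contained the true mean")
```

The printed percentage should be close to, but not exactly, 95%, because it is itself an estimate from a finite number of repeated samples.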
 For the estimation of μ, the sample means
$\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \dots, \bar{x}_n$
will be different from each other and also from their
mean $\mu_{\bar{x}}$.
As a result, the sample means will have a standard
deviation about them, $\sigma_{\bar{x}}$.
This standard deviation is called the standard error
(SE).
The Central Limit Theorem states that irrespective
of the distribution of the parent population (whether
normal or not), the sampling distribution of means
will be approximately normally distributed when the
samples are large enough (i.e. the sample means
are normally distributed about $\mu_{\bar{x}}$ with standard
deviation $\sigma_{\bar{x}}$),
 known as the standard error.
 Figure showing sample means normally
distributed.
 This means we can use the normal distribution
tables to determine the probability of a sample mean
having any value of interest,
provided we know the mean of the distribution
(mean of sample means, $\mu_{\bar{x}}$) and the standard
deviation of the distribution, $\sigma_{\bar{x}}$.
 The standard error of the mean (SEM) is the standard
deviation of the sampling distribution of means. It can
also be viewed as the standard deviation of the error in
the sample mean relative to the true mean, since the
sample mean is an unbiased estimator of μ.
 SEM is given by the population standard
deviation divided by the square root of the sample size.
When σ is unknown, it is usually estimated by using the
sample standard deviation in its place:
$$SD_{\bar{x}} = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
 Where;
 σ is the standard deviation of the population.
 n is the size (number of observations) of the
sample.
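Both the Central Limit Theorem and the standard error formula can be illustrated with a small simulation. The sketch below assumes an exponential parent population (mean 1, standard deviation 1), which is clearly not normal. The sample size of 50 and the number of samples are arbitrary choices.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)
n = 50              # observations in each sample
num_samples = 4000  # number of samples drawn from the population

# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed
mu, sigma = 1.0, 1.0

sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(fmean(sample))

mean_of_means = fmean(sample_means)  # should be close to mu
se_observed = pstdev(sample_means)   # spread of the sample means
se_theory = sigma / n ** 0.5         # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {mu})")
print(f"observed SE: {se_observed:.3f}, sigma/sqrt(n): {se_theory:.3f}")
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual observations are skewed.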
 Figure showing the 95% confidence interval
estimate for μ.
 We are 95% confident that the population mean
value from which the sample was taken falls
somewhere between X1 and X2.
 Any confidence interval is given with a level of
confidence which is given in percentage terms.
You can have a 95% confidence interval or a 99%
confidence interval or indeed any level of
confidence. The level of confidence determines
how many standard deviations (Z) from the mean
$\mu_{\bar{x}}$ the confidence limits lie.
 The value of Z is obtained from tables. For 95%
level of confidence, the area at the centre is 0.95.
 You need to search for $\frac{5\%}{2} = 0.025$
inside the tables to be able to read the
corresponding Z value. The value of Z for
area = 0.025 is ±1.96. The confidence interval runs
from X1 to X2; the 95% confidence interval is
shown below.
 The position X1 is 1.96 standard deviations
(standard errors) less than the mean, i.e.
$$X_1 = \bar{x} - 1.96\,\sigma_{\bar{x}}$$
 Similarly, the position X2 on Figure 3.6 is 1.96
standard errors more than the mean. This
therefore means that;
 The 95% confidence interval is
$$\bar{x} - 1.96\,\sigma_{\bar{x}} \quad \text{to} \quad \bar{x} + 1.96\,\sigma_{\bar{x}}$$
 Since we know that
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
then the 95% confidence interval for μ is
$$\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \quad \text{to} \quad \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}$$
 The 95% confidence interval for μ is therefore,
$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$
 Similarly, we can find the 99%, 98%, 90% etc.
confidence intervals for the population mean
given data from samples taken from that
population.
 Example
 As part of a malaria control programme it was
planned to spray all 10 000 houses in a rural area
with insecticide and it was necessary to estimate
the amount that would be required. Since it was
not feasible to measure all houses, a random
sample of 100 houses was chosen and the
sprayable surface of each of these was
measured.
 The mean sprayable surface area for these 100
houses was 23.2 m² and the standard deviation
was 5.9 m².
(a) Calculate the standard error of the estimate
of the population mean.
(b) What is a standard error?
(c) The 95% confidence interval of the population
mean was 22.0 m² to 24.4 m². What is a confidence
interval?
(d) What is the difference between a standard error
and a standard deviation?
SOLUTION
(a) The standard error of the estimate of the population mean µ:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{5.9}{\sqrt{100}} = 0.59\ \text{m}^2$$
(b) A standard error is the standard deviation of the
sampling distribution of means. It is related to the
population standard deviation in this way:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
(c) A confidence interval is a range within which an
unknown population parameter is expected to fall.
(d) A standard error is the standard deviation of the
sampling distribution of means, i.e. it measures how far
sample means are, on average, from their own mean,
while a standard deviation is
a measure of how far the values in a set of
data are, on average, from their mean.
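The figures in this example can be reproduced with a short script. It is a sketch that uses the sample standard deviation (5.9 m²) in place of σ, which is an assumption that is reasonable for a sample of 100. The function name `mean_confidence_interval` is illustrative.

```python
from statistics import NormalDist


def mean_confidence_interval(xbar: float, sd: float, n: int,
                             confidence: float = 0.95):
    """Return (lower limit, upper limit, standard error) for a population mean."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    se = sd / n ** 0.5  # standard error of the mean
    return xbar - z * se, xbar + z * se, se


# Sprayable surface area: mean 23.2 m², SD 5.9 m², n = 100 houses
for level in (0.90, 0.95, 0.98, 0.99):
    low, high, se = mean_confidence_interval(23.2, 5.9, 100, level)
    print(f"{level:.0%}: SE = {se:.2f}, interval = {low:.1f} to {high:.1f} m²")
```

The 95% line agrees with the interval of 22.0 m² to 24.4 m² quoted in part (c), and the other lines show how the interval widens as the level of confidence rises.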
Estimating a population
proportion
 One of the population parameters that need to be
estimated is the population proportion p.
 If a population proportion such as the prevalence
of a disease in the entire population is unknown,
it may be estimated through sampling the
population as discussed.
 The sample statistics are the best estimates of
the unknown population proportion. The
population proportion, ρ can be estimated from
the sample proportion p.
 The 95% confidence interval for the population
proportion ρ is given by;
 95% Confidence interval for ρ
$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
 Note that SE(p) is given by
$$\sqrt{\frac{p(1-p)}{n}}$$
 Example
A health survey was carried out in Mangochi urban
in 2014 among 123 adults chosen at random. The
survey, among other things asked respondents
when they last visited a sing’anga (an African
medicine man). The answers revealed that 34 of
them had not visited a sing’anga for over 2 years.
(a) Calculate an estimate of the proportion of adults who
had not visited a sing’anga for over two years.
(b) Find the 95% confidence interval for the
proportion of adults who had not visited a sing’anga.
What is the meaning of this confidence interval to
you?
(c) If a narrower confidence interval of this
proportion was required, what would you
recommend to the researchers?
(d) What percentage of adults in Mangochi had
visited a sing’anga in the past 2 years? Calculate
the 98% confidence interval about this proportion.
Solutions
a) Proportion of adults who had not visited a
sing’anga for the past two years.
$$p = \frac{34}{123} = 0.2764 \text{ or } 27.64\%$$
b) Find the 95% confidence interval for the
proportion of adults who had not visited a sing’anga.
What is the meaning of this confidence interval to
you?
The 95% confidence interval for ρ:
$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.2764 \pm 1.96\sqrt{\frac{0.2764(1-0.2764)}{123}} = 0.1974 \text{ to } 0.3554$$
 It means that we have observed from the sample
that 27.64% of adults did not visit a sing’anga for
the past two years, but if we were to deal with the
whole population, we are 95% confident that the
proportion of adults who had not visited a sing’anga would be
somewhere between 19.74% and 35.54%.
(c) If a narrower confidence interval of this
proportion was required, what would you
recommend to the researchers?
In order to narrow the confidence interval (i.e. obtain a
more precise estimate) you need to increase the
sample size.
(d) What percentage of adults in Mangochi had
visited a sing’anga in the past 2 years?
 The percentage of adults who had visited a
sing’anga in the past 2 years is 100% − 27.64% = 72.36%.
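The calculations in this example can be sketched in Python as follows. The function name `proportion_confidence_interval` is illustrative. Part (d) assumes that the 89 adults not counted in part (a) had all visited a sing’anga within the past two years. Part (c) is illustrated by pretending the same proportion had been observed in a sample four times as large.

```python
from statistics import NormalDist


def proportion_confidence_interval(successes: int, n: int,
                                   confidence: float = 0.95):
    """Return (p, lower limit, upper limit) using p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    se = (p * (1.0 - p) / n) ** 0.5
    return p, p - z * se, p + z * se


# (b) 34 of 123 adults had not visited a sing'anga: 95% interval
p, low, high = proportion_confidence_interval(34, 123, 0.95)
print(f"(b) p = {p:.1%}, 95% CI {low:.1%} to {high:.1%}")

# (d) the remaining 89 of 123 had visited: 98% interval
p_v, low_v, high_v = proportion_confidence_interval(123 - 34, 123, 0.98)
print(f"(d) p = {p_v:.1%}, 98% CI {low_v:.1%} to {high_v:.1%}")

# (c) same proportion, four times the sample size: the interval is half as wide
_, low4, high4 = proportion_confidence_interval(4 * 34, 4 * 123, 0.95)
print(f"(c) width falls from {high - low:.3f} to {high4 - low4:.3f}")
```

Because the standard error shrinks with the square root of n, halving the width of the interval needs roughly four times as many respondents.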
DATA COLLECTION AND
MANAGEMENT
 Data collection is a major part of the
research process.
 Methods and instruments for data
collection must be chosen according to
the nature of the problem, approach to
the solution and variables being studied
Qualitative Data Collection
Methods
1. Collecting verbal data
 Verbal data consist primarily of words
obtained through various methodological
approaches in which research participants
speak about events, experiences, practices, and
so on.
 This is achieved through interviews,
focus group discussions and narratives
The three main methods of data
collection
1. In-depth interviews (IDIs)
Interviewing is often used in
qualitative studies to elicit
meaningful data. In interviews, the
interviewer writes down responses
verbatim or uses a tape-recorder for
later transcription.
 IDIs in qualitative research encourage
subjects to express their views at length.
 The respondent is usually interviewed at
a place convenient to them.
 An interview schedule, sometimes called
an interview guide, is a list of topics
administered to subjects by a skilled
interviewer.
 The researcher may be able to obtain
more detailed information from each
participant, but loses the richness that
can arise in a group (FGD) in which
people debate issues and exchange
views.
Example:
Please describe your experiences on
the day you were discharged from the
hospital.
 The interview helps reveal more about
beliefs and attitudes and behaviour
according to the respondent.
 IDIs normally use open-ended
questions which permit free responses
which should be recorded in the
respondents’ own words. Such
questions are useful for obtaining in-
depth information on:
 1. Facts with which the researcher is not
very familiar,
 2. Opinions, attitudes and suggestions of
informants,
 3. Sensitive issues.
 In order to have quality data with open-ended
questions there is a need to:
1. Thoroughly train and supervise the
interviewers or select experienced
research assistants.
2. Prepare a list of further questions to
keep at hand to use to ‘probe’ for
answer(s) in a systematic way
3. Pre-test open-ended questions and, if
possible, pre-categorise the most
common responses, leaving enough
space for other answers.
2. Semi-structured interviews
 Semi-Structured Interviews allow
participants to provide specific answers
to questions in their own words. When
open-ended questions are included in
the data collection tool, respondents
must write out their responses.
 The focus of the interview is decided by
the researcher and there may be areas
the researcher is interested in exploring.
 The researcher tries to build a rapport
with the respondent and the interview is
like a conversation.
3. Focus group discussions (FGDs)
 For this method the researcher brings
together a small number of subjects
usually between 6 and 12 to discuss the
topic of interest.
 The group size is kept deliberately
small, so that its members do not feel
intimidated but can express opinions
freely.
 The small number of participants also
makes discussion manageable by the
 However, too few participants may
result in an inadequate discussion and
too many may lead to social loafing by
some.
 A focus group questionnaire is called a
"discussion guide", and is more of a
check list of questions than a fully
structured questionnaire.
 This is because the trick with focus
groups is to put the group firmly in
 The use of purposive sampling is most
often employed when individuals known
to have a desired expertise are sought.
Direct observation
 Data can be collected by an external
observer, referred to as a non-
participant observer.
 Or the data can be collected by a
participant observer, who can be a
member of staff undertaking usual
duties while observing the processes of
care.
 In this type of study the researcher aims
to become immersed in or become part
of the population being studied, so that
they can develop a detailed
understanding of the values and beliefs
held by members of the population.
 Sometimes a list of observations the
researcher is specifically looking for is
prepared beforehand; at other times the
observer makes notes about anything
they observe for analysis later.
Quantitative Data Collection
Methods
1. Questionnaires
 A questionnaire is an instrument with
closed questions or statements to which
a respondent must react.
 Close-ended questionnaires ask
subjects to select an answer from
among several choices.
 The alternatives may range from a
simple ‘yes’ or ‘no’ to complex
expressions of opinion.
 Examples
1. Have you been hospitalized as an
inpatient at any time in the past 5 years?
a. Yes
b. No
2. How important is it to you to avoid a
pregnancy at this time?
a. Extremely important
b. Very important
c. Somewhat important
d. Not important
Scales
 A scale is a set of numerical values
assigned to responses, representing the
degree to which subjects possess a
particular attitude, value or
characteristic.
 Likert Scales
 Likert scales, also called summative
scales, require subjects to respond to a
series of statements to express a
viewpoint.
 Subjects read each statement and
select an appropriately ranked
response.
 Response choices commonly address
agreement, evaluation, or frequency.
 Likert’s original scale included five
agreement categories: “strongly agree
(SA),” “agree (A),” “uncertain (U),”
“disagree (D),” and “strongly disagree
(SD).”
 The number of categories in the Likert
scale can be modified: it can be
extended to seven categories (by adding
“somewhat disagree” and “somewhat
agree”) or reduced to four categories (by
eliminating “uncertain”).
For example:
 What is your opinion on the following
statement?
‘Women who have induced abortion
should be severely punished.’
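Because Likert scales are summative, responses are usually converted to numbers and added up. The sketch below assumes the common coding SA = 5 down to SD = 1, which is not given in the slides. The item names, the answers, and the function name `summative_score` are hypothetical.

```python
# Assumed numeric coding for the five agreement categories (not from the slides)
LIKERT_CODES = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}


def summative_score(responses: dict, reverse_items=frozenset()) -> int:
    """Sum the coded responses; negatively worded items are reverse-scored."""
    total = 0
    for item, answer in responses.items():
        value = LIKERT_CODES[answer]
        if item in reverse_items:
            value = 6 - value  # 5 becomes 1, 4 becomes 2, and so on
        total += value
    return total


# One respondent's answers to three hypothetical statements
answers = {"item1": "SA", "item2": "D", "item3": "U"}
score = summative_score(answers, reverse_items={"item2"})
print(score)
```

Reverse-scoring matters when some statements are worded in the opposite direction, so that a high total always points the same way.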
Data Management
 Data management consists of those
activities aimed at achieving a
systematic, coherent manner of data
collection, storage and retrieval.
 How data are stored and retrieved is at
the heart of data management.
 A good storage and retrieval system is
critical for keeping track of what data are
available, for permitting easy, flexible,
reliable use of data and for documenting
the analysis made so that the study can,
in principle, be verified or replicated.
 A system for storage and retrieval
should be designed prior to the actual
data collection.
 In data management, you may consider
some of the following points or
questions:
 The principal investigator is responsible
for ensuring that data are of high quality
by, for example, completely checking a
subset of all completed interviews.
 Data organization: How will you name
your data files? How will you organize
your data into folders?
 Access & security: Who will have access
to your data? If the data is sensitive,
how will you protect it from unauthorized
access?
 Storage: Where will your data be
stored?
 Backups: This is probably the single
most important item on this list. Hard
drives on desktop and laptop computers
fail regularly. You must have a credible
backup strategy of regular backups, and
of course you must then follow it.
Consider including an off-site backup so
that your data will not be lost if your
building burns down or if your computer
is stolen. Rather than relying on
memory, consider an automated backup
 A large amount of qualitative data can
be stored on computers using a variety
of available computer applications.
Therefore, gaining as much knowledge
as possible about computer programs is
critical.
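The advice on backups can be made concrete with a small script. This is only a sketch: the file name `ward_survey.csv`, the folder layout, and the function name `backup_file` are hypothetical, and running the script on a schedule is left to the operating system's own scheduler.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir under a timestamped name and return the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also keeps the file's timestamps
    return target


# Demonstration in a throwaway directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "ward_survey.csv"
    data.write_text("id,length_of_stay_days\n1,7\n")
    copy = backup_file(data, Path(tmp) / "backups")
    print(copy.name, copy.read_text() == data.read_text())
```

In practice the backup folder would sit on a different disk or an off-site location, for the reasons given under Backups above.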
 It is recommended that original data be
preserved for a period of not less than 5
years, as there is reasonable
expectation that the original data will
continue to be the basis of ongoing
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 

Statistics-1.ppt

  • 1. Matthews Lazaro MSc Biostatistics DESCRIPTIVE STATISTICS KAMUZU COLLEGE OF NURSING
  • 2. Basic Definitions  Statistics is the science that deals with the collection, classification, analysis, interpretation and presentation of numerical facts or data.  Data Collection Sources of data are many, the clinical area is one where measurements from patients could be a data source. There are variables that could be measured such as length of stay in the ward for patients, age of patients, types of diseases or conditions, distance travelled to the health facility etc. For example
  • 3. Example  The following data could be collected from under- five children ward on the length of stay by patients ; 2 days, Brown; 7 days Black, 0.5 days
  • 4. Sample and Population Symbols As we progress in this course there will be different symbols that represent the same thing. The only difference is that one comes from a sample and one comes from a population.
  • 5. Symbols under this topic Sample mean: $\bar{x}$; Sample variance: $s^2$; Sample standard deviation: $s$; Population mean: $\mu$; Population variance: $\sigma^2$; Population standard deviation: $\sigma$
  • 6. Classification  Normally when data is collected, it is raw i.e. it is not processed.  For example the data collected on length of stay in the under-five ward is raw data.  One can present this data in groups called classes e.g. 0 - 5 days, 6-10 days, 11-15 days etc  Each class will have corresponding frequencies  Data presented in classes and corresponding frequencies is called frequency distribution.
  • 7. Example
      No. of days in ward (Class)    No. of patients (Frequency)
      0-5                            1
      6-10                           8
      11-15                          15
      16-20                          9
      21-25                          5
      26-30                          2
      Total                          40
  • 8.  This data needs to be analyzed and presented in a form that could easily be understood by most people who may not know the intricacies of data analysis
  • 9. Interpretation  Data analysis and interpretation is the process of assigning meaning to the collected information and determining the conclusions, significance, and implications of the findings.  In a situation where there has been an intervention, the purpose of the data analysis and interpretation phase is to transform the data collected into credible evidence about the performance of say an intervention.
  • 10. For the frequency distribution above, the analysis and interpretation would draw on measures of central tendency such as the mean, and measures of spread such as the standard deviation.
  • 11. Presenting data in diagrams and charts. Quantitative data is usually presented in figures and tables. (a) Bar Chart: used for discrete data. The categories on the x-axis are not linked. Table 1 shows hypothetical colours of eyes for patients in a hospital.
      Table 1: Frequency distribution of eye colour
      Colour of eyes    No. of patients
      Black             11
      White             3
      Red               14
      Brown             25
      Blue              5
  • 13. Pie Chart  A pie chart (or a circle chart) is a circular statistical graphic, which is divided into sectors to illustrate numerical proportion.  The Pie Chart may be used for both continuous as well as discrete data.
  • 15. (c) Histogram  A Histogram is a graphical display of data using bars of different heights. It is similar to Bar Chart only that a Histogram is used to display continuous data and hence the bars touch each other.  A histogram is a very important chart and is used in many situation in statistics hence details of its construction are discussed in later sections but basically a histogram looks as in Figure 2
  • 17. Types of data  Data refers to the information that has been collected from an experiment or a survey/research, or some historical record.  Collected statistical data falls into one of two categories, discrete data or continuous data  Discrete data is a set of data values which occupies only whole number values, often a count or score Example;  number of patients admitted in a ward etc
  • 18.  Continuous data is any data that has infinite values with connected data points, often a measurement.  Continuous data will occupy both whole number as well as fractional parts.  Examples of continuous data include; height of a person (e.g. 1.72m; 1 is the whole number part while 0.72 is the fractional part), baby birth weight, distance covered in a race etc.
  • 19.  Data that is collected may be presented raw or grouped  As an example, 100 birth weights for babies born at a clinic in Chiradzulu were presented raw as follows; 3.1 3.3 1.3 2.9 2.2 3.4 4.1 5.1 4.9 4.0 5.2 1.8 2.1 3.2 2.2 3.3 2.4 3.4 2.5 3.1 2.6 3.2 2.7 4.0 3.3 2.8 4.1 1.1 2.9 3.5 4.2 1.9 3.6 3.0 2.1 2.2 3.8 2.3 3.4 4.6 4.7 3.4 3.5 3.7 3.8 2.7 2.9 2.8 3.1 3.3 3.4 2.6 3.5 4.8 4.6 4.3 2.6 3.2 2.7 4.0 3.3 2.8 4.1 1.1 2.9 3.5 4.2 1.9 3.6 3.0 2.1 2.2 3.1 3.3 1.3 2.9 2.2 3.4 4.1 5.1 4.9 4.0 5.2 1.8 2.1 3.2 2.2 3.3
  • 20. We can organize this data into five classes as shown in Table 1:
      Class      Frequency
      1.1-2.0    9
      2.1-3.0    33
      3.1-4.0    38
      4.1-5.0    16
      5.1-6.0    4
      Total      100
  • 21.  Although the baby weights are presented to one place of decimal, it is possible that some of the weights were accurate to two places of decimal  Suppose a baby’s weight were 3.06kg in which class would we place that weight?  It would not be in the class 2.1 – 3.0 because 3.06 is larger than 3.0. It would also not be in the class 3.1 – 4.0 because 3.06 is less than 3.1
  • 22. This therefore means that the classes above have gaps between them, so many babies could go unrecorded. The end-points of classes written with such gaps are called class limits. In order to eliminate the gaps between the classes we introduce what are called Class Boundaries. We first identify the gap between the classes in the Class Limits. In the case above, the gaps are 0.1 each, i.e. from 3.0 in the second class to 3.1 in the third class, the difference is 0.1.
  • 23.  If you divide this gap by 2 and use that to stretch each class you end up with class boundaries  For example 0.1/2=0.05 Then the class 2.1-3.0 will be stretched by 0.05 resulting into 2.05-3.05 The next class will be 3.05-4.05 and so on
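The stretching step above can be sketched in a few lines of Python. This is only an illustration, assuming the class limits are given as (lower, upper) pairs with the same gap between every pair of neighbouring classes; the function name `class_boundaries` is made up for this sketch.

```python
def class_boundaries(limits):
    """Turn (lower, upper) class limits into class boundaries."""
    # Gap between the upper limit of one class and the lower limit of the next.
    gap = round(limits[1][0] - limits[0][1], 10)
    half = gap / 2
    # Stretch every class by half the gap on each side.
    return [(round(lo - half, 10), round(hi + half, 10)) for lo, hi in limits]


# Class limits of the birth-weight table in the slides.
limits = [(1.1, 2.0), (2.1, 3.0), (3.1, 4.0), (4.1, 5.0), (5.1, 6.0)]
boundaries = class_boundaries(limits)
# Class mid-point = (upper boundary + lower boundary) / 2
midpoints = [round((lo + hi) / 2, 10) for lo, hi in boundaries]

print(boundaries)
print(midpoints)
```

With boundaries in place, a weight such as 3.06 kg has an unambiguous class, because neighbouring classes now share an end-point.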
  • 24. Table Class Boundaries Frequency 1.05-2.05 9 2.05-3.05 33 3.05-4.05 38 4.05-5.05 16 5.05-6.05 4 Total 100
  • 25. The value that is at the centre of the Class Boundary is called the Class Mid-point such that: $\text{Class Mid-point} = \frac{\text{Upper Class Boundary} + \text{Lower Class Boundary}}{2}$
  • 26. Descriptive Statistics  Descriptive statistics are numbers or data that are used to summarize and describe data.  Descriptive statistics tend to summarize a sample in order to get an idea about the population  The main features of the sample are also the main features of a population.
  • 27. Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or "average" value in a data set. There are 3 values that are considered measures of the center: 1. Mean 2. Median 3. Mode
  • 28. Measures of Central Tendency for raw data. Suppose you are weighing babies born at your clinic somewhere in Malawi, and the baby weights (in kg) of the first 10 babies were as follows: 2.7, 3.4, 3.0, 4.1, 5.2, 1.9, 2.3, 3.0, 3.3, 3.0. What single figure could represent the baby weights at this clinic? Let's see how different measures of central tendency are computed.
  • 29. The mode  The mode is the data value or datum (or value) which appears the largest number of times in the set or the most frequently occurring figure in the set  If no data value is repeated, we say there is no mode. Using the following data set; 2.7kg, 3.4kg, 3.0kg, 4.1kg, 5.2kg, 1.9kg, 2.3kg, 3.0kg, 3.3kg, 3.0kg. The mode is 3.0kg (highest frequency)
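As a quick check of the example above, the mode can be found with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for illustration only.

```python
from statistics import multimode

# Baby weights (kg) from the slide.
weights = [2.7, 3.4, 3.0, 4.1, 5.2, 1.9, 2.3, 3.0, 3.3, 3.0]

# multimode returns every value that shares the highest frequency.
modes = multimode(weights)
print(modes)
```

Note that `multimode` returns all values when none is repeated, so the "no mode" case described on the slide would need a separate check (for example, comparing the length of the result with the length of the data).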
  • 30. The Median  The median is defined as the middle figure after the data set is ranked or placed in order of magnitude. Example 22, 29, 35, 24, 26, 15, 28, 36, 45, 21, 33, 5, 46, 21, 19, 41, 5, 84, 58, 63, 5, 23 Find the median. Solution Rank the data in ascending order 5, 5, 5, 15, 19, 21, 21, 22, 23, 24, 26, 28, 29, 33, 35, 36, 41, 45, 46, 58, 63, 84
  • 31. Then pick the two middle numbers (because the number of values is even): 5, 5, 5, 15, 19, 21, 21, 22, 23, 24, 26, 28, 29, 33, 35, 36, 41, 45, 46, 58, 63, 84. The two middle figures are 26 and 28. The average of these two figures is the median, i.e. (26+28)/2 = 27 is the median.
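The ranking and middle-value steps can be reproduced with a short Python sketch using the standard `statistics` module; it sorts the data and, for an even count, averages the two middle values.

```python
from statistics import median

# The 22 values from the median example, in their original order.
values = [22, 29, 35, 24, 26, 15, 28, 36, 45, 21, 33,
          5, 46, 21, 19, 41, 5, 84, 58, 63, 5, 23]

ranked = sorted(values)
n = len(ranked)
# With an even count, the two middle values sit at positions n/2 and n/2 + 1.
middle_pair = (ranked[n // 2 - 1], ranked[n // 2])
med = median(values)

print(middle_pair, med)
```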
  • 32. The Arithmetic Mean. The Arithmetic Mean is the sum of all data values divided by the number of values in the data set. The mean of a sample data set is denoted by $\bar{x}$. The mean of a population data set is denoted by $\mu$.
  • 33. The mean is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n is the number of observations and i runs from 1 to n. Example: use the following data set to compute a sample mean: 1.65kg, 3.3kg, 4.1kg, 3.0kg, 3.1kg, 2.9kg, 2.8kg, 3.2kg, 3.0kg, 3.0kg
  • 34. $\bar{x} = \frac{1.65 + 3.3 + 4.1 + 3 + 3.1 + 2.9 + 2.8 + 3.2 + 3 + 3}{10} = 3.005\text{ kg}$
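The same calculation can be sketched in Python. The code below simply sums the ten weights and divides by their count; rounding is applied only for display.

```python
# Weights (kg) from the sample-mean example.
weights = [1.65, 3.3, 4.1, 3.0, 3.1, 2.9, 2.8, 3.2, 3.0, 3.0]

n = len(weights)
# Sum of all values divided by the number of values.
sample_mean = sum(weights) / n

print(round(sample_mean, 3))
```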
  • 35. Measures of Central Tendency for grouped data The Mode  When data is presented in a frequency distribution, the mode is not found by inspection. The mode for grouped data may be found by using two methods: (a) Graphically (b) analytical (use of a formula) Finding the Mode graphically  Consider the weights of the 100 babies born at Mbulumbuzi Health Centre.
  • 36. Worked Example
      Class Limits    Class Boundaries    Frequency (f)
      1.10-1.50       1.05-1.55           1
      1.60-2.00       1.55-2.05           10
      2.10-2.50       2.05-2.55           14
      2.60-3.00       2.55-3.05           21
      3.10-3.50       3.05-3.55           30
      3.60-4.00       3.55-4.05           13
      4.10-4.50       4.05-4.55           6
      4.60-5.00       4.55-5.05           3
      5.10-5.50       5.05-5.55           2
      Table: Frequency distribution with Class Boundaries. The class boundaries are plotted on the x-axis while the class frequencies are plotted on the y-axis.
  • 37. Figure…… weights of the babies born at Mbulumbuzi Health Centre.
  • 38.  How to determine the mode. 1st step ; identify the modal class (3.05-3.55) 2nd step; identify the frequency of the class before and after the modal class on the chart (2.55- 3.05 and 3.55 – 4.05) These should be identified on the chart as shown in the subsequent figures
  • 40.  In Figure 1.1, the frequency for the class before the modal class is represented by the point A (corner), The frequency for the modal class is represented by the positions B and C and the frequency for the class after the modal class is represented by the point D.  Note that if the frequency of the class before the modal class is higher than that of the class after the modal class, the position (value of the Mode) of the mode is closer to the lower class boundary of the modal class as is the case in Figure 1.2,
  • 41. Finding the Mode analytically: $\text{Mode} = L + \left(\frac{D_1}{D_1 + D_2}\right) \times C$ where L is the lower class boundary of the modal class, $D_1$ is the difference between the frequency of the modal class and the frequency of the class before it, $D_2$ is the difference between the frequency of the modal class and the frequency of the class after it, and C is the class width of the modal class.
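A minimal Python sketch of the analytical method follows. It assumes $D_1$ and $D_2$ are taken as the differences between the modal-class frequency and the frequencies of the neighbouring classes (the usual textbook reading, and the one consistent with the graphical method, where a taller preceding class pulls the mode towards the lower boundary). The function name `grouped_mode` is hypothetical.

```python
def grouped_mode(boundaries, freqs):
    """Mode of grouped data: L + D1 / (D1 + D2) * C."""
    # Index of the modal class (the class with the highest frequency).
    m = max(range(len(freqs)), key=freqs.__getitem__)
    lower, upper = boundaries[m]
    before = freqs[m - 1] if m > 0 else 0
    after = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = freqs[m] - before  # modal frequency minus frequency of class before
    d2 = freqs[m] - after   # modal frequency minus frequency of class after
    return lower + d1 / (d1 + d2) * (upper - lower)


# Birth-weight frequency distribution from the worked example.
boundaries = [(1.05, 1.55), (1.55, 2.05), (2.05, 2.55), (2.55, 3.05),
              (3.05, 3.55), (3.55, 4.05), (4.05, 4.55), (4.55, 5.05),
              (5.05, 5.55)]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

mode = grouped_mode(boundaries, freqs)
print(round(mode, 2))
```

Here the modal class is 3.05-3.55, with $D_1 = 30 - 21 = 9$ and $D_2 = 30 - 13 = 17$.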
  • 42. The Median  Definition – the median is the value which separates the largest 50% of data values from the lowest 50% or the middle value after the data is ranked.  Just like the mode, the median may be found using two main methods; i.e. a. Graphically b. Analytical (use of a formula)
  • 43. Table ……
      Class Limits    Class Boundaries    Frequency (f)    "Or less" cumulative frequency    "Or more" cumulative frequency
      1.10-1.50       1.05-1.55           1                0                                 100
      1.60-2.00       1.55-2.05           10               1                                 99
      2.10-2.50       2.05-2.55           14               11                                89
      2.60-3.00       2.55-3.05           21               25                                75
      3.10-3.50       3.05-3.55           30               46                                54
      3.60-4.00       3.55-4.05           13               76                                24
      4.10-4.50       4.05-4.55           6                89                                11
      4.60-5.00       4.55-5.05           3                95                                5
      5.10-5.50       5.05-5.55           2                98                                2
      >5.55                                                100                               0
  • 44. Finding the Median graphically. We shall first look at a new frequency distribution called the cumulative frequency distribution. This is where the class frequencies are cumulated from 0 to the total frequency ∑f, or from ∑f down to 0.
  • 45. How to compute the cumulative frequencies. The "or less" cumulative frequency. 1st step: ask about the lower class boundary of the first class: how many babies had a value of 1.05 or less? The answer is zero (0). 2nd step: ask about the upper class boundary: how many babies had a value of 1.55 or less? The answer is one (1), which is the frequency for the class 1.05 – 1.55.
  • 46. 3rd step: Next, how many babies had values of 2.05 or less? The answer is 11, which is the 10 in the class 1.55 – 2.05 plus the 1 in the class 1.05 – 1.55. Continue in the same way for the remaining boundaries. The "or more" cumulative frequency distribution is found in a similar manner. The cumulative frequency distribution is used to plot a chart called the Ogive or the Cumulative Frequency Curve.
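The running totals described in these steps can be sketched with `itertools.accumulate`. The lists below are read against the class boundaries 1.05, 1.55, ..., 5.55, so each has one more entry than there are classes.

```python
from itertools import accumulate

# Class frequencies of the birth-weight distribution.
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

# "Or less": how many values lie at or below each boundary, starting from 0.
or_less = [0] + list(accumulate(freqs))
total = or_less[-1]
# "Or more": whatever is not yet counted at each boundary.
or_more = [total - c for c in or_less]

print(or_less)
print(or_more)
```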
  • 48.  In this case there were 100 babies, so the value of the 50th baby can be read on the x-axis which is the Median.
  • 49. Finding the Median analytically. The Median is found by: $\text{Median} = L + \left(\frac{\frac{N}{2} - Cf_b}{f}\right) \times C$ where L is the lower class boundary of the median class, N is the total frequency, f is the frequency of the median class, $Cf_b$ is the cumulative frequency of the class before the median class and C is the class width of the median class.
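A short Python sketch of this formula, applied to the birth-weight distribution, is given below. The function name `grouped_median` is hypothetical; the median class is located by accumulating frequencies until the half-way member (N/2) is reached.

```python
def grouped_median(boundaries, freqs):
    """Median of grouped data: L + ((N/2 - Cf_b) / f) * C."""
    n = sum(freqs)
    cum_before = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(boundaries, freqs):
        if cum_before + f >= n / 2:
            # This is the median class.
            return lower + (n / 2 - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("empty frequency distribution")


boundaries = [(1.05, 1.55), (1.55, 2.05), (2.05, 2.55), (2.55, 3.05),
              (3.05, 3.55), (3.55, 4.05), (4.05, 4.55), (4.55, 5.05),
              (5.05, 5.55)]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

med = grouped_median(boundaries, freqs)
print(round(med, 2))
```

For this table N/2 = 50, the median class is 3.05-3.55 (f = 30) and the cumulative frequency before it is 46.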
  • 50.  The Median class is the class in which the median will be found.  It is the class in which the half-way member is  It can be found by using the cumulative frequencies to identify where the half-way member is.
  • 51. The arithmetic mean  For grouped data, the arithmetic mean has to take into consideration the frequencies as well as the class size.  For each class, the value that represents the class is the class midpoint.  This value will be the one which now will have the stated frequency
  • 52. Table ……
      Class Limits    Class Boundaries    Midpoint (x)    Frequency (f)    fx
      1.10-1.50       1.05-1.55           1.3             1                1.3
      1.60-2.00       1.55-2.05           1.8             10               18
      2.10-2.50       2.05-2.55           2.3             14               32.2
      2.60-3.00       2.55-3.05           2.8             21               58.8
      3.10-3.50       3.05-3.55           3.3             30               99
      3.60-4.00       3.55-4.05           3.8             13               49.4
      4.10-4.50       4.05-4.55           4.3             6                25.8
      4.60-5.00       4.55-5.05           4.8             3                14.4
      5.10-5.50       5.05-5.55           5.3             2                10.6
                                                          Σf = 100         Σfx = 309.5
  • 53. Class midpoint (x) = (Upper class boundary + Lower class boundary)/2. The total (sum of values) is obtained by adding up the fx column. The mean for grouped data is obtained by dividing this total by the sum of frequencies. The arithmetic mean for grouped data is given by $\bar{x} = \frac{\sum fx}{\sum f}$
  • 54. For the data above, $\bar{x} = \frac{\sum fx}{\sum f} = \frac{309.5}{100} = 3.095$
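The fx column and the grouped mean can be reproduced with a few lines of Python. This is a sketch using the midpoints and frequencies from the table; the variable names are illustrative.

```python
# Class midpoints (x) and frequencies (f) from the grouped-data table.
midpoints = [1.3, 1.8, 2.3, 2.8, 3.3, 3.8, 4.3, 4.8, 5.3]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

# fx column: each midpoint weighted by its class frequency.
fx = [f * x for f, x in zip(freqs, midpoints)]
sum_f = sum(freqs)
sum_fx = sum(fx)
grouped_mean = sum_fx / sum_f

print(round(sum_fx, 1), sum_f, round(grouped_mean, 3))
```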
  • 56. Dispersion The measure of the spread or variability No Variability – No Dispersion
  • 57. Measures of Variation There are 2 values used to measure the amount of dispersion or variation. (The spread of the group) 1. Range 2. Standard Deviation
  • 58. Why is it Important? You want to choose the best brand of medicine for your patients. You are interested in how long the drugs take to cure a disease. The choices are narrowed down to 2 different drugs. The results are shown in the chart. Which drug would
  • 59. The chart indicates the number of days a drug takes to cure a particular disease.
      Drug A    Drug B
      10        35
      60        45
      50        30
      30        35
      40        40
      20        25
      Total: 210    Total: 210
  • 60. Does the Average Help? Drug A: Avg = 210/6 = 35 days. Drug B: Avg = 210/6 = 35 days. Both drugs take 35 days on average to cure the disease, so the average is no help in deciding which to choose.
  • 61. Consider the Spread Drug A: Spread = 60 – 10 = 50 days Drug B: Spread = 45 – 25 = 20 days Drug B has a smaller variability which means that it performs more consistently. Choose drug B.
  • 62. Range The range is the difference between the lowest value in the set and the highest value in the set. Range = High # - Low #
  • 63. Example Find the range of the data set. 40, 30, 15, 2, 100, 37, 24, 99 Range = 100 – 2 = 98
  • 64. Deviation from the Mean  A deviation from the mean, x – x bar, is the difference between the value of x and the mean x bar. We base our formulas for variance and standard deviation on the amount that they deviate from the mean.
  • 65. Formulae for sample and population variances
      Definition/computation formulae: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ and $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
      Machine formulae: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$ and $\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}$
  • 66. Standard Deviation. The standard deviation is the square root of the variance: $s = \sqrt{s^2}$
  • 67. Example - Using Formula. Find the variance of the following dataset: 6, 3, 8, 5, 3 (in hours)
      x     x^2
      6     36
      3     9
      8     64
      5     25
      3     9
      $\sum x = 25$, $\sum x^2 = 143$
  • 69. Find the standard deviation. The standard deviation is the square root of the variance: $s = \sqrt{4.5} \approx 2.12$
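The worked example can be checked in Python in two ways: with the machine formula for the sample variance, and with the standard `statistics` module, which uses the same n − 1 divisor. This is a sketch; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import stdev, variance

# Dataset from the example (hours).
hours = [6, 3, 8, 5, 3]
n = len(hours)

sum_x = sum(hours)                  # sum of x
sum_x2 = sum(x * x for x in hours)  # sum of x squared

# Machine formula for the sample variance.
s2_machine = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s_machine = sqrt(s2_machine)

# Library equivalents (sample variance and sample standard deviation).
s2_library = variance(hours)
s_library = stdev(hours)

print(sum_x, sum_x2, s2_machine, round(s_machine, 2))
```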
  • 70. Standard deviation for grouped data. For grouped data, the standard deviation has to take into account the class frequencies, the class width as well as the value of the mean. The mean $\bar{x}$ is calculated as stated earlier: $\bar{x} = \frac{\sum fx}{\sum f}$
  • 71. Worked Example. Compute the standard deviation of the grouped data presented in the table below.
      Class Limits    Frequency (f)
      1.10-1.50       1
      1.60-2.00       10
      2.10-2.50       14
      2.60-3.00       21
      3.10-3.50       30
      3.60-4.00       13
      4.10-4.50       6
      4.60-5.00       3
      5.10-5.50       2
      Total           100
      Table: Frequency distribution.
  • 72. Worked Example
      Class Limits    Class Boundaries    Class Midpoint (x)    Frequency (f)
      1.10-1.50       1.05-1.55           1.3                   1
      1.60-2.00       1.55-2.05           1.8                   10
      2.10-2.50       2.05-2.55           2.3                   14
      2.60-3.00       2.55-3.05           2.8                   21
      3.10-3.50       3.05-3.55           3.3                   30
      3.60-4.00       3.55-4.05           3.8                   13
      4.10-4.50       4.05-4.55           4.3                   6
      4.60-5.00       4.55-5.05           4.8                   3
      5.10-5.50       5.05-5.55           5.3                   2
                                                                $\sum f = 100$
      Table: Frequency distribution with Class Boundaries. For each class, the value that represents the class is the class midpoint.
  • 73. Worked Example
      Class Limits    Class Boundaries    Class Midpoint (x)    Frequency (f)    Deviation ($x - \bar{x}$)
      1.10-1.50       1.05-1.55           1.3                   1                -1.795
      1.60-2.00       1.55-2.05           1.8                   10               -1.295
      2.10-2.50       2.05-2.55           2.3                   14               -0.795
      2.60-3.00       2.55-3.05           2.8                   21               -0.295
      3.10-3.50       3.05-3.55           3.3                   30               0.205
      3.60-4.00       3.55-4.05           3.8                   13               0.705
      4.10-4.50       4.05-4.55           4.3                   6                1.205
      4.60-5.00       4.55-5.05           4.8                   3                1.705
      5.10-5.50       5.05-5.55           5.3                   2                2.205
                                                                $\sum f = 100$
  • 74. From the table above, we need $\sum f(x - \bar{x})^2$. The variance can then be computed as $\frac{\sum f(x - \bar{x})^2}{\sum f} = \frac{65.5475}{100} = 0.655475$, and the standard deviation as $\sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}} = \sqrt{0.655475} \approx 0.8096$.
  • 75. The better formula for computation is: $s = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$
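Both routes to the grouped standard deviation can be sketched in Python, which also makes it easy to re-check the hand arithmetic. The sketch assumes the divisor is $\sum f$, as on the slides, and reads the computational formula as the mean of $x^2$ minus the square of the mean.

```python
from math import sqrt

# Class midpoints (x) and frequencies (f) from the worked example.
midpoints = [1.3, 1.8, 2.3, 2.8, 3.3, 3.8, 4.3, 4.8, 5.3]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

sum_f = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum_f

# Definition: sum of f * (x - mean)^2, divided by sum of f.
sum_sq_dev = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
var_definition = sum_sq_dev / sum_f

# Computational form: (sum of f * x^2) / (sum of f) minus mean squared.
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
var_shortcut = sum_fx2 / sum_f - mean ** 2

sd = sqrt(var_definition)
print(round(sum_sq_dev, 4), round(var_definition, 6), round(sd, 4))
```

The two variance expressions agree up to floating-point rounding, which is a useful sanity check on either calculation.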
  • 76. Interquartile Range • The interquartile range tells you the spread of the middle half of your distribution. • Quartiles segment any distribution that’s ordered from low to high into four equal parts. • The interquartile range (IQR) contains the second and third quartiles, or the middle half of your data set.
  • 78. Remember the range gives you the spread of the whole data set, the interquartile range gives you the range of the middle half of a data set
  • 79. Calculation of IQR The interquartile range is found by subtracting the Q1 value from the Q3 value
  • 80. Formula: IQR = Q3 − Q1. Explanation: IQR = interquartile range; Q3 = 3rd quartile or 75th percentile; Q1 = 1st quartile or 25th percentile
  • 81.  Q1 is the value below which 25 percent of the distribution lies, while Q3 is the value below which 75 percent of the distribution lies.  You can think of Q1 as the median of the first half and Q3 as the median of the second half of the distribution.
  • 82. Methods for finding the interquartile range  Although there’s only one formula, there are various different methods for identifying the quartiles. You’ll get a different value for the interquartile range depending on the method you use.  Here, we will discuss two of the most commonly used methods. These methods differ based on how they use the median.
  • 83. Exclusive method vs inclusive method  The exclusive method excludes the median when identifying Q1 and Q3,  the inclusive method includes the median in identifying the quartiles. Remember!  The procedure for finding the median is different depending on whether your data set is odd- or even-numbered.
  • 84. When you have an odd number of data points, the median is the value in the middle of your data set. You can choose between the inclusive and exclusive method. With an even number of data points, there are two values in the middle, so the median is their mean. It’s more common to use the exclusive method in this case.
  • 85. There is little consensus on the best method for finding the interquartile range. The exclusive interquartile range is generally larger than the inclusive interquartile range.
  • 86. The exclusive interquartile range may be more appropriate for large samples, while for small samples, the inclusive interquartile range may be more representative because it’s a narrower range
  • 87. Steps for the exclusive method  Even-numbered data set (n=10) Step 1: Order your values from low to high.
  • 88. Step 2: Locate the median, and then separate the values below it from the values above it .
  • 89. Step 3: Find Q1 and Q3. Q1 is the median of the first half and Q3 is the median of the second half. Since each of these halves have an odd number of values, there is only one value in the middle of each half.
  • 91. Step 4: Calculate the interquartile range.
  • 92. Odd-numbered data set (n=11) Step 1: Order your values from low to high.
  • 93. Step 2: Locate the median, and then separate the values below it from the values above it.
  • 94. Step 3: Find Q1 and Q3.
  • 95. Step 4: Calculate the interquartile range.
  • 96. Steps for the inclusive method Almost all of the steps for the inclusive and exclusive method are identical. The difference is in how the data set is separated into two halves. The inclusive method is sometimes preferred for odd-numbered data sets because it doesn’t ignore the
  • 97. n=11 Step 1: Order your values from low to high
  • 98. Step 2: Find the median.
  • 99. Step 3: Separate the list into two halves, and include the median in both halves.
  • 100. Step 4: Find Q1 and Q3.
  • 101. Step 5: Calculate the interquartile range.
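The exclusive and inclusive procedures can be sketched as one small Python function. The data sets used on the slides are not reproduced in this transcript, so the `sample` list below is hypothetical, and the function names are made up for illustration. Note that this median-of-halves approach is not the same as every software definition of quartiles.

```python
from statistics import median


def quartiles(values, inclusive=False):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    data = sorted(values)          # Step 1: order from low to high
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even count: the median falls between two values, so the halves
        # are simply the first and last n/2 values.
        lower, upper = data[:half], data[half:]
    elif inclusive:
        # Odd count, inclusive method: the median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # Odd count, exclusive method: the median is left out of both halves.
        lower, upper = data[:half], data[half + 1:]
    return median(lower), median(upper)


def iqr(values, inclusive=False):
    q1, q3 = quartiles(values, inclusive)
    return q3 - q1


# Hypothetical odd-numbered data set (n = 11), not taken from the slides.
sample = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
iqr_exclusive = iqr(sample)
iqr_inclusive = iqr(sample, inclusive=True)
print(iqr_exclusive, iqr_inclusive)
```

For this hypothetical sample the exclusive method gives the wider range, which matches the comparison made in the slides.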
  • 102. When is the interquartile range useful?  The interquartile range is an especially useful measure of variability for skewed distributions.  For these distributions, the median is the best measure of central tendency because it’s the value exactly in the middle when all values are ordered from low to high.  The IQR is also useful for datasets with outliers. Because it’s based on the middle half of the distribution, it’s less influenced by extreme values.
  • 103. Visualize the interquartile range in boxplots A boxplot, or a box-and-whisker plot, summarizes a data set visually using a five-number summary.
  • 104. Every distribution can be organized using these five numbers: Lowest value Q1: 25th percentile Median Q3: 75th percentile Highest value (Q4)
  • 106. The vertical lines in the box show Q1, the median, and Q3, while the whiskers at the ends show the highest and lowest values.
  • 107. In a boxplot, the width of the box shows you the interquartile range. A smaller width means you have less dispersion, while a larger width means you have more dispersion
  • 108. An inclusive interquartile range will have a smaller width than an exclusive interquartile range. Boxplots are especially useful for showing the central tendency and dispersion of skewed distributions.
  • 109. The placement of the box tells you the direction of the skew. A box that’s much closer to the right side means you have a negatively skewed distribution. A box closer to the left side tells you that you have a positively skewed distribution.
  • 112. Introduction to Probability. A Probability Experiment is a process which leads to well-defined results called outcomes. For example, the toss of a coin is a probability experiment because it leads to results called outcomes such as "Heads" and "Tails". There are many such probability experiments, such as the toss of two coins, the roll of a die etc. The set of all possible outcomes from a probability experiment is called a Sample Space.
  • 113. For example  If a coin is tossed, the sample space is {H,T}  If flipping two coins, the sample space is {HH, HT, TH, TT} Event  is one or more outcomes of a probability experiment  Getting a “Head” in a toss of a coin is an event.  Getting “Heads” on both tosses of two coins is an event
  • 114.  Probability is defined as the likelihood of an event happening.  The probability of an event E, denoted P(E) is a definition of how likely that event is to happen.  This definition is usually numerical.  The value of the probability of any event is always between zero and one inclusive
  • 115. Two main approaches to probability 1. The Classical Approach 2. Empirical Approach Classical approach/definition to Probability  The Classical definition of the probability of the event E is defined as the number of ways or times the event E occurs divided by the number of all possible outcomes including the event E.
  • 116. Mathematically, this can be expressed as follows: $P(E) = \frac{\text{The number of times or ways in which the event } E \text{ occurs}}{\text{The total number of all possible outcomes including the event } E}$
  • 117. Example. If a doctor sees 10 patients with malaria, 5 patients with diarrhoea, 15 patients with respiratory problems and 20 patients with skin diseases, he will have seen 50 patients on the day. If he needs to interview, at random, one of the patients seen on the day to give him an indication of how his service was, what is the probability that the patient to be interviewed will have skin diseases?
  • 118. $P(\text{Skin Disease}) = \frac{\text{The number of patients with skin disease}}{\text{The total number of all patients}} = \frac{20}{50} = 0.4$
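The classical definition applied to this example can be sketched in Python with exact fractions. The dictionary keys and the function name `classical_probability` are illustrative choices, not part of the slides.

```python
from fractions import Fraction

# Patients seen on the day, by condition (from the example).
patients = {"malaria": 10, "diarrhoea": 5, "respiratory": 15, "skin": 20}


def classical_probability(event, counts):
    """Ways the event can occur divided by the number of all possible outcomes."""
    return Fraction(counts[event], sum(counts.values()))


p_skin = classical_probability("skin", patients)
print(p_skin, float(p_skin))
```

Using `Fraction` keeps the result exact (2/5), and the probabilities of the four conditions add up to 1.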
  • 119. Empirical approach  Empirical probability is based on past observations.  The empirical probability of an event is the relative frequency of a frequency distribution based upon past observations.  The definition of the empirical probability of any event E is the number of times the event E occurred in the past divided by the total number of times the experiment was carried out
  • 120.  Mathematically, $P(E) = \dfrac{\text{number of times the event } E \text{ occurred}}{\text{total number of times the experiment was carried out}}$
  • 121. Limiting values of probability  When the probability of an event is zero (0), the event is said to be an absolute impossibility i.e. there is absolutely no way the event can happen  When the probability of an event is one (1), the event is said to be an absolute certainty  In general, $0 \le P(E) \le 1$
  • 122.  Class to suggest events in life whose probability is zero  Class to suggest events in life whose probability is one.
  • 123. Counting Rules  1. Factorials Definition: Factorial 4! = 4 x 3 x 2 x 1 and 7! = 7 x 6 x 5 x 4 x 3 x 2 x 1  2. PERMUTATION RULES Definition: $_{n}P_{r} = \dfrac{n!}{(n-r)!}$ Example: $_{5}P_{3} = \dfrac{5!}{(5-3)!} = \dfrac{5 \times 4 \times 3 \times 2 \times 1}{2 \times 1} = 5 \times 4 \times 3 = 60$ and $_{8}P_{5} = \dfrac{8!}{(8-5)!} = \dfrac{8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 8 \times 7 \times 6 \times 5 \times 4 = 6720$
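The factorial and permutation rules above can be verified numerically. The sketch below uses Python's standard `math.factorial`; the helper name `permutations` is an assumption made for illustration.

```python
import math

def permutations(n, r):
    """Number of ordered arrangements of r items chosen from n: n! / (n - r)!."""
    return math.factorial(n) // math.factorial(n - r)

print(math.factorial(4))   # 24
print(permutations(5, 3))  # 60
print(permutations(8, 5))  # 6720
```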
  • 125. Probability Laws  Consider a bag containing coloured marbles; 10 black, 5 red, 5 blue and 3 yellow, then the probability of picking a green marble from this bag is 0 because there are no green marbles in the bag. What is the probability of picking: a black marble? a yellow marble? a marble that is not black?
  • 126. Let's compute the probabilities: $P(\text{black}) = \dfrac{\text{number of ways a black marble appears}}{\text{total number of marbles}} = \dfrac{10}{23}$ and $P(\text{not black}) = \dfrac{\text{number of ways a non-black marble appears}}{\text{total number of marbles}} = \dfrac{13}{23}$
  • 127.  The above results indicate that P(black)=10/23 and P(not black)=13/23 are complementary. They add up to 1 i.e. P(Black) + P(not Black) = 1. This shows that the sum of all probabilities in the sample space is 1, and it gives the basic rule of probability: the probability of an event occurring plus the probability of the event not occurring is equal to 1.
  • 128. P(E) + P(not E) = 1 The Addition law of probabilities
  • 129.  A pack of cards has 52 cards (excluding the Jokers). The cards are in two basic colours, black and red. Of the 52 cards, half (26 cards) are red while the other half are black. The picture above shows the 52 cards.  Flowers, also called Clubs, (13) and Spades (13) are black as shown above while Hearts (13) and Diamonds (13) are red. Each suit has an Ace (the cards on the far left).  The probability of picking a Heart is $P(\text{heart}) = \dfrac{13}{52}$, and likewise $P(\text{diamond}) = P(\text{spade}) = P(\text{flower}) = \dfrac{13}{52}$
  • 130. $P(\text{Red Card}) = \dfrac{\text{number of red cards}}{\text{total number of cards in a pack}} = \dfrac{26}{52}$ Note that the event "Red Card" is a compound event i.e. it contains other events. The event "Red Card" is actually the event "Hearts" or "Diamonds" i.e. $P(\text{Red Card}) = P(\text{Hearts or Diamonds}) = \dfrac{26}{52} = \dfrac{13}{52} + \dfrac{13}{52}$
  • 131.  This observation is actually true for any two events which are mutually exclusive.  Events are said to be mutually exclusive if they both cannot happen at the same time.  If two events are mutually exclusive, then the probability of either event occurring is the sum of the probabilities of each occurring.  This is called the Addition Law of probabilities for mutually exclusive events.
  • 132.  In general therefore, if two events A and B are mutually exclusive, then the probability of event A or B happening is the sum of the individual probabilities, i.e. $P(A \text{ or } B) = P(A) + P(B)$ Example 2. What is the probability of picking a Spade or a Heart from a pack of cards? Example 3. What is the probability of picking a Spade or an Ace from a pack of cards?
  • 133.  Note that the two events “Spade” and “Ace” are not mutually exclusive because they both can happen at the same time, i.e. there is a card that is both an Ace and a Spade. The card is the Ace of Spades. For events that are not mutually exclusive, the probability of the overlap must be subtracted so that it is not counted twice: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
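Examples 2 and 3 can be worked by enumerating a pack of cards and counting. The sketch below is illustrative only: the suit and rank labels and the helper names (`prob`, `is_spade`, and so on) are assumptions, not taken from the slides.

```python
from fractions import Fraction
from itertools import product

suits = ["Spades", "Flowers", "Hearts", "Diamonds"]
ranks = ["Ace"] + [str(n) for n in range(2, 11)] + ["Jack", "Queen", "King"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def prob(event):
    """Classical probability: cards satisfying the event / all cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_spade(card):
    return card[1] == "Spades"

def is_heart(card):
    return card[1] == "Hearts"

def is_ace(card):
    return card[0] == "Ace"

# Example 2: mutually exclusive events, so the probabilities simply add.
p_spade_or_heart = prob(lambda c: is_spade(c) or is_heart(c))
assert p_spade_or_heart == prob(is_spade) + prob(is_heart)

# Example 3: not mutually exclusive, so the Ace of Spades is subtracted once.
p_spade_or_ace = prob(lambda c: is_spade(c) or is_ace(c))
p_overlap = prob(lambda c: is_spade(c) and is_ace(c))
assert p_spade_or_ace == prob(is_spade) + prob(is_ace) - p_overlap

print(p_spade_or_heart)  # 1/2
print(p_spade_or_ace)    # 4/13
```

Counting directly and applying the addition law give the same answers: 26/52 for Example 2 and 13/52 + 4/52 - 1/52 = 16/52 for Example 3.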
  • 134. The Multiplication law of probabilities  Consider the toss of a coin. The probability of getting a “Head” when a coin is tossed is 0.5.  Suppose one wants to have two tosses. Is there a difference in outcomes if one person tosses twice compared to two people tossing once? Why?  The discussion will have shown that tossing a coin twice by the same person is the same as two people tossing a coin once each.  The reason is that, as far as outcomes are concerned, the result of the first toss is independent of the result of the second toss when one person tosses a coin twice.
  • 135.  In general, events are said to be independent if the occurrence of one event does not affect the occurrence of the other in any way or two events are independent if the occurrence of one does not change the probability of the other occurring.  Consider the toss of two coins; what is the probability of getting “Heads” on both tosses?
  • 136.
      COIN A   COIN B
      H        H
      H        T
      T        H
      T        T
    There are 4 possible outcomes when two coins are tossed (HH, HT, TH and TT).  Out of the four possible outcomes, only one has Heads (H) on both Coin A and Coin B.
  • 137.  The probability of getting "Heads" on both coins when two coins are tossed is  P(Head and Head) = $\frac{1}{4}$, but $\frac{1}{4} = \frac{1}{2} \times \frac{1}{2}$  P(Head and Head) = $\frac{1}{2} \times \frac{1}{2}$  In general, if events A and B are independent, the probability of event A and event B happening is given by;  P(A and B) = P(A) x P(B)
  • 138. There are 20 marbles in the bag of which 6 are red, 2 are blue and the rest are white. What is the probability of picking a white marble from the bag? $P(E) = \dfrac{\text{number of white marbles in the bag}}{\text{total number of all possible outcomes, including the white marbles}} = \dfrac{12}{20}$
  • 139.  If two marbles were to be picked from the bag with replacement, what is the probability that both marbles would be white?  The answer to this question is from the multiplication law of probabilities i.e. P(A and B) = P(A) x P(B)  However, if the marbles are picked without replacement the situation would be different.
  • 140.  The probability of picking a white marble the first time would remain $\frac{12}{20}$ because the number of white marbles is 12 and the total number of marbles in the bag is 20.  When the first marble is picked and then not put back in the bag, the total number of marbles in the bag reduces to 19.  The probability of picking a white marble the second time depends on the colour of the marble that has been taken out of the bag.  If the marble taken out of the bag from the first pick is white, then the probability of a white marble the second time around is $\frac{11}{19}$
  • 141.  If the marble taken out of the bag from the first pick is not white, then the probability of a white marble the second time around is $\frac{12}{19}$  This therefore means that there are two possible ways of getting a white marble on the second pick without replacement;  P(White and White) = $\frac{12}{20} \times \frac{11}{19}$ if the marble in the first pick was white.  P(Not White and White) = $\frac{8}{20} \times \frac{12}{19}$ if the marble in the first pick was not white.
  • 142.  In other words, the probability of picking a white marble the second time is dependent on the result of the first pick.  We say that the probability of picking a white marble the second time is conditional on the result of the first experiment.  In general, if two events A and B are not independent, i.e. the occurrence of one event does affect the probability of the other occurring (the events are dependent) the probability of both events happening is given by;  P(A and B)=P(A) x P(B|A)
  • 143.  The probability of event B occurring given that event A has already occurred is read "the probability of B given A" and is written: P(B|A) this is the conditional probability of B given that the event A has already occurred and we have the result of that experiment.
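The difference between sampling with and without replacement can be made concrete with the bag of 20 marbles (12 of them white). The following is a small sketch; the variable names are illustrative assumptions.

```python
from fractions import Fraction

white, total = 12, 20

# With replacement the two picks are independent: P(A and B) = P(A) x P(B).
with_replacement = Fraction(white, total) * Fraction(white, total)

# Without replacement the second pick is conditional on the first:
# P(A and B) = P(A) x P(B|A), with one white marble and one marble removed.
p_first_white = Fraction(white, total)
p_second_white_given_first_white = Fraction(white - 1, total - 1)
without_replacement = p_first_white * p_second_white_given_first_white

print(with_replacement)     # 9/25
print(without_replacement)  # 33/95
```

The second result is 12/20 x 11/19, which is slightly smaller than 12/20 x 12/20 because one white marble has already left the bag.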
  • 144. Probability tree diagrams  Calculating probabilities can sometimes be confusing. It may not be easy to tell when to use the addition law, the multiplication law or a combination of these.  The probability tree diagram is a tool that can be used to simplify otherwise complex looking probability problems.  A tree diagram is simply a way of representing a sequence of events which are a set of combinations of all possible outcomes from a situation.  A Tree diagram helps us to see all possible outcomes of an event at a glance and simplifies
  • 145. Example  A hospital procurement department advertised for three contracts for the supply of gloves worth hundreds of thousands (Contract A), laboratory equipment worth millions (Contract B) and a dialysis machine worth tens of millions (Contract C). A supply company bids for the three contracts. The probability of getting contract A is 0.85. The probability of getting contract B depends on whether they get contract A or not. The probability of getting contract B if they get A is 0.9 but only 0.2 if they fail to get contract A. The probability of getting contract C depends on whether they get contract B.
  • 146. It is 0.95 if they get B but only 0.1 if they fail to get contract B after getting contract A. If they fail to get A and get B, the probability of getting contract C is 0.6. If they fail to get A and fail to get B, they are not allowed to bid for contract C.
  • 147.  Draw a tree diagram to illustrate the probabilities of the outcomes. What is the probability of:  Getting all three contracts  Getting two contracts only  Getting only one contract  Getting no more than two contracts  Getting at least two contracts
  • 148.  Getting at most two contracts  Getting no contract at all  Getting contract B but not contract C  Getting contract C but not contract  Getting contract A but not the other contracts
  • 149. Probability Distributions  The right way is to start by introducing probability density functions and that will lead to aspects of calculus such as integrals which would put off many.  I will introduce probability distributions as the distribution or break up of the total probability of 1 into several possible events or outcomes.  As an example, the tree diagram has several branches which are events or outcomes. The total of probabilities from all branches is 1 but is distributed into several events
  • 150.
  • 151.  The total probability= 0.72675+0.03825+0.0085+0.0765+0.018+0.012+ 0.12 = 1  The total probability of 1 is distributed into seven different events or outcomes. The seven events are; (i) get A, get B and get C (ii) get A, get B and fail to get C (iii) get A, fail to get B, get C (iv) get A, fail to get B, fail to get C (v) fail to get A, get B and get C (vi) fail to get A, get B and fail to get C (vii) fail to get A and fail to get B
  • 152.  The probabilities of each of the seven events are presented on the ends of each branch of the tree diagram above.
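The seven branch probabilities of the contract example can be reproduced by multiplying along each branch of the tree. The sketch below encodes the probabilities stated in the example; the dictionary layout and the helper `pick` are assumptions made for illustration.

```python
p_a = 0.85
p_b_given = {True: 0.9, False: 0.2}            # P(B | got A?), from the example
p_c_given = {(True, True): 0.95,               # P(C | got A?, got B?)
             (True, False): 0.1,
             (False, True): 0.6}               # no bid for C after losing A and B

def pick(p, happened):
    """Probability of the branch: p if the event happened, else 1 - p."""
    return p if happened else 1 - p

branches = {}
for got_a in (True, False):
    for got_b in (True, False):
        p_ab = pick(p_a, got_a) * pick(p_b_given[got_a], got_b)
        if (got_a, got_b) not in p_c_given:
            branches[(got_a, got_b, None)] = p_ab  # branch ends: C is not bid for
            continue
        for got_c in (True, False):
            branches[(got_a, got_b, got_c)] = p_ab * pick(p_c_given[(got_a, got_b)], got_c)

for outcome, p in branches.items():
    print(outcome, round(p, 5))
print("total:", round(sum(branches.values()), 5))  # total: 1.0
```

Each key is a tuple (got A, got B, got C), so the questions on the earlier slide can be answered by adding the branches that match; for instance, the probability of getting all three contracts is the single branch `(True, True, True)`.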
  • 153.  A listing of all the values a random variable can assume, with their corresponding probabilities, makes a probability distribution. For example, the toss of a coin:
      Expected Outcome (X)   Head   Tail   Total
      Probability P(X)       1/2    1/2    1
    The total probability is 1.
  • 154.  In many other situations, the total probability will have to be distributed into several events or outcomes (leading to fractions which will eventually have to add up to 1).  This is basically the whole concept of probability distributions.
  • 155.  A random variable does not mean that the values can be anything (a random number)  Random variables have a well defined set of outcomes and well defined probabilities for the occurrence of each outcome.  For example, if you toss a coin, the known outcomes are Heads and Tails and the probability of each is 0.5 only that when a coin is to be tossed, the outcome is not known, it can be any hence the term random.
  • 156.  Similarly, when a die is rolled, the known outcomes are 1, 2, 3, 4, 5 and 6; the probabilities of each event are also known to be 1/6 but when the die is being rolled, any outcome can appear.  The random refers to the fact that the outcomes happen by chance -- that is, you don't know which outcome will occur next.
  • 157.  Here's an example of a probability distribution that results from the rolling of a single fair die.
      X      1     2     3     4     5     6     Sum
      P(x)   1/6   1/6   1/6   1/6   1/6   1/6   6/6 = 1
  • 158. The Binomial Probability Distribution  The binomial distribution is one of the discrete probability distributions. It is discrete because the outcomes of binomial experiments are whole numbers rather than fractions.  Binomial experiments find probabilities of whole number items and not fractional ones
  • 159. Binomial Experiment  A binomial experiment has the following; 1. A fixed number of trials 2. Each trial is independent of the others 3. There are only two outcomes 4. The probability of each outcome remains constant from trial to trial.  These can be summarized as: An experiment with a fixed number of independent trials, each of which can only have two possible outcomes.
  • 160. Examples of Binomial Experiments  Tossing a coin 6 times to see how many tails occur. There is a fixed number of tosses, i.e. 6. Each toss has two possible outcomes. Each toss is independent of the other and results of each toss do not affect the results of the other tosses. The probability of getting a Head or Tail is the same throughout the 6 tosses. Asking 20 people if they watch Television Malawi (TVM). You ask a fixed number of people i.e. 20. There are two possible outcomes, either they watch or they
  • 161.  Rolling a die 5 times to see if a 5 appears.  The outcomes from tossing of coins can be arranged in a triangular pattern deliberately.  The pattern gives us a clue as to how we can have outcomes for 5 coins and more! Observe the coefficients of the outcomes. We shall isolate them and present them as follows;
  • 162. Tossing of 4 coins.
      No. of coins   Outcomes
      1              1 1
      2              1 2 1
      3              1 3 3 1
      4              1 4 6 4 1
  • 163.  You will observe that each coefficient is the sum of two coefficients above it! Such that for 5 coins, we can come up with the coefficients as follows; 1 5 10 10 5 1 For 6 coins, the coefficients will be; 1 6 15 20 15 6 1 etc.
  • 164.  The outcomes start with all successes on the left, reduce by one every step and end with all failures on the right.  The other observation is that the number of coins is the second coefficient.  The other thing to note is that the coefficients are symmetrical, whatever is on the left is the same on the right.  This triangle is called Pascal’s triangle in honour of Blaise Pascal, a French mathematician who studied it.
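The rule that each coefficient is the sum of the two coefficients above it translates directly into code. This is a minimal sketch; the function name `pascal_row` is an illustrative assumption.

```python
def pascal_row(n):
    """Coefficients for n coins (row n of Pascal's triangle)."""
    row = [1]
    for _ in range(n):
        # Each inner entry is the sum of the two entries above it.
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

for coins in range(1, 7):
    print(coins, pascal_row(coins))
# The last two lines printed are:
# 5 [1, 5, 10, 10, 5, 1]
# 6 [1, 6, 15, 20, 15, 6, 1]
```

The output also shows the other two observations: the second coefficient equals the number of coins, and every row reads the same from left to right as from right to left.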
  • 165.  If the probability of success is denoted p and the probability of failure q then the outcomes may be presented in terms of probabilities as follows;
      No. of coins   Probabilities
      1              $p$, $q$
      2              $p^2$, $2pq$, $q^2$
      3              $p^3$, $3p^2q$, $3pq^2$, $q^3$
      4              $p^4$, $4p^3q$, $6p^2q^2$, $4pq^3$, $q^4$
  • 166.  For each experiment (coin), the total probability is always equal to 1, i.e. $p + q = 1$; $p^2 + 2pq + q^2 = 1$; $p^3 + 3p^2q + 3pq^2 + q^3 = 1$; $p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4 = 1$, etc.
  • 167.  From your mathematics in secondary school, you will recall the expansion of binomials such as $(x+y)^2$, $(a+b)^3$ etc.  If you expand $(a+b)$, $(a+b)^2$, $(a+b)^3$, $(a+b)^4$, ..., the coefficients of the terms are exactly the same as the ones in Pascal’s triangle such that we can use this property for probabilities i.e., for any n binomial trials or experiments whereby the probability of success is p and the probability of failure is q, the probability distribution of the n experiments is given by:
  • 168. $(p+q)^n = p^n + np^{n-1}q + \dfrac{n(n-1)}{2!}p^{n-2}q^2 + \dfrac{n(n-1)(n-2)}{3!}p^{n-3}q^3 + \cdots$ Example: What is the probability of rolling exactly two sixes in 6 rolls of a die? There are five basic things you need to do to work a binomial problem like this one.
  • 169. 1. Firstly define Success. Success in this case must be for a single trial. Success = "Rolling a 6 on a single die" 2. Define the probability of success p: p = 1/6 3. Find the probability of failure which is 1 - p: q = 5/6 4. Define the number of trials: n = 6 5. Define the number of successes out of those trials: x = 2
  • 170. $(p+q)^6 = p^6 + 6p^{6-1}q + \dfrac{6(6-1)}{2!}p^{6-2}q^2 + \dfrac{6(6-1)(6-2)}{3!}p^{6-3}q^3 + \cdots$ We need the term containing $p^2$, which is the probability of two successes. The term is: $\dfrac{6(6-1)(6-2)(6-3)}{4!}p^{6-4}q^4$
  • 172. Apart from using knowledge of Pascal’s Triangle, we can use the knowledge of counting rules  Example: What is the probability of rolling exactly two sixes in 6 rolls of a die?
  • 173. 1.Firstly define Success. Success in this case must be for a single trial. Success = "Rolling a 6 on a single die" 2. Define the probability of success p: p = 1/6 3. Find the probability of failure which is 1 - p: q = 5/6 4. Define the number of trials: n = 6 5. Define the number of successes out of those trials: x = 2
  • 175. Example:  A coin is tossed 10 times. What is the probability that exactly 6 heads will occur?
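Both binomial examples above (two sixes in 6 rolls, six heads in 10 tosses) can be computed from the five quantities listed in the steps. The sketch below assumes the standard binomial formula, the number of combinations of n trials taken x at a time multiplied by p^x q^(n-x); the function name `binomial_probability` is illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_probability(n, x, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Two sixes in 6 rolls of a die: n = 6, x = 2, p = 1/6, q = 5/6.
p_two_sixes = binomial_probability(6, 2, Fraction(1, 6))
print(p_two_sixes)  # 3125/15552, about 0.2009

# Six heads in 10 tosses of a coin: n = 10, x = 6, p = 1/2.
p_six_heads = binomial_probability(10, 6, Fraction(1, 2))
print(p_six_heads)  # 105/512, about 0.2051
```

For the die, `comb(6, 2)` is 15, the same coefficient that appears in row 6 of Pascal's triangle.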
  • 176. Mean, Variance and Standard Deviation
  • 177. Example:  Find the mean, variance, and standard deviation for the number of sixes that appear when rolling 30 dice.
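For a binomial variable the standard results are mean = np, variance = npq and standard deviation = the square root of npq. The formulas did not survive on the preceding slide, so they are stated here as an assumption; the sketch applies them to the 30 dice.

```python
import math
from fractions import Fraction

n = 30                # number of dice (trials)
p = Fraction(1, 6)    # probability of a six on one die
q = 1 - p             # probability of not rolling a six

mean = n * p
variance = n * p * q
std_dev = math.sqrt(variance)

print(mean)               # 5
print(variance)           # 25/6
print(round(std_dev, 4))  # 2.0412
```

So about 5 sixes are expected, give or take roughly 2.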
  • 178. Normal Distribution • Bell shaped. • Also called the “Gaussian curve” after the mathematician Karl Friedrich Gauss.
  • 179. Properties of a Normal Distribution • Normal distributions are symmetric around their mean. • The mean, median, and mode of a normal distribution are equal and located at the peak. • The area under the normal curve is equal to 1.0. • Normal distributions are denser in the center and less dense in the tails.
  • 180. Properties of a Normal Distribution The normal probability distribution is asymptotic: the curve gets closer and closer to the x-axis but never actually touches it. Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
  • 181. Properties of a Normal Distribution Approximately 68% of the area of a normal distribution is within one standard deviation of the mean. Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
  • 182. Properties of a Normal Distribution The density of the normal distribution (the height for a given value on the x-axis) is: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ The parameters μ and σ are the mean and standard deviation, respectively, and define the normal distribution. The symbol e is the base of the natural logarithm and π is the constant pi.
  • 183. Empirical Rule • Approximately 68% of the data lies in the interval $\mu \pm \sigma$ Figure 1. Empirical Rule
  • 184. Empirical Rule Example 1: Figure 2 shows a normal distribution of age of patients with a mean of 50 years and a standard deviation of 10 years. The shaded area is between 40 years and 60 years. What proportion of the distribution does the area contain? Figure 2. Normal distribution of age of patients
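The normal density can be evaluated directly from its definition, with μ the mean and σ the standard deviation. The following is a small sketch; the function name `normal_pdf` and the example values (a mean of 50 and a standard deviation of 10) are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The curve peaks at the mean and is symmetric around it.
peak = normal_pdf(50, mu=50, sigma=10)
left = normal_pdf(40, mu=50, sigma=10)
right = normal_pdf(60, mu=50, sigma=10)
print(round(peak, 5), round(left, 5), round(right, 5))
```

The heights at 40 and 60 are equal and lower than the height at 50, which matches the symmetry and bell-shape properties listed earlier.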
  • 185. Empirical Rule Example 2: A normal distribution of concentration of glycogen in the blood has a mean of 75mg and a standard deviation of 10. The shaded area on the normal distribution graph extends from 55.4mg to 94.6mg.
  • 186. a. How many standard deviations are within the shaded area? b. Using Empirical rule, approximate the proportion of the shaded area under the curve.
  • 187. Standard Normal Distribution The standard score and the standardized variable For a population, the standard score (also called the normal deviate, z score or z value) is defined as: $z_i = \dfrac{x_i - \mu}{\sigma}$ and for a sample it is: $z_i = \dfrac{x_i - \bar{x}}{s}$
  • 188. Standard Normal Distribution The standard score (z) shows how far any given data value $x_i$ is from the mean of the distribution in standard deviation units, i.e. how many standard deviations the value is from the mean.
  • 189. Standard Normal Distribution When for any variable X, each measurement value in a sample or population is transformed into a z value, this process is known as standardizing (or normalizing) the variable, and the resulting variable Z is called a standardized variable.
  • 190. Standard Normal Distribution Example 3: Assuming the following sample follows normal distribution, first calculate $\bar{x}$ and $s$, and then standardize the sample to have a standard normal distribution: 3, 5, 7, 9, and 11.
  • 191. Standard Normal Distribution Solution: $\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{35}{5} = 7$ and $s = \sqrt{\dfrac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}} = \sqrt{\dfrac{5(285) - (35)^2}{5(4)}} = 3.16228$
  • 192. Standard Normal Distribution Having determined $s$ and $\bar{x}$, we can proceed and compute the z score for each observation: $z_1 = \dfrac{x_1 - \bar{x}}{s} = \dfrac{3 - 7}{3.16228} = -1.2649$; $z_2 = \dfrac{5 - 7}{3.16228} = -0.6325$; $z_3 = \dfrac{7 - 7}{3.16228} = 0$; $z_4 = \dfrac{9 - 7}{3.16228} = 0.6325$; $z_5 = \dfrac{11 - 7}{3.16228} = 1.2649$
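The standardization in Example 3 can be reproduced with Python's `statistics` module, whose `stdev` function uses the sample (n - 1) formula. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

sample = [3, 5, 7, 9, 11]

x_bar = statistics.mean(sample)   # sample mean
s = statistics.stdev(sample)      # sample standard deviation (n - 1 in the denominator)

# Standardize: subtract the mean and divide by the standard deviation.
z_scores = [(x - x_bar) / s for x in sample]

print(x_bar, round(s, 5))               # 7 3.16228
print([round(z, 4) for z in z_scores])  # [-1.2649, -0.6325, 0.0, 0.6325, 1.2649]
```

Values below the mean get negative z scores and values above it get positive ones; the value equal to the mean has z = 0.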
  • 193. Finding Areas under the Standard Normal Distribution curve Standard Normal Cumulative Probability Table provides the cumulative distribution function for values of z rounded to the nearest hundredth.
  • 194. This table provides the area under the standard normal curve for values of z less than those identified in the table. This is illustrated in the figure on the right with the shaded region, labelled probability.
  • 195. Figure: Area under the curve
  • 196.  The table below demonstrates how to use the table to find the area under the standard normal curve that lies to the left of Z value.  Lets suppose Z= 1.46. Notice that the value 1.46 = 1.4 + .06.  The value 1.4 is found by scrolling down the first column of the table and the value .06 is found by moving right across the top row.
  • 197.  The intersection within the table of the row of 1.4 and the column of .06 is the value .9279. This is the area under the normal curve to the left of Z = 1.46.
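The table lookup for Z = 1.46 can be cross-checked numerically. The sketch below computes the cumulative area from the error function in Python's `math` module, which is one common way of evaluating the standard normal cumulative distribution; the function name `standard_normal_cdf` is an illustrative assumption.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(standard_normal_cdf(1.46), 4))  # 0.9279
```

Rounded to four decimal places, the computed area agrees with the table entry at row 1.4 and column .06.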
  • 198. Table 1. Standard Normal Cumulative Probability Table
  • 199. Often times, we are interested in finding the Z-score that corresponds to a given area under the standard normal curve. The process involves searching the array of area values and working backwards to find the Z-score
  • 200. Example 4: Using the tables, find the Z-score that corresponds to an area of 0.9050 under the standard normal curve to the left of the Z-score. When searching the array of values, the closest one we see is .9049. This value is in the row of 1.3 and the column of .01. Thus, the Z-
  • 201. Table 2. Standard Normal Cumulative Probability Table
  • 202. Exercises 1. Use Tables to find the following areas under the standard normal curve. 1. The area that lies to the left of Z = -0.58. 2. The area that lies between Z = -1.16 and Z = 2.71. 3. The area that lies to the right of Z = 0.31.
  • 203. Exercises 2. 1. Find the Z-score so that the area to the left of the Z-score is 0.10. 2. Find the Z-score so that the area to the right of the Z-score is 0.0735.
  • 204.  We are often interested in finding the Z-score that has a specified area to the right. For this reason, we have special notation to represent this situation
  • 205.  The notation $z_\alpha$ (pronounced "Z sub alpha") is the Z-score such that the area under the standard normal curve to the right of $z_\alpha$ is $\alpha$  Find the value of $z_{0.05}$
  • 206.  This means that the area under the curve to the right of $z_{0.05}$ is 0.05 and we need to find the corresponding z value. Since our tables give areas to the left of a z score, let’s find the area to the left of the z score, i.e. 1 - 0.05 = 0.95
  • 207.  Now let’s find the z score corresponding to the 0.95. From the tables, the corresponding z value is 1.65
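Working backwards from an area to a Z-score can also be done with the inverse cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
area_to_left = 1 - alpha  # the tables give areas to the left of z

# inv_cdf returns the z value whose left-hand area equals the argument.
z_alpha = NormalDist().inv_cdf(area_to_left)
print(round(z_alpha, 3))  # 1.645
```

The exact value lies between the table entries for 1.64 and 1.65, which is why tables are usually read as 1.65 or 1.645.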
  • 208. as a probability distribution curve  Recall that the area under the standard normal distribution can be interpreted as either a probability or as the proportion of the population with the given characteristic. When interpreting the area under the standard normal curve as a probability, we use the following notation  Notation for the Probability of a Standard Normal Random Variable  P(a < Z < b) represents the probability that a standard normal random variable is between a and b
  • 209. P(Z > a) represents the probability that a standard normal random variable is greater than a. P(Z < a) represents the probability that a standard normal random variable is less than a.  Example 5: Let Z denote a sample of glucose amount in the blood of patients which follows a normal distribution with a mean of 0 and standard deviation of 1. a. Find P (Z > 2). b. Find P (Z ≤ 1.73).
  • 210.  Solution: a. Since μ=0 and σ=1, the value 2 is z=2 standard deviations above the mean. Proceed down the first (z) column in the standard normal tables and read the area opposite z=2.0. This area, denoted by the symbol P(z), is P(2.0)=0.9772. But this is the probability to the left of the z score, so P(Z > 2)=1-0.9772=0.0228. b. For z=1.73, the tables give P(1.73)=0.9582. Therefore P(Z ≤ 1.73)=0.9582
  • 211.  Example 6: The achievement scores for a college entrance examination are normally distributed with mean 75 and standard deviation 10. What fraction of the scores lies between 80 and 90?  Solution The desired fraction of the population is given by the area between $z_1 = \dfrac{80 - 75}{10} = 0.5$ and $z_2 = \dfrac{90 - 75}{10} = 1.5$
  • 212.  P(0.5 < z < 1.5)=P(1.5)-P(0.5)=0.9332-0.6915=0.2417, where P(z) is the table area to the left of z  Therefore the fraction of the scores lying between 80 and 90 is 0.2417
  • 213.  Exercises 3.  Let Z denote a normal random variable with mean 0 and standard deviation 1.  Find P(−2 ≤ Z ≤ 2).  The grade point averages (GPAs) of a large population of Public Health College students are approximately normally distributed with mean 2.4 and standard deviation 0.8. If students possessing a GPA less than 1.9 are dropped from college, what percentage of the students will be dropped?
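Example 6 can be checked without tables by taking the difference of two cumulative areas. This sketch uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)

# Area between 80 and 90 = (area to the left of 90) - (area to the left of 80).
fraction = scores.cdf(90) - scores.cdf(80)
print(round(fraction, 4))  # 0.2417

# The same answer on the standard normal scale, using z = (x - mean) / sd.
z1, z2 = (80 - 75) / 10, (90 - 75) / 10
standard = NormalDist()
print(round(standard.cdf(z2) - standard.cdf(z1), 4))  # 0.2417
```

Both routes give the same fraction, which illustrates why standardizing lets a single table serve every normal distribution.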
  • 214.  The weekly amount of money spent on cleaning the city was observed, over a long period of time, to be approximately normally distributed with mean $400 and standard deviation $20. How much should be budgeted for weekly cleaning to provide that the probability the budgeted amount will be exceeded in a given week is only 0.1?
  • 215. Suppose a clinically accepted value for mean systolic blood pressure in males aged 20 to 24 years is 120 mmHg and the standard deviation is 20 mmHg. a). If a 22 year old male is selected at random from the population, what is the probability that his systolic blood pressure is equal to
  • 217.  Statistical inference is the estimation of the population parameters such as the population mean, the population proportion etc. derived from the analysis of a sample drawn from that population.  A sample is a small part of the population which is analysed as an example of the character, features or qualities of the population.
  • 218.  Sampling is the process of selecting a sample of people or products from a population which is to be used as a representative of the population of interest.  An estimate is an approximate calculation of something and estimation is the process of coming up with an estimate of a population parameter.  There are several sampling methods which are important for you to know in order to appreciate the process of sampling and estimation, the methods are briefly described below and you should take time to read around them from other
  • 219. Sampling Methods Probability sampling Non probability sampling In the probability sample every member of the wider population has an equal chance of being included in the sample; inclusion or exclusion from the sample is a matter of chance and nothing else. In the non-probability sample some members of the wider population definitely will be excluded and others definitely included (i.e. every member of the wider population does not have an equal chance of being included in the sample)
  • 220. Types of Probability Sample 1. Simple random sampling Each member of the population under study has an equal chance of being selected and the probability of a member of the population being selected is unaffected by the selection of other members of the population. One problem associated with this particular sampling method is that a complete list of the population is needed and this is not always readily available
  • 221. 2. Systematic Sampling It involves selecting subjects from a population list in a systematic rather than a random fashion. For example, if from a population of, say, 2,000, a sample of 100 is required, then every twentieth person can be selected. The starting point for the selection is chosen at random.
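The systematic sampling procedure described above (a random starting point, then every twentieth person) can be sketched as follows. The function name `systematic_sample` and the numbered population list are hypothetical, introduced only for illustration.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Pick a random start, then take every k-th member of the list."""
    step = len(population) // sample_size  # sampling interval k
    start = rng.randrange(step)            # random starting point within the first k
    return population[start::step][:sample_size]

# Hypothetical population list of 2,000 people, numbered 1 to 2000.
population = list(range(1, 2001))
sample = systematic_sample(population, 100, random.Random(0))

print(len(sample))  # 100
print(sample[:3])   # the first three selected members, 20 apart
```

Only the starting point is random; once it is chosen, the rest of the sample is fixed by the interval of 20.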
  • 222. 3. Stratified random sample  Stratified sampling involves dividing the population into homogenous groups, each group containing subjects with similar characteristics.  A stratified random sample is, therefore, a useful blend of randomization and categorization, thereby enabling both a quantitative and qualitative piece of research to be undertaken.
  • 223. 4. Cluster sampling  It involves the sampling of successively smaller units  Conditions for doing cluster sampling 1. The sampling frame cannot be identified 2. Direct contact needs to be made with the sample units, but these are scattered around a wide geographical area
  • 224.  Cluster sampling is an example of 'two-stage sampling' or 'multistage sampling': in the first stage a sample of areas is chosen; in the second stage a sample of respondents within those areas is selected. 
  • 225.  Multistage sampling Multistage sampling is a complex form of cluster sampling in which two or more levels of units are embedded one in the other. The first stage consists of constructing the clusters that will be used to sample from. In the second stage, a sample of primary units is randomly selected from each cluster (rather than using all units contained in all selected clusters). In following stages, in each of those selected clusters, additional samples of units are selected, and so on.
  • 226.  All ultimate units (individuals, for instance) selected at the last step of this procedure are then surveyed. This technique, thus, is essentially the process of taking random samples of preceding random samples.
  • 227. Non probability samples 1. Convenience (Accidental/Opportunity) Sampling It involves choosing the nearest individuals to serve as respondents and continuing that process until the required sample size has been obtained The researcher simply chooses the sample from those to whom she has easy access. As it does not represent any group apart from itself, it does not seek to generalize about the wider population
  • 228. 2. Quota Sampling A quota sample strives to represent significant characteristics (strata) of the wider population and it sets out to represent these in the proportions in which they can be found in the wider population. For example, suppose that the wider population (however defined) were composed of 55% females and 45% males, then the sample would have to contain 55% females and 45% males
  • 229. 3. Purposive Sampling In purposive sampling, researchers handpick the cases to be included in the sample on the basis of their judgement of their typicality. In this way, they build up a sample that is satisfactory to their specific needs Assumptions for one to use purposive sampling: 1. They possess the necessary knowledge 2. They have relevant experience 3. They are part of the social structure or process on which the research is intended to focus
  • 230. 4. Snowball Sampling Researchers identify a small number of individuals who have the characteristics in which they are interested. These people are then used as informants to identify, or put the researchers in touch with, others who qualify for inclusion and these, in turn, identify yet others. This method is useful for sampling a population where access is difficult, maybe because it is a sensitive topic or where communication networks are undeveloped
  • 231. “What sample size do I need?” The answer to this question is influenced by a number of factors, including:  the purpose of the study, population size, the risk of selecting a “bad” sample and the allowable sampling error.  Data analysis plan, e.g. the number of cells one will have in cross tabulation  Most of all, whether undertaking a qualitative or quantitative study
  • 232. Sample size determination in qualitative study  Probability sampling not appropriate as sample not intended to be statistically representative  But, sample should have ability to represent salient characteristics in population.  Sample size taken until point of theoretical saturation
  • 233. …….  Sample size is usually small to allow in-depth exploration and understanding of phenomena under investigation  Ultimately a matter of judgement and expertise in evaluating the quality of information against final use, research methodology , sampling strategy and results is necessary.  In practice, qualitative sampling usually requires a flexible, pragmatic approach.
  • 234. …..  The researcher actively selects the most productive sample to answer the research question.  This can involve developing a framework of the variables that might influence an individual's contribution and will be based on the researcher's practical knowledge of the research area, the available literature and evidence from the study itself. • This is a more intellectual strategy than the simple demographic stratification of epidemiological studies, though age, gender and social class might be important variables.
  • 235. …….  If the subjects are known to the researcher, they may be stratified according to known public attitudes or beliefs  It may be advantageous to study a broad range of subjects : • (maximum variation sample) • outliers (deviant sample) • subjects who have specific experiences (critical case sample) • subjects with special expertise (key informant sample).
  • 236. …….  The iterative process of qualitative study design means that samples are usually theory driven ( theoretical sampling) to a greater or lesser extent
  • 237. Some suggestions of sample size in qualitative studies  The smallest number of participants should be 15  Should lie under 50  6-8 participants for FGDs AND at least 2 FGDs per population group IMPORTANT  Attainment of saturation  Justification of choice of number
  • 238. Sample size determination in quantitative study Several criteria will need to be specified to determine the appropriate sample size: Level of precision, Level of confidence or risk, Degree of variability in the attributes being measured (prevalence), External validity
  • 239. …….  The Level of Precision, sometimes called sampling error  The range in which the true value of the population is estimated to be.  This range is often expressed in percentage points (e.g., ±5 percent).  The Confidence Level  Based on ideas encompassed under the Central Limit Theorem.  E.g. if a 95% confidence level is selected, 95 out of 100 samples will have the true population value within the range of precision
  • 240. ……. Degree of Variability refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain a given level of precision. The less variable (more homogeneous) a population, the smaller the sample size.
  • 241. ……  A proportion of 50 % indicates a greater level of variability than either 20% or 80%. This is because 20% and 80% indicate that a large majority do not or do, respectively, have the attribute of interest.  Because a proportion of 0.5 indicates the maximum variability in a population, it is often used in determining a more conservative sample size, that is, the sample size may be larger than if the true variability of the population attribute were used.
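The claim that a proportion of 50% carries the most variability can be checked with a short sketch. It assumes the usual variance of a yes/no attribute, p(1 − p); the function name `binary_variability` is only illustrative.

```python
# Variability of a yes/no attribute with proportion p is p * (1 - p).
def binary_variability(p: float) -> float:
    return p * (1.0 - p)

# Compare the three proportions mentioned in the slide.
for p in (0.2, 0.5, 0.8):
    print(f"p = {p:.1f}  ->  p(1 - p) = {binary_variability(p):.2f}")
```

The value is largest at p = 0.5 and equal for 0.2 and 0.8, which is why p = 0.5 gives the most conservative (largest) sample size.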
  • 242. ……  Sample size affects accuracy of representation; Larger sample means less chance of error  Minimum suggested sample is 30 and upper limit is 1,000 External validity – how well sample generalizes to the population, a representative sample is required (not the same thing as variety in a sample)
  • 243. Strategies for Determining Sample Size There are several approaches to determining the sample size.  Using a census for small populations  Imitating a sample size of similar studies  Using published tables  Applying formulas to calculate a sample size
  • 244. Using a Census for Small Populations ….  One approach is to use the entire population as the sample.  Although cost considerations make this impossible for large populations.  Attractive for small populations (e.g., 200 or less).  Eliminates sampling error and provides data on all the individuals in the population.  Some costs such as questionnaire design and developing the sampling frame are “fixed,” that is, they will be the same for samples of 50 or 200.  Finally, virtually the entire population would have to be sampled in small populations to achieve a desirable level of precision
  • 245. Using a Sample Size of a Similar Study  Use the same sample size as those of studies similar to the one you plan( Cite reference).  Without reviewing the procedures employed in these studies you may run the risk of repeating errors that were made in determining the sample size for another study.  However, a review of the literature in your discipline can provide guidance about “typical” sample sizes that are used.
  • 246. Using Published Tables  Published tables provide the sample size for a given set of criteria.  Necessary for given combinations of precision, confidence levels and variability.  The sample sizes presume that the attributes being measured are distributed normally or nearly so.  Although tables can provide a useful guide for determining the sample size, you may need to calculate the necessary sample size for a different combination of levels of precision, confidence, and variability.
  • 247. Sample Size for ±5%, ±7% and ±10% Precision Levels where Confidence Level Is 95% and P=.5.
      Size of Population    Sample Size (n) for Precision (e) of:
                            ±5%    ±7%    ±10%
      100                   81     67     51
      125                   96     78     56
      150                   110    86     61
      175                   122    94     64
      200                   134    101    67
      225                   144    107    70
      250                   154    112    72
      275                   163    117    74
      300                   172    121    76
      325                   180    125    77
      350                   187    129    78
      375                   194    132    80
      400                   201    135    81
      425                   207    138    82
      450                   212    140    82
  • 248. Using Formulas to Calculate a Sample Size  Sample size can be determined by the application of one of several mathematical formulae.  Formula mostly used for calculating a sample for proportions. For example:  For populations that are large, the Cochran (1963:75) equation yields a representative sample for proportions.  Fisher equation, Mugenda etc
  • 249. Cochran equation $n_0 = \frac{Z^2 pq}{e^2}$ Where n0 is the sample size, Z is the abscissa of the normal curve that cuts off an area α at the tails (1 – α equals the desired confidence level, e.g., 95%), e is the desired level of precision, p is the estimated proportion of an attribute that is present in the population, and q is 1-p. The value for Z is found in statistical tables which contain the area under the normal curve, e.g. Z = 1.96 for 95% level of confidence.
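As a minimal sketch, the Cochran equation can be evaluated as follows. The function name `cochran_sample_size` and the choice to round up to a whole respondent are assumptions made for illustration, not part of the slides.

```python
import math

def cochran_sample_size(p: float, e: float, z: float = 1.96) -> int:
    """Cochran's n0 = Z^2 * p * q / e^2, rounded up to a whole respondent."""
    q = 1.0 - p
    n0 = (z ** 2) * p * q / (e ** 2)
    return math.ceil(n0)

# Maximum variability (p = 0.5), ±5% precision, 95% confidence (Z = 1.96).
print(cochran_sample_size(p=0.5, e=0.05))
```

With p = 0.5, e = 0.05 and Z = 1.96 the formula gives 384.16, which rounds up to 385.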
  • 250. ….. A Simplified Formula For Proportions  Yamane (1967:886) provides a simplified formula to calculate sample sizes.  ASSUMPTION:  95% confidence level  P = .5 ;
  • 251. …….. $n = \frac{N}{1 + N e^2}$ Where n is the sample size, N is the population size, e is the level of precision.
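A short sketch of the simplified formula, which is commonly stated as n = N / (1 + N·e²). The function name and the rounding up are illustrative assumptions.

```python
import math

def yamane_sample_size(population_size: int, e: float) -> int:
    """Yamane's simplified formula n = N / (1 + N * e^2), rounded up.

    Assumes a 95% confidence level and p = 0.5, as stated in the slides.
    """
    n = population_size / (1 + population_size * e ** 2)
    return math.ceil(n)

# Hypothetical population of 2,000 with ±5% precision.
print(yamane_sample_size(2000, 0.05))
```

For N = 2,000 and e = 0.05 this gives 2000 / 6 ≈ 333.3, i.e. 334 respondents.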
  • 252. Finite population correction for proportions  With finite populations, correction for proportions is necessary  If the population is small then the sample size can be reduced slightly.  This is because a given sample size provides proportionately more information for a small population than for a large population.  The sample size (n0) can thus be adjusted using the corrected formulae
  • 253. ….. $n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}$ Where n is the sample size, N is the population size and n0 is the calculated sample size for an infinite population.
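The finite population correction can be sketched as below, assuming the commonly used form n = n0 / (1 + (n0 − 1)/N). The function name and the example population sizes are hypothetical.

```python
import math

def finite_population_correction(n0: float, population_size: int) -> int:
    """Adjust an infinite-population sample size n0 for a population of size N."""
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

# n0 = 385 is the Cochran result for p = 0.5, e = 0.05, Z = 1.96.
print(finite_population_correction(385, 2000))       # small population
print(finite_population_correction(385, 1_000_000))  # very large population
```

The correction reduces the sample size noticeably for the small population and barely changes it for the very large one.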
  • 254. Note  The sample size formulae provide the number of responses that need to be obtained. Many researchers commonly add 10 % to the sample size to compensate for persons that the researcher is unable to contact.  The sample size also is often increased by 30 % to compensate for non-response ( e.g self administered questionnaires).
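The adjustment described in this note is simple arithmetic. The sketch below assumes the allowance is applied as a straight percentage increase and rounded up; the function name is illustrative.

```python
import math

def inflate_sample_size(required_responses: int, allowance: float) -> int:
    """Add a fractional allowance (e.g. 0.10 or 0.30) to the calculated sample size."""
    return math.ceil(required_responses * (1 + allowance))

# Starting from a calculated sample size of 385 responses.
print(inflate_sample_size(385, 0.10))  # allowance for people who cannot be contacted
print(inflate_sample_size(385, 0.30))  # allowance for non-response
```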
  • 255. Use of software in sample size determination Depending on type of study and specific software Some information will be required:  Population sample size, population standard deviation, population sampling error, confidence level, z –value, power of study etc …  80% power in a clinical trial means that the study has an 80% chance of ending up with a p value of less than 5% in a statistical test (i.e. a statistically significant treatment effect) if there really was an important difference (e.g. 10% versus 5% mortality) between treatments.
  • 256. Further considerations  The above approaches to determining sample size have assumed that a simple random sample is the sampling design.  For more complex designs, e.g. case control studies, one must take into account the variances of sub-populations, strata, or clusters before an estimate of the variability in the population as a whole can be made.
  • 257. Estimation  Inferential statistics is the estimation of the population parameters from the sample statistics.  The sample statistics are calculated from the sample data and the population parameters are inferred (or estimated) from the sample statistics.  In estimation, we are concerned with unknown population parameters such as a population mean which is unknown but is required
  • 258.  Such situations force us to take samples, find sample statistics and use them to infer upon the unknown population parameters.  We can estimate an unknown population parameter in two main ways; (i) By calculating a point estimate from the samples. A point estimate is a single value from the sample such as the sample mean used to estimate an unknown population parameter such as the population mean µ.
  • 259. (ii) You can also calculate an interval estimate which is a range within which the unknown population parameter is expected to fall.  Whether we find a point estimate or an interval estimate, in both cases, we are trying to find or estimate the value of an unknown population parameter. The estimator so found must satisfy three conditions:
  • 260. (i) It must be unbiased: The expected value of the estimator must be equal to the population parameter, (ii) Consistent: The value of the estimator approaches the value of the parameter as the sample size increases, (iii) Relatively Efficient: The estimator has the smallest variance of all estimators which could be used.
  • 261. Estimating a population mean  Consider a population whose mean µ is unknown as illustrated by Figure below
  • 262. Note that the large area is the population while the smaller areas are samples taken from the population. In order to estimate µ, we will need to take samples from the population and calculate the sample means. Each sample mean $\bar{x}$ is trying to estimate µ individually. However, note that the best estimate of µ is the mean of the sample means, $\mu_{\bar{x}}$.
  • 263. $\mu_{\bar{x}} = \frac{\bar{x}_1 + \bar{x}_2 + \bar{x}_3 + \bar{x}_4 + \dots + \bar{x}_n}{n}$ For large n, $\mu_{\bar{x}} = \mu$ (the mean of the sampling distribution of means is equal to the population mean)
  • 264. The mean of the sample means, $\mu_{\bar{x}}$, is a point estimate of the population mean μ. It is called a consistent estimator because its value gets closer to the population mean μ as the sample size n increases. Irrespective of the number of samples under consideration, a point estimate is likely to be different from its corresponding population parameter. It is for this reason that interval estimates are preferred to point estimates.
  • 265.  An interval estimate is a range within which an unknown value is expected to fall.  You may be asked to estimate the day when the first rains will fall this year.  A point estimate would be to say the first rains will fall on 12th October.  This estimate is unlikely to be correct. Giving a range within which the first rains will fall would be a better estimate of the date.
  • 266.  The wider the range, the more the confidence that indeed the first rains will fall within that period.  I can say, for example that the first rains will fall between 1st October and 31st January. From experience, first rains always fall after 1st October and way before 31st January.  I can therefore say that I am 100% confident that the first rains will fall between 1st October and 31st January.
  • 267.  You will also agree with this level of confidence because you know very well the rains don’t fall until way after 1st October and way before 31st January. If we are 100% confident (probability=1) then we can represent this on a normal curve.
  • 268. Figure showing confidence level and Limits
  • 269.  The level of confidence is called the Confidence Level (100%) while the dates 1st October and 31st January are called the Confidence Limits.  If I shift the confidence limits and ask what your level of confidence is that the first rains will come between 1st November and 31st December, you may not be 100% confident.  The level of confidence drops because you know there are many years when the first rains have come in October.
  • 270.  You also know that first rains have sometimes come as late as after Christmas.  Your level of confidence may therefore be, say, 96%. The 4% is the likelihood that you are wrong, that the first rains could come before 1st November or after 31st December.
  • 271. Figure showing 96% confidence level, confidence limits and confidence interval for the day the first rains will fall in Malawi.
  • 272.  The above approach is also true for any unknown population parameter such as the population mean μ.  A confidence interval is an interval estimate with a specific level of confidence.  A level of confidence is the probability that the interval estimate will contain the parameter. In other words, it is the percent of the time the true mean will lie in the interval estimate given.
  • 273.  The confidence interval is therefore a range within which an unknown population parameter is expected to fall.  The confidence limits are values within which the level of confidence is declared.
  • 274.  For the estimation of μ, the sample means will be different from each other and also from their mean $\mu_{\bar{x}}$. As a result, the sample means will have a standard deviation about them, $\sigma_{\bar{x}}$. This standard deviation is called the standard error (SE). The Central Limit Theorem states that irrespective of the distribution of the parent population (whether normal or not), the sampling distribution of means will be approximately normally distributed when the sample size is large enough (i.e. the sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \dots, \bar{x}_n$ will be normally distributed with mean $\mu_{\bar{x}}$ and standard deviation $\sigma_{\bar{x}}$,
  • 275.  known as the standard error).  Figure showing sample means normally distributed.
  • 276.  This means we can use the normal distribution tables to determine the probability of a sample mean having any value of interest, provided we know the mean of the distribution (the mean of sample means, $\mu_{\bar{x}}$) and the standard deviation of the distribution, $\sigma_{\bar{x}}$.
  • 277.  The standard error of the mean (SEM) is the standard deviation of the sampling distribution of means. It can also be viewed as the standard deviation of the error in the sample mean relative to the true mean, since the sample mean is an unbiased estimator of μ.  SEM is the population standard deviation divided by the square root of the sample size; when σ is unknown it is usually estimated with the sample standard deviation (SD): $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ or $\sigma_{\bar{x}} = \frac{SD}{\sqrt{n}}$
  • 278.  Where;  σ is the standard deviation of the population.  n is the size (number of observations) of the sample.
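The behaviour of sample means described above can be illustrated with a small simulation. This is only a sketch: the population is hypothetical (a skewed set of lengths of stay drawn from an exponential distribution), and the sample size, number of samples and seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed (non-normal) population: lengths of stay in days.
population = [random.expovariate(1 / 8) for _ in range(20_000)]
mu = statistics.fmean(population)        # population mean
sigma = statistics.pstdev(population)    # population standard deviation

n = 50               # size of each sample
num_samples = 2_000  # number of repeated samples

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # estimate of mu
sd_of_means = statistics.stdev(sample_means)    # empirical standard error
theoretical_sem = sigma / n ** 0.5              # sigma / sqrt(n)

print(f"population mean        = {mu:.3f}")
print(f"mean of sample means   = {mean_of_means:.3f}")
print(f"SD of sample means     = {sd_of_means:.3f}")
print(f"sigma / sqrt(n)        = {theoretical_sem:.3f}")
```

The mean of the sample means should land close to the population mean, and the spread of the sample means should be close to σ/√n, even though the parent population is far from normal.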
  • 279.  Figure showing the 95% confidence interval estimate for μ.
  • 280.  We are 95% confident that the population mean value from which the sample was taken falls somewhere between X1 and X2.  Any confidence interval is given with a level of confidence which is given in percentage terms. You can have a 95% confidence interval or a 99% confidence interval or indeed any level of confidence. The level of confidence determines the number of standard deviations (Z) any sample mean value is from the mean $\mu_{\bar{x}}$.  The value of Z is obtained from tables. For 95% level of confidence, the area at the centre is 0.95.
  • 281.  You need to search for $\frac{5\%}{2} = 0.025$ inside the tables to be able to read the corresponding Z value. The value of Z for area=0.025 is ±1.96. The confidence interval is from X1 to X2; the 95% confidence interval is shown below.
  • 282.
  • 283.  The position X1 is 1.96 standard deviations (standard errors) less than the mean, i.e. $X_1 = \mu_{\bar{x}} - 1.96\,\sigma_{\bar{x}}$.  Similarly, the position X2 on Figure 3.6 is 1.96 standard errors more than the mean. This therefore means that;  The 95% confidence interval is $\mu_{\bar{x}} - 1.96\,\sigma_{\bar{x}}$ to $\mu_{\bar{x}} + 1.96\,\sigma_{\bar{x}}$  Since we know that $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, then the 95% confidence interval for μ is $\mu_{\bar{x}} - 1.96\,\frac{\sigma}{\sqrt{n}}$ to $\mu_{\bar{x}} + 1.96\,\frac{\sigma}{\sqrt{n}}$
  • 284.  The 95% confidence interval for μ is therefore $\mu_{\bar{x}} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$. In practice the observed sample mean $\bar{x}$ is used in place of $\mu_{\bar{x}}$.  Similarly, we can find the 99%, 98%, 90% etc. confidence intervals for the population mean given data from samples taken from that population.
  • 285.  Example  As part of a malaria control programme it was planned to spray all 10 000 houses in a rural area with insecticide and it was necessary to estimate the amount that would be required. Since it was not feasible to measure all houses, a random sample of 100 houses was chosen and the sprayable surface of each of these was measured.
  • 286.  The mean sprayable surface area for these 100 houses was 23.2 m2 and the standard deviation was 5.9 m2. (a) Calculate the standard error about the estimate of the population mean. (b) What is a standard error? (c) The 95% confidence interval of the population mean was 22.0 m2 to 24.4 m2, what is a confidence interval? (d) What is the difference between a standard error and a standard deviation?
  • 287. SOLUTION (a) The standard error of the population mean µ: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{5.9}{\sqrt{100}} = 0.59\ \text{m}^2$ (b) A standard error is the standard deviation of the sampling distribution of means. It is related to the population standard deviation in this way: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ (c) A confidence interval is a range within which an unknown population parameter is expected to fall.
  • 288. (d) A standard error is the standard deviation of the sampling distribution of means, i.e. how far, on average, sample means are from their own mean, while a standard deviation is a measure of how far, on average, individual data values are from their mean.
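The figures in this example can be checked with a short sketch. It uses the sample standard deviation (5.9) as an estimate of σ, as the solution does; the function name `mean_confidence_interval` is illustrative, and the Z value is taken from Python's `statistics.NormalDist` rather than from printed tables.

```python
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Return (standard error, lower limit, upper limit) for a population mean."""
    # Z value that leaves (1 - confidence) / 2 in each tail, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sd / n ** 0.5
    return se, sample_mean - z * se, sample_mean + z * se

# Sprayable surface example: mean 23.2, SD 5.9, n = 100 houses.
for level in (0.90, 0.95, 0.99):
    se, lower, upper = mean_confidence_interval(23.2, 5.9, 100, level)
    print(f"{level:.0%} CI: {lower:.1f} to {upper:.1f} (SE = {se:.2f})")
```

At the 95% level this reproduces the standard error of 0.59 and the interval of 22.0 to 24.4 quoted in the example; the 90% interval is narrower and the 99% interval wider.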
  • 289. Estimating a population proportion  One of the population parameters that need to be estimated is the population proportion ρ.  If a population proportion such as the prevalence of a disease in the entire population is unknown, it may be estimated through sampling the population as discussed.  The sample statistics are the best estimates of the unknown population proportion. The population proportion ρ can be estimated from the sample proportion p.
  • 290.  The 95% confidence interval for the population proportion ρ is given by;
  • 291.  95% Confidence interval for ρ: $p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$  Note that the standard error SE(p) is given by $\sqrt{\frac{p(1-p)}{n}}$
  • 292.  Example A health survey was carried out in Mangochi urban in 2014 among 123 adults chosen at random. The survey, among other things, asked respondents when they last visited a sing’anga (an African medicine man). The answers revealed that 34 of them had not visited a sing’anga for over 2 years. (a) Calculate an estimate of the proportion of adults who had not visited a sing’anga for over two years.
  • 293. (b) Find the 95% confidence interval for the proportion of adults who had not visited a sing’anga. What is the meaning of this confidence interval to you? (c) If a narrower confidence interval of this proportion was required, what would you recommend to the researchers? (d) What percentage of adults in Mangochi had visited a sing’anga in the past 2 years? Calculate the 98% confidence interval about this proportion.
  • 294. Solutions a) Proportion of adults who had not visited a sing’anga for the past two years: $p = \frac{34}{123} = 0.2764$ or 27.64% b) Find the 95% confidence interval for the proportion of adults who had not visited a sing’anga. What is the meaning of this confidence interval to you?
  • 295. The 95% confidence interval for ρ: $p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.2764 \pm 1.96\sqrt{\frac{0.2764(1 - 0.2764)}{123}}$, i.e. 0.1972 to 0.3554
  • 296.  It means that we have observed from the sample that 27.64% of adults did not visit a sing’anga for the past two years, but if we were to deal with the whole population, the proportion of adults who would not have visited a sing’anga would be somewhere between 19.72% and 35.54%. (c) If a narrower confidence interval of this proportion was required, what would you recommend to the researchers? In order to narrow the confidence interval (i.e. to get a more precise estimate) you need to increase the sample size. (d) What percentage of adults in Mangochi had
  • 297.  The percentage of adults who had visited a sing’anga
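The proportion calculations in this example can be sketched as follows. The function name is illustrative, Z values come from `statistics.NormalDist` rather than printed tables, and for part (d) the sketch assumes that all 123 respondents answered, so that 123 − 34 = 89 had visited a sing’anga in the past two years.

```python
from statistics import NormalDist

def proportion_confidence_interval(count, n, confidence=0.95):
    """Return (p, lower limit, upper limit) using p ± z * sqrt(p(1-p)/n)."""
    p = count / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = (p * (1 - p) / n) ** 0.5
    return p, p - z * se, p + z * se

# Parts (a) and (b): 34 of 123 adults had not visited a sing'anga, 95% level.
p, low, high = proportion_confidence_interval(34, 123, 0.95)
print(f"not visited: p = {p:.4f}, 95% CI {low:.3f} to {high:.3f}")

# Part (d): assumed 89 of 123 had visited, 98% level.
p_v, low_v, high_v = proportion_confidence_interval(123 - 34, 123, 0.98)
print(f"visited:     p = {p_v:.4f}, 98% CI {low_v:.3f} to {high_v:.3f}")
```

The first interval comes out at roughly 0.197 to 0.355, in line with the worked solution up to rounding. Increasing n in the function call shows the interval narrowing, which is the recommendation given in part (c).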
  • 298. DATA COLLECTION AND MANAGEMENT  Data collection is a major part of the research process.  Methods and instruments for data collection must be chosen according to the nature of the problem, approach to the solution and variables being studied
  • 299. Qualitative Data Collection Methods 1. Collecting verbal data  Verbal data primarily consist of words, obtained through various methodological approaches in which research participants speak about events, experiences, practices, and so on.  This is achieved through interviews, focus group discussions and narratives
  • 300. The three main methods of data collection 1. In-depth interviews (IDIs) Interviewing is often used in qualitative studies to elicit meaningful data. In interviews, the interviewer writes down responses verbatim or uses a tape-recorder for later transcription.
  • 301.  IDIs in qualitative research encourage subjects to express their views at length.  The respondent is usually interviewed at a place convenient to them.  An interview schedule, sometimes called an interview guide, is a list of topics administered to subjects by a skilled interviewer.
  • 302.  The researcher may be able to obtain more detailed information from each participant, but loses the richness that can arise in a group (FGD) in which people debate issues and exchange views. Example: Please describe your experiences on the day you were discharged from the hospital.
  • 303.  The interview helps reveal more about beliefs and attitudes and behaviour according to the respondent.  IDIs normally use open-ended questions which permit free responses which should be recorded in the respondents’ own words. Such questions are useful for obtaining in- depth information on:
  • 304.  1. Facts with which the researcher is not very familiar,  2. Opinions, attitudes and suggestions of informants,  3. Sensitive issues.
  • 305.  In order to have quality data with open ended questions there is need to 1. Thoroughly train and supervise the interviewers or select experienced research assistants. 2. Prepare a list of further questions to keep at hand to use to ‘probe’ for answer(s) in a systematic way
  • 306. 3. Pre-test open-ended questions and, if possible, pre-categorise the most common responses, leaving enough space for other answers.
  • 307. 2. Semi-structured interviews  Semi-Structured Interviews allow participants to provide specific answers to questions in their own words. When open-ended questions are included in the data collection tool, respondents must write out their responses.  The focus of the interview is decided by the researcher and there may be areas the researcher is interested in exploring.
  • 308.  The researcher tries to build a rapport with the respondent and the interview is like a conversation.
  • 309. 3. Focus group discussions (FGDs)  For this method the researcher brings together a small number of subjects, usually between 6 and 12, to discuss the topic of interest.  The group size is kept deliberately small, so that its members do not feel intimidated but can express opinions freely.  The small number of participants also makes discussion manageable by the
  • 310.  However, very few participants may result in an inadequate discussion and too many may lead to social loafing by others.  A focus group questionnaire is called a "discussion guide", and is more of a check list of questions than a fully structured questionnaire.  This is because the trick with focus groups is to put the group firmly in
  • 311.  The use of purposive sampling is most often employed when individuals known to have a desired expertise are sought.
  • 312. Direct observation  Data can be collected by an external observer, referred to as a non- participant observer.  Or the data can be collected by a participant observer, who can be a member of staff undertaking usual duties while observing the processes of care. 
  • 313.  In this type of study the researcher aims to become immersed in or become part of the population being studied, so that they can develop a detailed understanding of the values and beliefs held by members of the population.  Sometimes a list of observations the researcher is specifically looking for is prepared before-hand, other times the observer makes notes about anything they observe for analysis later.
  • 314. Quantitative Data Collection Method 1. Questionnaires  A questionnaire is an instrument with closed questions or statements to which a respondent must react.  Close-ended questionnaires ask subjects to select an answer from among several choices.  The alternatives may range from a simple ‘yes’ or ‘no’ to complex expressions of opinion.
  • 315.  Examples 1.Have you been hospitalized as an inpatient at any time in the past 5 years? a. Yes b. No
  • 316. 2. How important is it to you to avoid a pregnancy at this time? a. Extremely important b. Very important c. Somewhat important d. Not important
  • 317. Scales  A scale is a set of numerical values assigned to responses, representing the degree to which subjects possess a particular attitude, value or characteristic.  Likert Scales
  • 318.  Likert scales, also called summative scales, require subjects to respond to a series of statements to express a viewpoint.  Subjects read each statement and select an appropriately ranked response.  Response choices commonly address agreement, evaluation, or frequency.
  • 319.  Likert’s original scale included five agreement categories: “strongly agree (SA),” “agree (A),” “uncertain (U),” “disagree (D),” and “strongly disagree (SD).”  The number of categories in the Likert scale can be modified: it can be extended to seven categories (by adding “somewhat disagree” and “somewhat agree”) or reduced to four categories (by eliminating “uncertain”).
  • 320. For example:  What is your opinion on the following statement? ‘Women who have induced abortion should be severely punished.’
  • 321. Data Management  Data management consists of those activities aimed at achieving a systematic, coherent manner of data collection, storage and retrieval.  How data are stored and retrieved is at the heart of data management.
  • 322.  A good storage and retrieval system is critical for keeping track of what data are available, for permitting easy, flexible, reliable use of data and for documenting the analysis made so that the study can, in principle, be verified or replicated.  A system for storage and retrieval should be designed prior to the actual data collection.
  • 323.  In data management, you may consider some of the following points or questions:  The principal investigator is responsible for ensuring that data are of high quality by, for example, completely checking a subset of all completed interviews.  Data organization: How will you name your data files? How will you organize your data into folders?
  • 324.  Access & security: Who will have access to your data? If the data is sensitive, how will you protect it from unauthorized access?  Storage: Where will your data be stored?
  • 325.  Backups: This is probably the single most important item on this list. Hard drives on desktop and laptop computers fail regularly. You must have a credible backup strategy of regular backups, and of course you must then follow it. Consider including an off-site backup so that your data will not be lost if your building burns down or if your computer is stolen. Rather than relying on memory, consider an automated backup
  • 326.  A large amount of qualitative data can be stored on computers using a variety of available computer applications. Therefore, gaining as much knowledge as possible about computer programs is critical.  It is recommended that original data be preserved for not less than a period of 5 years, as there is reasonable expectation that the original data will continue to be the basis of ongoing
