Here are the key steps in systematic random sampling:
1. Number the sampling frame from 1 to N
2. Calculate the sampling interval K by dividing the total population by the desired sample size
3. Select a random number between 1 and K to determine the first sample unit
4. Then select every Kth unit after that to complete the sample
For example, if sampling 100 units from a population of 400, the sampling interval K would be 4. A random start between 1-4 would be selected, then every 4th unit after that. This ensures a systematically selected random sample.
3. 1.1. Introduction to Research
What is Research?
• A scientific study to seek hidden knowledge
• A scientific study to answer a question
• A scientific study of causes and effects
• A scientific attempt towards new discoveries
• A systematic method of inquiry
• A logical attempt to find answers to problems
• A systematic approach to a (medical) problem
4. Statistical Concept of Research
• Research is a systematic collection, analysis
and interpretation of data in order to solve a
research question
• It is classified as:
– Basic research: necessary to generate new
knowledge and technologies.
– Applied research: necessary to identify priority
problems and to design and evaluate policies and
programs for optimal health care and delivery.
5. 1.2. Types of Epidemiological Design
A. Descriptive studies
• Mainly concerned with the distribution of diseases with
respect to time, place and person.
• Useful for health managers to allocate resources and to
plan effective prevention programmes.
• Useful to generate epidemiological hypotheses, an
important first step in the search for disease
determinants or risk factors.
• Can use routinely collected information that is
readily available in many places, so descriptive
studies are generally less expensive and less time-consuming than
analytic studies.
6. • It is the most common type of
epidemiological design strategy in medical
literature.
• There are three main types:
– Correlational
– Case report or case series
– Cross-sectional
7. A.1. Correlational or Ecological
• Uses data from entire population to compare disease
frequencies – between different groups during the same
period of time, or in the same population at different
points in time.
• Does not provide individual data, rather presents
average exposure level in the community.
• Cause cannot be ascertained.
• Correlation coefficient is the measure of association in
correlational studies. It is important to note that
positive association does not necessarily imply a valid
statistical association.
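As a minimal sketch of how the correlation coefficient is computed in a correlational study, the Python snippet below calculates Pearson's r from community-level averages. The data values and the function name `pearson_r` are hypothetical and serve only to show the calculation; note that each value describes a whole community, not an individual.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical community-level data (one value per community, not per person).
salt_g_per_day = [6.0, 8.0, 10.0, 12.0, 14.0]
hypertension_pct = [12.0, 15.0, 19.0, 22.0, 27.0]

r = pearson_r(salt_g_per_day, hypertension_pct)
print(round(r, 3))
```

A value of r close to +1 or -1 indicates a strong linear relationship between the group averages; it says nothing about whether the individuals who ate more salt are the ones with hypertension.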
8. Eg.
• Hypertension rates and average per capita salt
consumption compared between two communities.
• Average per capita fat consumption and breast cancer
rates compared between two communities.
• Comparing incidence of dental caries in relation to
fluoride content of the water among towns in the rift
valley.
• Mortality from CHD in relation to per capita cigarette
sales among the regions of Ethiopia.
9. • Strength: Can be done quickly and
inexpensively, often using available data.
• Limitation:
– Inability to link exposure with disease.
– Lack of ability to control for effects of potential
confounding factors. There may be other factors that
are the true cause.
– It may mask a non-linear relationship between
exposure and disease. For example, alcohol
consumption and mortality from CHD have a
non-linear relationship (the curve is “J” shaped).
10. A.2. Case Report and Case Series
• Describes the experience of a single or a group of
patients with similar diagnosis. Has limited value,
but occasionally revolutionary.
• E.g. 5 young homosexual men with PCP seen
between Oct. 1980 and May 1981 in Los Angeles
aroused concern among physicians. Later, with further
follow-up and thorough investigation of the strange
occurrence of the disease, the diagnosis of AIDS
was established for the first time.
11. • Strength:
– very useful for hypothesis generation.
• Limitations:
– Report is based on a single or a few patients, so the
finding could have occurred just by coincidence.
– Lack of an appropriate comparison group.
12. A.3. Cross-Sectional Studies (Survey)
• Information about the status of an individual with
respect to the presence or absence of exposure
and disease is assessed at the same point in time.
Easy to do; many surveys are like this.
• For factors that remain unaltered over time, such
as sex, race or blood group, the cross-sectional
survey can provide evidence of a valid statistical
association.
• Useful for raising the question of the presence of
an association rather than for testing a hypothesis.
13. B. ANALYTIC STUDIES
• Focuses on the determinants of a disease by
testing the hypothesis formulated from
descriptive studies, with the ultimate goal of
judging whether a particular exposure causes or
prevents disease.
• Broadly classified into two
– observational and interventional studies.
– Both types use “controls”. The use of controls is the
main distinguishing feature of analytic studies.
14. B.1. Observational studies
• Information is obtained by observation of events.
No intervention is done. Cohort and case-control
studies are in this category.
i. Cohort
• Subjects are selected by exposure, or determinants
of interest, and followed to see
if they develop the disease or outcome of interest.
• E.g. Follow 100 children who received BCG
vaccination and another 100 who didn’t get BCG
vaccination and see how many of them get
tuberculosis.
15. • ii. Case Control
• Subjects are selected with respect to presence or
absence of disease, or outcome of interest, and
then inquiries are made about past exposure to
the factor(s) of interest.
• E.g. Take people with and without TB, ask them
if they ever had BCG vaccination.
16. B.2. Interventional / Experimental
• The researcher does something about the disease or
exposure and observes the changes.
• The investigator has control over who gets the exposure
and who does not. The key is that the investigator
assigns subjects to either group, whether it is done
randomly or not.
• Always prospective.
• E.g. Assign children randomly to get chloroquine or
not, and see how many develop symptomatic
malaria.
17. Description of common terms
Statistics- It is the process of scientifically collecting,
organizing, summarizing and interpreting of data, and the
drawing of inferences about a body of data when only part
of the data are observed.
Biostatistics- It is a branch of statistics in which the data being
analyzed are derived from the biological and medical sciences.
Descriptive statistics: A statistical method that is concerned
with the collection, organization, summarization, and
analysis of data from a sample of population.
Inferential statistics: A statistical method that is concerned
with the drawing of inferences/ conclusions about a
particular population by selecting and measuring a random
sample from the population.
18. Population: Is the largest collection of entities/values of
a random variable for which we have an interest at a
particular time. Population could be finite or infinite.
We can take the whole number of students in a given
class (e.g. 100 students) as a population.
• Target population: A collection of items that have
something in common for which we wish to draw
conclusions at a particular time.
• Study Population: The specific population from
which data are collected
19. Sample: It is some part/subset of the population of interest.
In the above example, if we randomly select 25 students
from the 100, we call those 25 students a sample of the class.
Hence, generalizability is a two-stage procedure: we
want to generalize from the sample to the study
population and then from the study population to
the target population.
20. Eg.: In a study of the prevalence
of HIV among orphan children in
Ethiopia, a random sample of
orphan children in Lideta Kifle
Ketema was included.
Target Population: All orphan
children in Ethiopia
Study population: All orphan
children in Addis Ababa
Sample: Orphan children in
Lideta Kifle Ketema
21. Statistical inference: It is the procedure by which we reach a
conclusion about a population on the basis of the information
contained in a sample that has been drawn from that population.
Parameter: A descriptive measure computed from the data of a population;
a numerical expression of population measurements.
E.g. population mean (µ), population variance, population
standard deviation, etc.
Statistic: A descriptive measure computed from the data of a
sample.
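The distinction between a parameter and a statistic can be illustrated with a short Python sketch. It assumes a hypothetical population of 100 student weights (the values are simulated, not real data) and reuses the class-of-100, sample-of-25 setting from the earlier example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: weights (kg) of the 100 students in a class.
population = [random.gauss(60, 8) for _ in range(100)]

# Parameter: computed from every unit in the population.
mu = statistics.mean(population)

# Statistic: computed from a random sample of 25 students.
sample = random.sample(population, 25)
x_bar = statistics.mean(sample)

print(f"parameter (population mean) = {mu:.2f}")
print(f"statistic (sample mean)     = {x_bar:.2f}")
```

The sample mean will generally differ somewhat from the population mean; that difference is the sampling error discussed later in this unit.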
Statistical data: Information that is systematically collected,
tabulated and analyzed, and whose results are interpreted to draw
conclusions.
22. • Data: aggregate of variables as a result of measurement or
counting.
• Variable: A characteristic that takes on different values in
different persons, places, or things.
– Dependent variable (response): the variable(s) we measure
as an outcome of interest
– Independent variable (predictor): the variable(s) that
determine the outcome
23. Categorical variable: The notion of magnitude is
absent or implicit.
– Nominal: have distinct levels that have no inherent
ordering.
– With only two categories, they are called
binary or dichotomous. Eg. sex: male or female
– With more than two categories, they are called
polytomous. Eg. color
– Ordinal: have levels that do follow a distinct
ordering.
Eg. severity of pain (mild, moderate, severe)
24. Quantitative(numeric) variable: Variable that has magnitude
• Discrete data: when numbers represent actual measurable
quantities rather than mere labels.
Discrete data are restricted to taking only specified
values often integers or counts that differ by fixed
amounts.
e.g. Number of new AIDS cases reported during one
year period, Number of beds available in a particular
hospital
• Continuous data: represent measurable quantities but are
not restricted to taking on certain specific values i.e
fractional values are possible. Can use interval (no true zero
value) or ratio scale (begins at zero)
– e.g. weight, cholesterol level, time, temperature
25. 1.3. Sampling Methods
Sampling
• The process of selecting a portion of the population to represent
the entire population.
• A main concern in sampling:
– Ensure that the sample represents the population, and
• The findings can be generalized.
26. Advantages of sampling:
• Feasibility: Sampling may be the only feasible method of
collecting information.
• Reduced cost: Sampling reduces demands on resources such
as finance, personnel, and material.
• Greater accuracy: Sampling may lead to better accuracy of
collecting data
• Sampling error: Precise allowance can be made for sampling
error
• Greater speed: Data can be collected and summarized more
quickly
27. Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of discrimination within the
population.
• Sampling may be inadvisable where every unit in the population is
legally required to have a record.
Errors in sampling
1) Sampling error: Errors introduced due to selection of a sample.
– They cannot be avoided or totally eliminated.
2) Non-sampling error:
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
28. Divisions of Sampling Methods
Two broad divisions:
A. Probability sampling methods
B. Non-probability sampling methods
29. 1.4.1. Probability sampling
• Involves random selection of a sample
• A sample is obtained in a way that ensures every member of the
population has a known, non-zero probability of being
included in the sample.
• Involves the selection of a sample from a population, based on
chance.
30. • Probability sampling is:
– more complex,
– more time-consuming and
– usually more costly than non-probability
sampling.
• However, because study samples are randomly selected and
their probability of inclusion can be calculated,
– reliable estimates can be produced and
• inferences can be made about the population.
31. • There are several different ways in which a probability sample can
be selected.
• The method chosen depends on a number of factors, such as
– the available sampling frame,
– how spread out the population is,
– how costly it is to survey members of the population
32. Most common probability sampling methods
1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling
33. 1. Simple random sampling (SRS)
• Involves random selection
• Each member of a population has an equal chance of being
included in the sample.
• To use the SRS method:
– Make a numbered list of all the units in the population
– Each unit should be numbered from 1 to N
(where N is the size of the population)
– Select the required number.
34. • The randomness of the sample is ensured by:
– use of “lottery” methods
– a table of random numbers
– computer programs
• Example
• Suppose your school has 500 students and you need to
conduct a short survey on the quality of the food served in the
cafeteria.
• You decide that a sample of 10 students should be sufficient
for your purposes.
• In order to get your sample, you assign a number from 1 to
500 to each student in your school.
35. • To select the sample, you use a table of randomly generated
numbers.
• Pick a starting point in the table (a row and column number)
and look at the random numbers that appear there. In this
case, since the data run into three digits, the random
numbers would need to contain three digits as well.
• Ignore 000 and all random numbers greater than 500 because they do not
correspond to any of the students in the school.
• Remember that the sample is without replacement, so if a
number recurs, skip over it and use the next random
number.
• The first 10 different numbers between 001 and 500 make
up your sample
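The same selection can be done with a computer program instead of a random number table. The following is a minimal Python sketch, assuming the students are numbered 1 to 500 as in the example; the function name `simple_random_sample` and the seed value are illustrative assumptions.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw unit numbers from 1..population_size without replacement."""
    rng = random.Random(seed)  # seed is optional; it only makes the draw repeatable
    return rng.sample(range(1, population_size + 1), sample_size)

# 10 students out of 500, as in the cafeteria survey example.
sample = simple_random_sample(500, 10, seed=42)
print(sorted(sample))
```

Because `random.sample` draws without replacement, no student number can appear twice, which matches the rule of skipping repeated numbers in the table.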
36. • SRS has certain limitations:
– Requires a sampling frame.
– Difficult if the reference population is dispersed.
– Minority subgroups of interest may not be selected.
37. 2. Systematic random sampling
• Sometimes called interval sampling, systematic sampling means that
there is a gap, or interval, between each selected unit in the
sample
• The selection is systematic rather than random
– Individuals are chosen at regular interval from the sampling
frame. Ideally we randomly select a number to tell us where
to start selecting individuals from the list.
• Important if the reference population is arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Student’s registration books
– Individuals are taken at fixed intervals (every kth) based on the
sampling fraction, eg. if the sample includes 20%, then every
fifth.
38. Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where N is the
total population size).
2. Determine the sampling interval (K) by dividing the number of
units in the population by the desired sample size.
39. Steps…
• During a survey, in order to find one study unit it is important to
figure out how many houses must be visited, usually by
doing a pilot study.
• Example: Assume you are doing a study involving children under
5. There are 1500 households in all, and you have a required
sample size of 100 children. From a preliminary study you have
done, there is one child every 2.5 households. Normally, if there
were a child in every household, you would visit 100 households.
But because not every household includes a child, you will need
to visit 100 x 2.5 or 250 households to find the required 100
children.
• The sampling interval will therefore be 1500/250 = 6, i.e. every 6th
household.
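The arithmetic in this example can be written out as a short Python sketch. The function names are hypothetical; the numbers (1500 households, 100 children, one child per 2.5 households) are those of the example above.

```python
import math

def households_to_visit(required_units, households_per_unit):
    """Households that must be visited to find the required study units."""
    return math.ceil(required_units * households_per_unit)

def sampling_interval(population_size, n_selected):
    """Sampling interval K = population size / number of units to select."""
    return population_size // n_selected

visits = households_to_visit(100, 2.5)  # 100 children, one child per 2.5 households
k = sampling_interval(1500, visits)     # 1500 households in all
print(visits, k)
```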
40. 3. Select a number between one and K at random. This number is
called the random start and would be the first number included
in your sample.
4. Select every Kth unit after that first number
Note: Systematic sampling should not be used when a
cyclic repetition is inherent in the sampling frame.
41. Example
To select a sample of 100 from a population of 400, you would need
a sampling interval of 400 ÷ 100 = 4.
Therefore, K = 4.
You will need to select one unit out of every four units to end up with
a total of 100 units in your sample.
Select a number between 1 and 4 from a table of random numbers.
• If you choose 3, the third unit on your frame would be the first
unit included in your sample;
• The sample might consist of the following units to make up a
sample of 100: 3 (the random start), 7, 11, 15, 19...395, 399 (up to
N, which is 400 in this case).
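The four steps can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `systematic_sample` and its parameters are assumptions, and the interval is taken as the whole-number part of N divided by n.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every Kth unit from a frame numbered 1..population_size."""
    k = population_size // sample_size       # step 2: sampling interval K
    if start is None:
        start = random.Random(seed).randint(1, k)  # step 3: random start in 1..K
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and K")
    # step 4: take every Kth unit after the random start
    return list(range(start, population_size + 1, k))[:sample_size]

# Reproduce the example: N = 400, n = 100, random start of 3.
sample = systematic_sample(400, 100, start=3)
print(sample[:5], "...", sample[-2:])
```

Passing `start=None` lets the function draw the random start itself; fixing `start=3` is done here only to match the worked example.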
42. The main difference is that with SRS, any combination of 100 units would
have a chance of making up the sample, while with systematic
sampling, there are only four possible samples.
43. Advantages
• Systematic sampling is usually less time consuming and easier to
perform than SRS
• It provides a good approximation to SRS (i.e. it has comparably high
precision)
• Unlike SRS, systematic sampling can be conducted without a
sampling frame. So, systematic random sampling is useful when
a sampling frame is not readily available.
– E.g. In patients attending a health center, where it is not
possible to predict in advance who will be attending
44. Disadvantage
• If there is any sort of cyclic pattern in the ordering of the
subjects, which coincides with the sampling interval, the sample
will not be representative of the population.
– May result in systematic error
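The effect of a cyclic pattern can be seen in a small Python sketch. The frame below is hypothetical: daily clinic attendance over 52 weeks with a peak every 7th day. When the sampling interval equals the cycle length, the sample contains either only peak days or no peak days at all.

```python
# Hypothetical frame: daily clinic attendance over 52 weeks, with a peak
# every 7th day (for example, a weekly market day).
attendance = [100 if day % 7 == 0 else 40 for day in range(364)]
true_mean = sum(attendance) / len(attendance)

def sample_mean(k, start):
    """Mean attendance over the days at positions start, start+k, ... (0-based)."""
    picked = [attendance[i] for i in range(start, len(attendance), k)]
    return sum(picked) / len(picked)

print(round(true_mean, 2))                  # population mean
print(sample_mean(k=7, start=0))            # interval equals the cycle: only peak days
print(sample_mean(k=7, start=3))            # interval equals the cycle: never a peak day
print(round(sample_mean(k=5, start=3), 2))  # interval out of step with the cycle
```

With an interval of 7 the estimate depends entirely on the random start, whereas an interval that does not coincide with the cycle gives an estimate close to the population mean.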
45. 3. Stratified random sampling
• It is done when the population is known to have heterogeneity with
regard to some factors and those factors are used for stratification
• Using stratified sampling, the population is divided into
homogeneous, mutually exclusive groups called strata.
– A population can be stratified by any variable that is available for all units prior
to sampling (e.g., age, sex, province of residence, income, etc.).
• A separate sample is taken independently from each stratum.
• Any of the sampling methods mentioned in this section (and others
that exist) can be used to sample within each stratum.
46. Why do we need to create strata?
• It can make the sampling strategy more efficient.
• A larger sample is required to get a more accurate estimation if a
characteristic varies greatly from one unit to the other.
• For example, if every person in a population had the same salary, then
a sample of one individual would be enough to get a precise estimate
of the average salary.
• This is the idea behind the efficiency gain obtained with stratification.
– If you create strata within which units share similar characteristics
(e.g., income) and are considerably different from units in other
strata (e.g., occupation, type of dwelling) then you would only need
a small sample from each stratum to get a precise estimate of total
income for that stratum.
47. – Then you could combine these estimates to get a precise
estimate of total income for the whole population.
• If you use a SRS approach in the whole population without
stratification, the sample would need to be larger than the total
of all stratum samples to get an estimate with the same level of
precision.
48. • Stratified sampling ensures an adequate sample size for sub-
groups in the population of interest.
• When a population is stratified, each stratum becomes an
independent population and you will need to decide the sample
size for each stratum.
49. • Equal allocation:
– Allocate equal sample size to each stratum
• Proportionate allocation:
$n_j = \frac{n}{N} N_j$, j = 1, 2, ..., k, where k is
the number of strata and
– n_j is the sample size of the jth stratum
– N_j is the population size of the jth stratum
– n = n_1 + n_2 + ... + n_k is the total sample size
– N = N_1 + N_2 + ... + N_k is the total population
size
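Proportionate allocation can be computed directly from the stratum sizes. The Python sketch below is illustrative only: the function name and the three stratum sizes are hypothetical, and rounding to whole units is an added assumption (after rounding, the stratum samples may not add up exactly to n and may need a small adjustment).

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Sample size per stratum: n_j = (n / N) * N_j, rounded to whole units."""
    N = sum(stratum_sizes)
    return [round(total_sample * N_j / N) for N_j in stratum_sizes]

# Hypothetical strata of 500, 300 and 200 units; total sample n = 100.
allocation = proportionate_allocation([500, 300, 200], 100)
print(allocation)
```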
50. 4. Cluster sampling
• Sometimes it is too expensive to spread a sample across the
population as a whole.
• Travel costs can become expensive if interviewers have to
survey people from one end of the country to the other.
• To reduce costs, researchers may choose a cluster sampling
technique
• Clusters should be similar to one another, each ideally reflecting
the variety in the population, unlike stratified sampling, where the
strata differ from one another
51. Steps in cluster sampling
• Cluster sampling divides the population into groups or clusters.
• A number of clusters are selected randomly to represent the total
population, and then all units within selected clusters are
included in the sample.
• No units from non-selected clusters are included in the sample—
they are represented by those from selected clusters.
• This differs from stratified sampling, where some units are
selected from each group.
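These steps can be sketched in Python as below. The frame of school sections is hypothetical, as are the function name `cluster_sample` and the seed; the point is that whole clusters are drawn at random and every unit inside a chosen cluster enters the sample.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """clusters maps a cluster id to its list of units.

    Returns the randomly chosen cluster ids and all units inside them.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # select clusters at random
    units = [u for c in chosen for u in clusters[c]]    # include every unit in them
    return chosen, units

# Hypothetical frame: 6 school sections of unequal size.
sections = {f"section_{i}": [f"s{i}_{j}" for j in range(20 + 5 * i)] for i in range(6)}
chosen, units = cluster_sample(sections, 2, seed=7)
print(chosen, len(units))
```

Because sections differ in size, the number of units obtained depends on which clusters happen to be chosen.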
52. Example
• In a school based study, we assume students of the same school are
homogeneous.
• We can randomly select sections and include all students of the
selected sections only
53. • As mentioned, cost reduction is a reason for using cluster
sampling.
• It creates 'pockets' of sampled units instead of spreading the
sample over the whole territory.
• Another reason is that sometimes a list of all units in the
population is not available, while a list of all clusters is either
available or easy to create.
54. • In most cases, the main drawback is a loss of efficiency when
compared with SRS.
• It is usually better to survey a large number of small clusters instead
of a small number of large clusters.
– This is because neighboring units tend to be more alike, resulting
in a sample that does not represent the whole spectrum of
opinions or situations present in the overall population.
55. • Another drawback to cluster sampling is that you do not have total
control over the final sample size.
• For example, not all schools have the same number of (say Grade 11)
students and city blocks do not all have the same number of
households. Because you must interview every student or household in
the selected clusters, the final sample size may be larger or smaller
than you expected.
56. 5. Multi-stage sampling
• Similar to cluster sampling, except that it involves picking a
sample from within each chosen cluster, rather than including all
units in the cluster.
• This type of sampling requires at least two stages.
57. • In the first stage, large groups or clusters are identified and
selected. These clusters contain more population units than are
needed for the final sample.
• In the second stage, population units are picked from within the
selected clusters (using any of the possible probability sampling
methods) for a final sample.
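A two-stage design can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the frame of schools is hypothetical, simple random sampling is used at both stages, and the function name `two_stage_sample` is illustrative.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: select clusters at random. Stage 2: select units within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for c in chosen:
        units = clusters[c]
        take = min(n_per_cluster, len(units))  # a cluster may hold fewer units than requested
        sample[c] = rng.sample(units, take)
    return sample

# Hypothetical frame: 8 schools with 50 students each.
schools = {f"school_{i}": [f"st{i}_{j}" for j in range(50)] for i in range(8)}
sample = two_stage_sample(schools, n_clusters=3, n_per_cluster=10, seed=11)
print({school: len(students) for school, students in sample.items()})
```

Only the selected schools need a list of their students, which is the saving over a full sampling frame.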
58. • If more than two stages are used, the process of choosing
population units within clusters continues until there is a final
sample.
• With multi-stage sampling, you still have the benefit of a more
concentrated sample for cost reduction.
• However, the sample is not as concentrated as in cluster sampling, and
the required sample size is still bigger than for a simple random
sample.
59. • Also, you do not need to have a list of all of the units in the
population. All you need is a list of clusters and list of the units in
the selected clusters.
• Admittedly, more information is needed in this type of sample
than what is required in cluster sampling. However, multi-stage
sampling still saves a great amount of time and effort by not
having to create a list of all the units in a population.
60. 1.4.2. Non-probability sampling
• The difference between probability and non-probability
sampling has to do with a basic assumption about the nature of
the population under study.
• In probability sampling, every item has a known chance of being
selected.
• In non-probability sampling, there is an assumption that there is an
even distribution of a characteristic of interest within the
population.
61. • This is what makes the researcher believe that any sample would
be representative and because of that, results will be accurate.
• For probability sampling, randomness is a feature of the selection
process, rather than an assumption about the structure of the
population.
62. • In non-probability sampling, since elements are chosen
arbitrarily, there is no way to estimate the probability of any one
element being included in the sample.
• Also, no assurance is given that each item has a chance of being
included, making it impossible either to estimate sampling
variability or to identify possible bias
63. • Reliability cannot be measured in non-probability sampling; the only
way to address data quality is to compare some of the survey results
with available information about the population.
• Still, there is no assurance that the estimates will meet an acceptable
level of error.
• Researchers are reluctant to use these methods because there is no
way to measure the precision of the resulting sample.
64. • Despite these drawbacks, non-probability sampling methods can
be useful when descriptive comments about the sample itself are
desired.
• Secondly, they are quick, inexpensive and convenient.
• There are also other circumstances in research when it
is infeasible or impractical to conduct probability sampling.
65. Common types of non-probability sampling
1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
66. 1.4. Scales of measurement
• Measurement: the assignment of numbers or names to objects or events
according to a set of rules.
• Clearly not all measurements are the same.
• Measuring an individual’s weight is qualitatively different from
measuring their response to some treatment on a three-category
scale: “improved”, “stable”, “not improved”.
• Measuring scales are different according to the degree of
precision involved.
• There are four types of scales of measurement.
67. Scales…
1. Nominal scale: uses names, labels, or symbols to assign each
measurement to one of a limited number of categories that
cannot be ordered.
– Examples: Blood type, sex, race, marital status
2. Ordinal scale: assigns each measurement to one of a limited
number of categories that are ranked in terms of a graded order.
– Examples: Patient status, Cancer stages
68. Scales…
3. Interval scale: assigns each measurement to one of an unlimited
number of categories that are equally spaced. It has no true zero
point.
– Example: Temperature measured on Celsius or Fahrenheit
4. Ratio scale: measurement begins at a true zero point and the
scale has equal intervals.
– Eg: Height, weight, blood pressure
70. 1.5.Validity and reliability
Validity and Reliability are two major
requirements for any measurement.
– Validity pertains to the correctness of the
measure; a valid tool measures what it is
supposed to measure.
– Reliability pertains to the consistency of the tool
across different contexts.
• Validity is often described as internal or
external.
71. 1.6. Sources and methods of data collection and
its handling
Sources
Two major sources
Primary sources: data that are collected by the
investigator himself/herself for the purpose of a specific inquiry or
study.
Such data are original in character and are mostly generated by surveys
conducted by individuals or research institutions.
The first hand information obtained by the investigator is more reliable
and accurate since the investigator can extract the correct information
by removing doubts, if any, in the minds of the respondents regarding
certain questions. High response rates might be obtained since the
answers to various questions are obtained on the spot. It permits
explanation of questions concerning difficult subject matter.
72. Secondary data
Secondary Data: When an investigator uses data, which have
already been collected by others, such data are called "Secondary
Data". Such data are primary data for the agency that collected
them, and become secondary for someone else who uses these
data for his/her own purposes.
The secondary data can be obtained from journals, reports of
different institutions, government publications, publications of
professionals and research organizations. These data are less
expensive and can be collected in a short time.
73. Data collection methods
1.Observation
• is a technique that involves systematically selecting, watching and
recording behaviours of people or other phenomena and aspects
of the setting in which they occur, for the purpose of getting
specified information.
• includes all methods from simple visual observations to the use
of high level machines and measurements, sophisticated
equipment or facilities, such as radiographic, biochemical, X-ray
machines, microscope, clinical examinations, and microbiological
examinations.
74. Observation…
• Advantages: Gives relatively more accurate data on behaviour
and activities
• Disadvantages: The investigator’s or observer’s own biases, prejudices
and desires may influence the data.
• Needs more resources and skilled human power when
high level machines are used.
75. 2. The Documentary sources
• Include clinical records and other personal records, published
mortality statistics, census publications, etc.
• Advantages:
a) Documents can provide ready-made information relatively easily
b) The best means of studying past events
• Disadvantages:
a) Problems of reliability and validity (because the information is
collected by a number of different persons who may have used
different definitions or methods of obtaining data).
b) There is a possibility that errors may occur when the information
is extracted from the records.
76. 3. Interviews and self-administered questionnaire
a) Interviews: may be less or more structured.
A public health worker conducting interviews may be armed with a
checklist of topics, but may not decide in advance precisely what
questions he/she will ask.
• This approach is flexible; the content, wording and order of the
questions are relatively unstructured.
– the content, wording and order of the questions vary from interview to
interview.
77. Interviews…
On the other hand, in other situations a more standardized technique may
be used, the wording and order of the questions being decided in
advance.
This may take the form of a highly structured interview(interviewing using
questionnaire),
• the investigator appoints persons/enumerators, who go to the
respondents personally with the questionnaire, ask them questions and
record their replies.
– This can be done using telephone or face-to-face interviews.
78. Interviews…
• Questions may take two general forms: they may be “open
ended” questions, which the subject answers in his/her own
words,
• or “closed” questions, which are answered by choosing from a
number of fixed alternative responses.
79. Advantage of interview
• A good interviewer can stimulate and maintain the respondent’s
interest. This leads to the frank answering of questions.
• If anxiety is aroused (e.g., why am I being asked these
questions?) , the interviewer can allay it.
An interviewer:
• can repeat questions which are not understood, and give
standardized explanations where necessary.
• can ask “follow-up” or “probing” questions to clarify a
response.
• can make observations during the interview;
• i.e., note is taken not only of what the subject says but also
how he/she says it.
80. b) Self-administered questionnaire
• The respondent reads the questions and fills in the answers by
himself/herself (sometimes in the presence of an interviewer
who “stands by” to give assistance if necessary).
• The use of self-administered questionnaires is simpler and
cheaper;
• can be administered to many persons simultaneously (e.g. to
a class of school children).
• They can be sent by post. However, they demand a certain
level of education on the part of the respondent.
81. • Quantitative data are commonly collected using structured
interviews (where standard questionnaires are common and the
collected data can be processed relatively easily), whereas
• qualitative data are usually collected using unstructured
interviews.
• The unstructured interviews are undertaken with the help of
checklists, key informant interviews, focus group discussions, etc.
82. Qualitative…
Checklist - is a list of questions prepared ahead of time to facilitate
the interviews or discussions. It is not an exhaustive one. It helps
the facilitator not to miss any of the important topics under
consideration.
Key informant interviews – interviews done with influential
individuals (such as community elders, priests, etc.).
Focus group discussions – discussions held with a group of
respondents.
• The group contains 6 to 12 people who are more or less similar
with respect to level of education, marital status, age, sex, etc.
(this composition helps each respondent to talk freely without
being dominated by the other).
83. Steps in Questionnaire Design
1. Before beginning to construct it, make sure that a questionnaire
is the best method of collecting data for your objectives
– Know beforehand what information is needed and what is
going to be done with this information
2. While drafting the questions, know why each question is
asked and what will be done with the information (to prevent
wasting resources)
84. Steps in…
3. To get valid and reliable information:
• the wording and sequence of the questions should help
respondents recall the information asked for
• keep forgetfulness on the part of respondents to a minimum
• avoid difficult, time-consuming, embarrassing or too
personal questions
• the flow of questions should be from simple to complex,
from general to specific, and from impersonal to personal
• take care to protect the respondent's confidentiality
• Include a cover letter (if by mail)
• Identify respondents by ID (rather than name)
86. Data collection
A plan for data collection can be made in two steps:
1. Listing the tasks that have to be carried out and who
should be involved, making a rough estimate of the time needed
for the different parts of the study, and identifying the most
appropriate period in which to carry out the research
2. Actually scheduling the different activities that have to
be carried out each week in a work plan
87. Why should you develop a plan for data
collection?
A plan for data collection should be developed so that:
– you will have a clear overview of what tasks have to be carried out,
who should perform them, and the duration of these tasks;
– you can organize both human and material resources for data
collection in the most efficient way; and
– you can minimize errors and delays which may result from lack of
planning (for example, the population not being available or
data forms being misplaced).
88. Data collection process
Stages
• Stage 1: Permission to proceed
– Obtaining consent from the relevant authorities,
individuals and the community in which the project
is to be carried out
89. Data collection process
Stage 2: Data collection
• Logistics
– who will collect what,
– when and
– with what resources
• Quality control
– Prepare a field work manual
– Select your research assistants
– Train research assistants
– Supervision
– Check data for completeness and accuracy
90. Data collection process
• How long will it take to collect the data for each
component of the study?
– Step 1: Consider the time required to reach the study
area; to locate the study units; the number of visits
required per study unit and for follow-up of non-
respondents
– Step 2: Calculate the number of interviews that can
be carried out per person per day
– Step 3: Calculate the number of days needed to carry
out the interviews.
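The three steps above amount to a small calculation. The sketch below is illustrative only: the function name and all figures (400 interviews, 5 interviewers, 8 interviews per person per day) are hypothetical, and travel time and revisits from Step 1 are assumed to be already reflected in the daily rate.

```python
import math

def fieldwork_days(n_interviews, n_interviewers, interviews_per_person_per_day):
    """Days needed to complete all interviews, rounded up to whole days."""
    per_day = n_interviewers * interviews_per_person_per_day
    return math.ceil(n_interviews / per_day)

# Hypothetical figures, not taken from any real study
days = fieldwork_days(n_interviews=400, n_interviewers=5,
                      interviews_per_person_per_day=8)
print(days)  # 10
```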
91. Ensuring data quality
Measures to help ensure good quality of data:
Prepare a field work manual for the research team
as a whole
Select your research assistants, if required, with
care
Train research assistants carefully in all topics
covered in the field work manual as well as in
interview techniques
Pre-test research instruments and research
procedures with the whole research team,
including research assistants.
92. Ensuring data quality
Take care that research assistants are not placed
under too much stress
Arrange for ongoing supervision of research
assistants, and develop guidelines
for supervisory tasks.
Devise methods to assure the quality of data
collected by all members of the research team.
93. Data Collection Process
Stage 3: Data handling
• Once the data have been collected and checked for
completeness and accuracy, a clear procedure should be
developed for handling and storing them
• Numbering of all questionnaires
• Identify the person responsible for storing data and the
place where it will be stored
• Decide how data should be stored. Record forms
should be kept in the sequence in which they have been
numbered.
94. Research Assistants
• This includes data collectors, supervisors and
possibly local guides
• Selection – during selection one should consider
similarity in educational level and possibly sex
composition
• Training – all research assistants and team
members should be trained together
95. Pre-test and pilot study
A pre-test usually refers to a small-scale trial of particular
research components.
A pilot study is the process of carrying out a preliminary
study, going through the entire research procedure with a small
sample.
Why do we carry out a pre-test or pilot study?
A pre-test or pilot study serves as a trial run that allows us
to identify potential problems in the proposed study.
96. Pre-test and pilot study
What aspects of your research methodology can be
evaluated during pre-testing?
1. Reactions of the respondents to the research
procedures can be observed in the pre-test – availability
and willingness
2. The data-collection tools can be pre-tested
3. Sampling procedures can be checked
4. Staffing and activities of the research team can be
checked, while all are involved in the pre-test
5. Procedures for data processing and analysis can be
evaluated during the pre-test
6. The proposed work plan and budget for research
activities can be assessed during the pre-test.
97. Plan for data processing & analysis
• Data processing and analysis should start in the
field, with checking for completeness of the data
and
• Performing quality control checks, while sorting
the data by instrument used and by group of
informants
• Data of small samples may even be processed and
analyzed as soon as it is collected.
98. Plan for data processing & analysis
• The plan for data processing and analysis must be made
after careful consideration of the objectives of the study as well as
of the tools developed to meet the objectives.
• The procedures for the analysis of data collected
through qualitative and quantitative techniques are quite
different.
– For quantitative data the starting point in analysis is usually a
description of the data for each variable
– For qualitative data it is more a matter of describing,
summarizing and interpreting the data obtained for each
study unit
99. Plan for data processing & analysis
• When making a plan for data processing and
analysis the following issues should be
considered:
– Sorting data,
– Performing quality-control checks,
– Data processing, and
– Data analysis.
100. Data processing and analysis
• Sorting data
– Into groups of different study populations or
comparison groups
• Quality control checks
– Check again for completeness and internal
consistency
– Missing data - if many exclude the questionnaire
– Inconsistency - correct, return or exclude
101. Data processing
• Decide whether to process and analyse the data from
questionnaires:
– manually, using data master sheets or manual compilation of
the questionnaires, or
– by computer, for example, using a micro-computer and
existing software or self-written programmes for data
analysis.
• Data processing in both cases involves:
• categorising the data,
• coding, and
• summarising the data in data master sheets, manual compilation
without master sheets, or
• data entry and verification by computer.
103. 2. Data summarization (Descriptive statistics)
2.1. Describing variables
The methods of describing variables differ depending on the
type of data
Categorical or Numerical
Sometimes we transform numeric data into categorical data, e.g.
age,
– when less detail is required
• This is achieved by dividing the range of values that the
numeric variable takes into intervals.
104. Describing…
Categorical variables
• Table of frequency distributions
– Frequency
– Relative frequency
– Cumulative frequencies
• Charts
– Bar charts
– Pie charts
106. In summary,
• There are three ways we can summarize and present data:
• Tabular representation - summarizing data by making a table of
the data called frequency distributions.
• Graphical representation of data - we can make a graph of the
data.
• Numerical representation of data - we can use a single number to
represent many numbers.
– Measures of central tendency.
– Measures of variability.
107. 2.2. Frequency Distribution
• A frequency distribution shows the number of observations falling
into each of several ranges of values.
• Four different types of frequency distributions.
– Simple frequency distribution (or it can be just called a
frequency distribution).
– Cumulative frequency distribution.
– Grouped frequency distribution.
– Cumulative grouped frequency distribution.
• They are portrayed as frequency tables, histograms, or polygons
• Can show either the actual number of observations falling in each
range or the percentage of observations. In the latter instance, the
distribution is called a relative frequency distribution
108. Simple frequency distribution
Consider the following set of data which are the high
temperatures recorded for 30 consecutive days. We wish
to summarize this data by creating a frequency
distribution of the temperatures.
Data Set - High Temperatures for 30 Days
50 45 49 50 43
49 50 49 45 49
47 47 44 51 51
44 47 46 50 44
51 49 43 43 49
45 46 45 51 46
109. Simple frequency distribution…
To create a frequency distribution from this
data proceed as follows:
1. Identify the highest and lowest values in the
data set. For our temperatures the highest
temperature is 51 and the lowest temperature is
43.
2. Create a column with the title of the variable we
are using, in this case temperature. Enter the
highest score at the top, and include all values
within the range from the highest score to the
lowest score.
110. Simple frequency…
3. Create a tally column to keep track of the scores as you
enter them into the frequency distribution. Once the
frequency distribution is completed you can omit this
column
4. Create a frequency column, with the frequency of each
value, as shown in the tally column, recorded.
5. At the bottom of the frequency column record the total
frequency for the distribution, preceded by N =
6. Enter the name of the frequency distribution at the top
of the table.
111. Simple frequency…
If we applied these steps to the temperature data above
we would have the following frequency distribution
Frequency Distribution for High Temperatures
Temperature   Tally     Frequency
51            ////      4
50            ////      4
49            //// /    6
48                      0
47            ///       3
46            ///       3
45            ////      4
44            ///       3
43            ///       3
                        N = 30
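The tallying steps can also be checked by machine. The following is a minimal sketch, assuming the 30 temperatures are typed in as a Python list; the variable names are illustrative.

```python
from collections import Counter

temps = [50, 45, 49, 50, 43,
         49, 50, 49, 45, 49,
         47, 47, 44, 51, 51,
         44, 47, 46, 50, 44,
         51, 49, 43, 43, 49,
         45, 46, 45, 51, 46]

counts = Counter(temps)

# List every value from highest to lowest, including values that never occur
for value in range(max(temps), min(temps) - 1, -1):
    print(f"{value:>4}  {'/' * counts[value]:<8} {counts[value]}")
print(f"N = {sum(counts.values())}")
```

A Counter returns 0 for a value that was never seen, which is why 48 still appears with frequency 0.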
112. Cumulative frequency distribution
To create a cumulative frequency distribution:
• Create a frequency distribution
• Add a column entitled cumulative frequency
• The cumulative frequency for each score is the sum of the
frequencies of all scores up to and including that
score
• The highest cumulative frequency should equal N
(the total of the frequency column)
113. Cumulative frequency…
Cumulative Frequency Distribution for High Temperatures
Temperature   Tally     Frequency   Cumulative Frequency
51            ////      4           30
50            ////      4           26
49            //// /    6           22
48                      0           16
47            ///       3           16
46            ///       3           13
45            ////      4           10
44            ///       3           6
43            ///       3           3
                        N = 30
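The cumulative column is a running total of the frequency column, starting from the lowest value. A minimal sketch, with the frequencies copied from the table and the variable names chosen for illustration:

```python
from itertools import accumulate

# Values in ascending order with their frequencies from the table above
temperatures = [43, 44, 45, 46, 47, 48, 49, 50, 51]
frequencies = [3, 3, 4, 3, 3, 0, 6, 4, 4]

cumulative = list(accumulate(frequencies))

for t, f, c in zip(temperatures, frequencies, cumulative):
    print(f"{t:>4} {f:>3} {c:>4}")
```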
114. Grouped frequency distribution
To create a grouped frequency distribution:
• select an interval size so that you have 7-20 class intervals;
Sturges' rule, k = 1 + 3.322 log10(n), can also be used to choose
the number of classes k for n observations
• create a class interval column and list each of the class
intervals
• each interval must be the same size, they must not overlap,
there may be no gaps within the range of class intervals
• create a tally column (optional)
• create a midpoint column for interval midpoints
• create a frequency column
• enter N = some value at the bottom of the frequency
column
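As a rough sketch of these steps, the code below applies Sturges' rule, commonly written k = 1 + 3.322 log10(n), and then counts integer observations into equal, non-overlapping classes. Rounding k up, the starting value 42 and the width 3 are assumptions made for illustration; the data are the 30 daily temperatures from the simple frequency example.

```python
import math

def sturges_classes(n):
    """Number of classes suggested by Sturges' rule, rounded up (an assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def group(values, start, width):
    """Count integer values per class [lower, lower + width - 1]."""
    counts = {}
    for v in values:
        lower = start + ((v - start) // width) * width
        key = (lower, lower + width - 1)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))

temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

print(sturges_classes(len(temps)))
print(group(temps, start=42, width=3))
```

With only nine distinct values in this data set, grouping loses detail without much gain; grouping is more useful for wider ranges.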
115. Grouped frequency for the temperature data
Note: N = 50 here, so this table is based on a larger set of temperature
readings than the 30-day data set above.
Grouped Frequency Distribution for High Temperatures
Class Interval   Tally          Interval Midpoint   Frequency
57-59            //////         58                  6
54-56            ///////        55                  7
51-53            ///////////    52                  11
48-50            /////////      49                  9
45-47            ///////        46                  7
42-44            //////         43                  6
39-41            ////           40                  4
                                                    N = 50
116. Cumulative grouped frequency distribution
We just add a cumulative frequency column to the grouped
frequency distribution and we have a cumulative grouped
frequency distribution as shown below.
Cumulative Grouped Frequency Distribution for High Temperatures
Class Interval   Tally          Interval Midpoint   Frequency   Cumulative Frequency
57-59            //////         58                  6           50
54-56            ///////        55                  7           44
51-53            ///////////    52                  11          37
48-50            /////////      49                  9           26
45-47            ///////        46                  7           17
42-44            //////         43                  6           10
39-41            ////           40                  4           4
                                                    N = 50
117. Relative Frequency
• Sometimes it is useful to compute the proportion, or percentage, of
observations in each category.
• The relative frequency of a particular category is the proportion (fraction)
of observations that fall into that category.
• The cumulative relative frequency is the sum of the relative
frequencies of all categories up to and including a particular category.
– It is the relative frequency of items less than or equal to the upper class
limit of each class.
• It can be computed for quantitative data and for categorical (qualitative) data (but only if the
latter are ordinal).
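A short sketch of both quantities, using the class frequencies of the grouped temperature table (N = 50) listed from the lowest class upwards; the variable names are illustrative.

```python
# Frequencies for classes 39-41, 42-44, ..., 57-59
frequencies = [4, 6, 7, 9, 11, 7, 6]
n = sum(frequencies)

relative = [f / n for f in frequencies]

# Cumulative relative frequency: running count divided by n
running = 0
cumulative_relative = []
for f in frequencies:
    running += f
    cumulative_relative.append(running / n)

for f, r, c in zip(frequencies, relative, cumulative_relative):
    print(f"{f:>3} {r:6.2f} {c:6.2f}")
```

Dividing the running count by n, rather than summing the rounded proportions, avoids accumulating rounding error.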
118. Characteristics and guidelines of table
construction
Characteristics
• A table must be self-explanatory
• The title should describe the content of the table and should answer
the questions what, where and when the data were collected
• Percentages in each category should add up to 100
• Footnotes should be placed at the bottom of the table
119. Guidelines
• The shape and size of the table should provide the required
number of rows and columns to accommodate the whole data
• If a quantity is zero, it should be entered as zero; leaving a
blank space or putting a dash in place of zero is confusing and
undesirable
• If two or more figures are the same, ditto marks should not
be used in the table in place of the original numerals
• If any figure in a table has to be singled out for a particular
purpose, it should be marked with an asterisk
120. 2.3. Diagrammatic Representation
2.3.1. Importance of diagrammatic representation:
1. Diagrams are more attractive than mere figures. They give
delight to the eye, add a spark of interest and as such catch the
attention as much as the figures dispel it.
2. They help in deriving the required information in less time and
without any mental strain.
3. They have greater memorizing value than mere figures. This is so
because the impression left by the diagram is of a lasting nature.
4. They facilitate comparison
121. Importance….
Well designed graphs can be an incredibly powerful means of
communicating a great deal of information using visual
techniques
When graphs are poorly designed, they not only fail to convey
your message effectively, they often mislead and confuse.
122. 2.3.2.Types
1. Bar graph
•Bar diagram is the easiest and most adaptable general
purpose chart.
•Though this type of chart can be used for any type of series,
it is especially satisfactory for nominal and ordinal data.
•The categories are represented on the base line (X-axis) at
regular intervals and the corresponding frequencies
or relative frequencies are represented on the Y-axis (ordinate)
in the case of a vertical bar diagram, and vice versa in the case
of a horizontal bar diagram.
123. Method of constructing bar graph
•All bars drawn in any single study should be of the same width
•The different bars should be separated by equal distances
•All the bars should rest on the same line called the base
•It is better to construct a diagram on a graph paper
Types of bar graph
• 1.Simple bar graph: It is one-dimensional diagram in which the
bar represents the whole of the magnitude. The height/length of
each bar indicates the frequency of the figure represented.
Example: Construct a bar graph for the following data
124. Table __. Distribution of pediatric patients in X hospital ward by type of
admitting diagnosis, Jan 2000
Diagnosis          Number of patients   Relative freq (%)
Pneumonia          487                  48.7
Malaria            200                  20.0
Cardiac problems   168                  16.8
Malnutrition       80                   8.0
Others             65                   6.5
Total              1000                 100.0
126. 2.Sub-divided (component) bar graph
• It is also called segmented bar graph. If a given magnitude can be
split up into subdivisions, or if there are different quantities
forming the subdivisions of the totals, simple bars may be
subdivided in the ratio of the various subdivisions to exhibit the
relationship of the parts to the whole.
• The order in which the components are shown in a "bar" is
followed in all bars used in the diagram.
128. 3. Multiple bar graph
Multiple Bar diagrams can be used to represent the
relationships among more than two variables.
The following figure shows the relationship
between children’s reports of breathlessness and
cigarette smoking by themselves and their
parents.
130. 3. Multiple bar graph…
• We can see quickly from the graph that the prevalence of the
symptom increases both with the child's smoking and with that of
their parents.
131. 2. Pie chart
Pie chart shows the relative frequency for each category by dividing
a circle into sectors, the angles of which are proportional to the
relative frequency.
Steps to construct a pie chart
1. Construct a frequency table
2. Change the frequencies into percentages (P)
3. Change the percentages into degrees, where: degrees =
(P / 100) × 360°
4. Draw a circle and divide it accordingly
132. 2. Pie chart…
Example: Distribution of deaths for females in England and Wales, 1989.
Cause of death            Number of deaths
Circulatory system (C)    100,000
Neoplasm (N)              70,000
Respiratory system (R)    30,000
Injury & poisoning (I)    6,000
Digestive system (D)      10,000
Others (O)                20,000
Total                     236,000
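The conversion from counts to percentages to sector angles can be sketched as follows, using the counts from this table. The dictionary layout and variable names are illustrative; each angle is (percentage / 100) × 360 degrees.

```python
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())
angles = {}
for cause, count in deaths.items():
    percent = 100 * count / total
    angles[cause] = percent / 100 * 360   # sector angle in degrees
    print(f"{cause:<20} {percent:5.1f}%  {angles[cause]:6.1f} degrees")
```

The angles should add up to 360 degrees, which is a quick check on the arithmetic.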
134. 3.Histogram
Histograms are frequency distributions with continuous class
intervals that have been turned into graphs.
To construct a histogram, we draw the interval boundaries on a
horizontal line and the frequencies on a vertical line.
Non-overlapping intervals that cover all of the data values must be
used.
Bars are then drawn over the intervals in such a way that the areas
of the bars are all proportional in the same way to their interval
frequencies.
135. Example: Distribution of the RBC cholinesterase values
(µmol/min/ml) obtained from 35 workers exposed to pesticides
RBC cholinesterase (µmol/min/ml)   Frequency, n (%)   Cumulative frequency (%)
5.95-7.95                          1 (2.9)            2.9
7.95-9.95                          8 (22.9)           25.8
9.95-11.95                         14 (40.0)          65.8
11.95-13.95                        9 (25.7)           91.5
13.95-15.95                        2 (5.7)            97.2
15.95-17.95                        1 (2.9)            100
Total                              35 (100)
Source: Knapp RG, Miller MC III: Clinical Epidemiology and biostatistics
136. 3.Histogram…
[Figure: Histogram of the RBC cholinesterase values of 35 pesticide-exposed workers. Y-axis: number of pesticide-exposed workers (0 to 16). X-axis: RBC cholinesterase (µmol/min/ml), class midpoints 6.95 to 16.95.]
137. 4.Frequency polygon
A frequency distribution can be portrayed graphically in yet another way
by means of a frequency polygon.
•To draw a frequency polygon we connect the mid-points of the tops of
the cells (bars) of the histogram by straight lines.
•It can be also drawn without erecting rectangles as follows:
The scale should be marked in the numerical values of the mid-points
of intervals.
Erect ordinates on the mid-point of the interval-the length or altitude of
an ordinate representing the frequency of the class on whose mid-point
it is erected.
Join the tops of the ordinates and extend the connecting line to the scale
of sizes.
139. 5.Cumulative frequency polygon (ogive curve)
Sometimes it may become necessary to know the number of items
whose values are more or less than a certain amount.
•We may, for example, be interested in knowing the number of
patients whose weight is less than 50 Kg or more than, say, 60 Kg.
•To get this information it is necessary to change the form of the
frequency distribution from a ‘simple’ to a ‘cumulative’
distribution.
•An ogive curve turns a cumulative frequency distribution into a
graph.
140. 5.Cumulative frequency polygon (ogive curve)…
Example: Heart rate of patients admitted to Hospital B, 2000
Heart rate     No. of      Cumulative freq.,    Cumulative freq.,
(beats/min)    patients    less than method     greater than method
54.5-59.5      1           1                    54
59.5-64.5      5           6                    53
64.5-69.5      3           9                    48
69.5-74.5      5           14                   45
74.5-79.5      11          25                   40
79.5-84.5      16          41                   29
84.5-89.5      5           46                   13
89.5-94.5      5           51                   8
94.5-99.5      2           53                   3
99.5-104.5     1           54                   1
Total          54
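Both cumulative columns can be derived from the class counts alone: the "less than" column is a running total from the top of the table, and the "greater than" column is a running total from the bottom. A minimal sketch with illustrative variable names:

```python
from itertools import accumulate

# Number of patients per class, from 54.5-59.5 up to 99.5-104.5
counts = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

less_than = list(accumulate(counts))
greater_than = list(accumulate(reversed(counts)))[::-1]

for c, lt, gt in zip(counts, less_than, greater_than):
    print(f"{c:>3} {lt:>4} {gt:>4}")
```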
142. 6.Box-and-whisker plot
It is another way to display information when the objective is to
illustrate certain locations in the distribution.
A box is drawn with the top of the box at the third quartile and the
bottom at the first quartile.
The location of the midpoint of the distribution is indicated with a
horizontal line in the box.
Finally, straight lines or whiskers are drawn from the center of the
top of the box to the largest observation and from the center of
the bottom of the box to the smallest observation.
It is useful when one of the characteristics is qualitative and the other is
quantitative.
145. Box-and-whisker plot
• The graphs indicate the similarity of the distribution
between the percentage saturation of bile in men and
women.
•Again, we see that percentage saturation of bile is a bit more
spread out among women with range 35 to 146 but we see
also that the mid-points of the distributions are almost the
same and that most of the spread in values in women
occurs in the upper half of the distribution.
146. 7.Scatter plot
Most studies in medicine involve measuring more than one
characteristic, and graphs displaying the relationship between
two characteristics are common in the literature.
• To illustrate the relationship between two characteristics when
both are quantitative variables we use bivariate plots (also called
scatter plots or scatter diagrams).
A scatter diagram is constructed by drawing X-and Y-axes.
•Each observation is represented by a point or dot(•).
•In the same study on percentage saturation of bile, information
was collected on the age of each patient to see whether a
relationship existed between the two measures. The following
plot was displayed.
147. 7.Scatter plot…
The graph suggests the possibility of a positive relationship
between age and percentage saturation of bile in women.
148. 8.Line graph
In this type of graph, we have two variables under consideration
like that of scatter diagram.
•A variable is taken along X-axis and the other along Y-axis.
•The points are plotted and joined by line segments in order.
•These graphs depict the trend or variability occurring in the data.
•Sometimes two or more graphs are drawn on the same graph
paper taking the same scale so that the plotted graphs are
comparable.
Example:
The following graph shows the level of zidovudine (AZT) in the blood
of AIDS patients at several times after administration of the
drug, with normal fat absorption and with fat malabsorption.
151. Measures of central tendency
On the scale of values of a variable there is a certain stage at which
the largest number of items tend to cluster.
Since this stage is usually in the centre of distribution, the tendency
of the statistical data to get concentrated at certain values is
called “central tendency”
The various methods of determining the actual value at which the
data tends to concentrate are called measures of central tendency.
152. Measures of central tendency…
The most important objective of calculating measure of central
tendency is to determine a single figure which may be used
to represent a whole series involving magnitude of the same
variable.
In that sense it is an even more compact description of the
statistical data than the frequency distribution.
•Since a measure of central tendency represents the entire data,
it facilitates comparison within one group or between
groups of data.
153. Measures of central tendency…
Characteristics of a good measure of central tendency
A measure of central tendency is good or satisfactory if it
possesses the following characteristics.
1.It should be based on all the observations
2.It should not be affected by the extreme values
3.It should be as close to the maximum number of values as
possible
4.It should have a definite value
5.It should not be subjected to complicated and tedious
calculations
6.It should be capable of further algebraic treatment
7.It should be stable with regard to sampling
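154. Arithmetic mean (x̄)
The most familiar measure of central tendency (MCT) is the arithmetic
mean (AM). It is also popularly known as the average.
a) Ungrouped data
If x1, x2, ..., xn are n observed values,
then:
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$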
155. Arithmetic mean…
b) Grouped data. In calculating the mean from grouped data, we
assume that all values falling into a particular class interval are
located at the mid-point of the interval. It is calculated as follows:
$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$
where, k = the number of class intervals
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
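As an illustration of the grouped-data mean (the sum of mid-point times frequency, divided by the total frequency), the sketch below reuses the mid-points and frequencies of the grouped temperature table (N = 50) from earlier in this section. The variable names are illustrative.

```python
# Mid-points and frequencies for classes 39-41, 42-44, ..., 57-59
midpoints = [40, 43, 46, 49, 52, 55, 58]
frequencies = [4, 6, 7, 9, 11, 7, 6]

weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
n = sum(frequencies)
grouped_mean = weighted_sum / n

print(weighted_sum, n, grouped_mean)  # 2486 50 49.72
```

Because every value is treated as sitting at its class mid-point, the result is an approximation of the mean of the raw data.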
157. Arithmetic mean…
• The arithmetic mean possesses the following properties.
• Uniqueness: For given set of data there is one and only one
arithmetic mean.
• Simplicity: The arithmetic mean is easily understood and
easy to compute.
• Center of gravity: Algebraic sum of the deviations of the
given values from their arithmetic mean is always zero.
• Sensitivity: The arithmetic mean possesses all the
characteristics of a central value, except No.2, (is greatly
affected by the extreme values).
• In case of grouped data if any class interval is open,
arithmetic mean can not be calculated
158. The Median (x̃)
• a) Ungrouped data
•The median of a finite set of values is the value that divides the
set into two equal parts, such that the number of values
greater than the median is equal to the number of values less
than the median.
•If the number of values is odd, the median will be the middle value
when all values have been arranged in order of magnitude.
•When the number of observations is even, there is no single
middle observation but two middle observations.
•In this case the median is taken to be the mean of these two
middle observations, when all observations have been
arranged in order of magnitude.
159. The Median…
b) Grouped data
• In calculating the median from grouped data, we assume that the
values within a class-interval are evenly distributed through the
interval.
• The first step is to locate the class interval in which the median lies.
We use the following procedure.
• Find n/2 and identify the class interval with the smallest cumulative
frequency that is at least n/2.
• To find a unique median value, use the following interpolation
formula:
$\tilde{x} = L_m + \left(\frac{n/2 - F_c}{f_m}\right) W$
160. Median…
where, L_m = lower true class boundary of the interval containing
the median
F_c = cumulative frequency of the interval immediately preceding the median
class interval
f_m = frequency of the interval containing the median
W = class interval width
n = total number of observations
161. Median…..
Example
n/2 = 75/2 = 37.5
Median class interval = 35-44
L_m = 34.5, F_c = 35, W = 10, n = 75, f_m = 22
•Median = 34.5 + ((37.5 − 35) / 22) × 10 = 35.64
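The interpolation in this example can be reproduced with a few lines of code. This is a sketch only; the function and parameter names are illustrative and correspond to L_m, F_c, f_m, W and n in the example.

```python
def grouped_median(lower_boundary, cum_freq_before, class_freq, width, n):
    """Median from grouped data by linear interpolation within the median class."""
    return lower_boundary + (n / 2 - cum_freq_before) / class_freq * width

median = grouped_median(lower_boundary=34.5, cum_freq_before=35,
                        class_freq=22, width=10, n=75)
print(round(median, 2))  # 35.64
```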
162. Properties of the median
• There is only one median for a given set of data
• The median is easy to calculate
• Median is a positional average and hence it is not drastically
affected by extreme values
• Median can be calculated even in the case of open end
intervals
• It is not a good representative of data if the number of
items is small
163. Mode (x̂)
a) Ungrouped data
•It is a value which occurs most frequently in a set of values.
•If all the values are different there is no mode, on the other hand, a
set of values may have more than one mode.
b) Grouped data
• In designating the mode of grouped data, we usually refer to the
modal class, where the modal class is the class interval with the
highest frequency.
• If a single value for the mode of grouped data must be specified,
it is taken as the mid point of the modal class interval.
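For ungrouped data, the rules above (no mode when all values differ, possibly several modes otherwise) can be sketched as below. The function name and the small data sets are hypothetical, chosen only to show each case.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []          # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
print(modes([1, 2, 3]))        # []
```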
164. Properties of mode
• It is not affected by extreme values
• It can be calculated for distributions with open end classes
• Often its value is not unique
• The main drawback of mode is that often it does not exist
165. MEASURES OF POSITIONS
Quartiles
• Divide the distribution into four equal parts. The 25th percentile
demarcates the first quartile (Q1),
• the median or 50th percentile demarcates the second quartile (Q2),
• the 75th percentile demarcates the third quartile (Q3),
• and the 100th percentile demarcates the fourth quartile (Q4), which is
the maximum observation.
Q1 is the ¼(n+1)th measurement, i.e., 25% of all the ranked observations
are less than Q1.
Q2 is the 2/4(n+1)th = ((n+1)/2)th measurement, i.e., 50% of all ranked
observations are less than Q2. Its rank is twice the rank of Q1.
Q3 is the ¾(n+1)th measurement; its rank is three times the rank of Q1.
It indicates that 75% of all the ranked observations are less than Q3.
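Python's standard library offers one way to compute quartiles at the (n+1)-based positions described above: `statistics.quantiles` with `method="exclusive"`. When a position falls between two ranks it interpolates linearly, which is an assumption beyond what the text specifies. The data are the 20 birth weights used in the percentile example of this section.

```python
import statistics

weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
           3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

# method="exclusive" places Q1, Q2, Q3 at ranks (n+1)/4, 2(n+1)/4, 3(n+1)/4
q1, q2, q3 = statistics.quantiles(weights, n=4, method="exclusive")
print(q1, q2, q3)  # 2838.75 3246.5 3443.75
```

Here n = 20, so Q1 sits at rank 5.25, a quarter of the way from the 5th to the 6th ordered value.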
166. Percentile
• Percentiles simply divide the data into 100 equal parts.
• The value in a set of data that has 100% of the observations at or
below it (the maximum) is called the 100th
percentile.
• From this same perspective, the median, which has 50% of the
observations at or below it, is the 50th percentile.
• The pth percentile of a distribution is the value such that p percent
of the observations are less than or equal to it.
The pth percentile value depends on whether np/100 is an integer or
not, with the n observations ranked from smallest to largest:
– If np/100 is not an integer, it is the (k+1)th ranked observation, where k
is the largest integer less than np/100.
– If np/100 is an integer, it is the average of the (np/100)th and
(np/100 + 1)th ranked observations.
167. Percentiles…
Example: The following data are a sample of birth weights (grams) of live births
at a hospital during a one-week period.
3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759, 3609, 2069, 3260,
3314, 3541, 3649, 3484, 2834, 2841, 3031.
Calculate the 10th and 90th percentiles
Solution: n = 20; p = 10 and 90. First put the data in ascending order:
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265,
3314, 3323, 3484, 3541, 3609, 3649, 4146.
For the 10th percentile, np/100 = 20 × 10/100 = 2, which is an integer. So the
10th percentile is the average of the 2nd and 3rd ordered observations:
(2581 + 2759)/2 = 2670 grams.
For the 90th percentile, np/100 = 20 × 90/100 = 18, which is an integer. So the
90th percentile is the average of the 18th and 19th ordered observations:
(3609 + 3649)/2 = 3629 grams.
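The percentile rule used in this example can be written as a short function. This is a sketch under the assumption that p is a whole number strictly between 0 and 100; the function name is illustrative.

```python
def percentile(data, p):
    """p-th percentile (integer p, 0 < p < 100) using the np/100 rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)
    n = len(xs)
    k, remainder = divmod(n * p, 100)
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th ordered values
        return (xs[k - 1] + xs[k]) / 2
    # np/100 is not an integer: take the (k+1)-th ordered value
    return xs[k]

weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
           3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

print(percentile(weights, 10))  # 2670.0
print(percentile(weights, 90))  # 3629.0
```

Statistical packages often use other interpolation conventions, so their results can differ slightly from this rule.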
168. Percentiles…
• Therefore, we would say that 80 percent of the birth weights
fall between 2670 g and 3629 g, which gives us an overall
feel for the spread of the distribution.
• The most commonly used percentiles other than the median (50th
percentile) are the 25th percentile and the 75th percentile.
169. Measures of variability
• The measure of central tendency alone is not enough to have a
clear idea about the distribution of the data.
• Moreover, two or more sets may have the same mean and/or
median but they may be quite different.
• Thus, to have a clear picture of the data, one needs a measure
of dispersion or variability (scatter) among the observations
in the set.
170. Range (R)
R = X_L − X_S,
where
• X_L is the largest value and X_S is the smallest value.
• Properties
• It is the simplest measure and can be easily understood
• It takes into account only two values which causes it to be a poor
measure of dispersion
171. Interquartile range (IQR)
IQR = Q3 − Q1,
where
Q3 is the third quartile and Q1 is the first quartile.
Example: Suppose the first and third quartiles for weights of girls 12
months of age are 8.8 Kg and 10.2 Kg respectively. The
interquartile range is therefore
IQR = 10.2 Kg − 8.8 Kg = 1.4 Kg,
i.e., the middle 50% of infant girls at 12 months weigh between 8.8 and 10.2
Kg.
173. Interquartile…
• Generally, we use interquartile range to describe variability when
we use the median as the measure of central location. We use the
standard deviation, which is described in the next section, when
we use the mean.
Properties
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two specific values
• It is important in selecting cut-off points in the formulation of
clinical standards
• Since it excludes the lowest and highest 25% values, it is not
affected by extreme values
• It is not capable of further algebraic treatment
174. Quartile deviation (QD)
QD = (Q3 - Q1)/2
Coefficient of quartile deviation (CQD)
CQD = (Q3 - Q1)/(Q3 + Q1)
CQD is a pure number (unitless) and is useful for
comparing the variability among the middle 50% of
observations.
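As a quick check, the quartile-based measures can be computed for the infant-weight quartiles quoted earlier (Q1 = 8.8 kg, Q3 = 10.2 kg). The sketch assumes the usual textbook definitions: QD is half the interquartile range, and CQD is the interquartile range divided by Q3 + Q1. The variable names are illustrative only.

```python
# Quartiles for weights of girls at 12 months of age (kg), from the IQR example
q1 = 8.8
q3 = 10.2

iqr = q3 - q1                 # interquartile range, in kg
qd = iqr / 2                  # quartile deviation, in kg (assumed definition)
cqd = (q3 - q1) / (q3 + q1)   # coefficient of quartile deviation, unitless

print(f"IQR = {iqr:.1f} kg, QD = {qd:.1f} kg, CQD = {cqd:.3f}")
```

Because the units cancel in CQD, it can be compared across variables measured on different scales, whereas IQR and QD cannot.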
175. Mean deviation (MD)
•Mean deviation is the average of the absolute deviations taken
from a central value, generally the mean or median.
•Consider a set of n observations x1, x2, ..., xn.
Then,
MD = (1/n) Σ |xi - A|, summed over i = 1, 2, ..., n,
where A is a central value (arithmetic mean or median).
176. Mean deviation …
Properties
• MD overcomes one main objection to the earlier measures: unlike
the range and the IQR, it involves every value
• It is not affected much by extreme values
• Its main drawback is that algebraic negative signs of the
deviations are ignored which is mathematically unsound
• MD is minimum when the deviations are taken from median.
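The following minimal sketch computes the mean deviation about the mean and about the median for a small made-up data set (the numbers are hypothetical and chosen only for illustration). It shows the last property in action: the mean deviation is smaller when taken about the median.

```python
import statistics


def mean_deviation(values, center):
    """Average of the absolute deviations of the values from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


# Hypothetical data with one extreme value (25)
data = [2, 4, 4, 5, 7, 9, 25]

mean = statistics.mean(data)      # 8
median = statistics.median(data)  # 5

md_about_mean = mean_deviation(data, mean)      # 36/7
md_about_median = mean_deviation(data, median)  # 31/7

print(f"MD about mean   = {md_about_mean:.3f}")
print(f"MD about median = {md_about_median:.3f}")
```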
177. The Variance (σ², S²)
• The main objection of mean deviation, that the negative signs are
ignored, is removed by taking the square of the deviations from
the mean.
• The variance is the average of the squares of the deviations taken
from the mean.
181. Variance…
Properties
• The main demerit of variance is, that its unit is the square of the
unit of measurement of variate values
• The variance gives more weight to extreme values than to
values near the mean, because the deviations are squared.
• The drawbacks of variance are overcome by the standard
deviation.
182. Standard deviation (σ, S)
It is the positive square root of the variance.
Properties
•Standard deviation is considered to be the best measure of
dispersion and is used widely because of the properties of the
theoretical normal curve.
•There is, however, one difficulty with it. If the units of
measurement of two series are not the same, then
their variability cannot be compared by comparing the values of
the standard deviations.
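A minimal sketch of the variance and standard deviation calculation is given below. It assumes the usual convention that the population variance (σ²) divides the sum of squared deviations by n, while the sample variance (S²) divides by n - 1. The data set is hypothetical, and the helper name `variance` is illustrative.

```python
import math
import statistics


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1
    (sample, S^2) or by n (population, sigma^2)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    return sum_sq_dev / (n - 1) if sample else sum_sq_dev / n


# Hypothetical data: mean = 5, sum of squared deviations = 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = variance(data, sample=False)  # 32 / 8 = 4.0
pop_sd = math.sqrt(pop_var)             # 2.0
samp_var = variance(data)               # 32 / 7
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd)
print(round(samp_var, 3), round(samp_sd, 3))

# The standard library gives the same results
print(statistics.pvariance(data), statistics.variance(data))
```

Note that the standard deviation is in the original units of measurement, while the variance is in squared units.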
183. Coefficient of variation
• When we wish to compare the variability in two sets of
data, the standard deviation, which measures absolute
variation, may lead to false results.
• The coefficient of variation gives relative variation and is the
best measure for comparing the variability in two sets of
data. Never use SD to compare variability between groups.
• CV = standard deviation / mean
(often multiplied by 100 and expressed as a percentage)
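The sketch below illustrates why the coefficient of variation is preferred for comparisons. The two data sets are hypothetical: heights in cm and weights in kg that happen to have the same standard deviation but very different relative variability. The function name `coefficient_of_variation` is illustrative, and the sample standard deviation is assumed.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical measurements on five people
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

sd_height = statistics.stdev(heights_cm)
sd_weight = statistics.stdev(weights_kg)
cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

# Same SD in absolute terms, but the units differ, so the SDs are not comparable
print(f"SD: height = {sd_height:.2f} cm, weight = {sd_weight:.2f} kg")
# The CVs are unitless and show that weight varies more relative to its mean
print(f"CV: height = {cv_height:.1%}, weight = {cv_weight:.1%}")

# Changing the unit (cm to m) changes the SD but not the CV
heights_m = [h / 100 for h in heights_cm]
cv_height_m = coefficient_of_variation(heights_m)
```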
184. 4.Basic Probability and probability
distributions
• Probability is a mathematical technique for predicting
outcomes. It predicts how likely it is that specific events will
occur.
• An understanding of probability is fundamental for quantifying
the uncertainty that is inherent in the decision-making process
• Probability theory also allows us to draw conclusions about a
population of patients based on known information about a
sample of patients drawn from that population.
185. Basic Probability…
• Mutually exclusive events: Events that cannot occur together
– For example, event A=“Male” and B=“Pregnant” are two
mutually exclusive events (as no males can be pregnant).
• Independent events: The presence or absence of one does not
alter the chance of the other being present.
– one event happens regardless of the other, and its outcome is
not related to the other.
• Probability: If an event can occur in N mutually exclusive and
equally likely ways, and if m of these possess a characteristic E,
the probability of the occurrence of E is P(E) = m/N.
186. 4.1.Properties of probability
1.A probability value must lie between 0 and 1, 0≤P(E)≤1.
A probability can never be more than 1.0, nor can it be negative
• A value 0 means the event can not occur
• A value 1 means the event definitely will occur
• A value of 0.5 means that the probability that the event will
occur is the same as the probability that it will not occur.
• Probability is measured on a scale from 0 to 1.0, as shown in
the following figure of the probability scale.
188. Properties…
2. The sum of the probabilities of all mutually exclusive and exhaustive
outcomes is equal to 1.
P(E1) + P(E2) + .... + P(En) = 1
3. For any two events A and B,
P(A or B) = P(A) + P(B) -P(A and B)
(Addition rule)
For two mutually exclusive events A and B,
P(A or B ) = P(A) + P(B).
4. For any two independent events A and B
– P(A and B) = P(A) P(B).
(Multiplication rule)
189. Properties…
• To calculate the probability of two independent events A and B
both happening, multiply their probabilities. For example, if you
have two identical packs of cards (pack A and pack B), what is the
probability of drawing the ace of spades from both packs?
• Formula: P(A and B) = P(A) x P(B)
P(pack A) = 1 card from a pack of 52 cards = 1/52 = 0.0192
P(pack B) = 1 card from a pack of 52 cards = 1/52 = 0.0192
P(A) x P(B) = 0.0192 x 0.0192 = 0.00037
5. If A’ is the complementary event of the event A,
Then, P(A’) = 1 -P(A).
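The multiplication, addition and complement rules can be checked with exact fractions using the two-pack card example. This is a minimal sketch; the variable names are illustrative only.

```python
from fractions import Fraction

# Drawing the ace of spades from each of two separate 52-card packs
p_a = Fraction(1, 52)  # ace of spades from pack A
p_b = Fraction(1, 52)  # ace of spades from pack B

# Multiplication rule for independent events
p_a_and_b = p_a * p_b            # 1/2704, about 0.00037

# Addition rule for any two events
p_a_or_b = p_a + p_b - p_a_and_b  # 103/2704

# Complement rule
p_not_a = 1 - p_a                # 51/52

print(p_a_and_b, round(float(p_a_and_b), 5))
print(p_a_or_b, p_not_a)
```

Using `Fraction` avoids the small rounding error that arises from multiplying the rounded decimals 0.0192 x 0.0192.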
190. Example
• A study investigated the effect of prolonged exposure to bright
light on retina damage in premature infants. Eighteen of 21
premature infants exposed to bright light developed retinopathy,
while 21 of 39 premature infants exposed to a reduced light level
developed retinopathy. For this sample, the probability of
developing retinopathy is:
P(Retinopathy) = (No. of infants with retinopathy) / (Total No. of infants)
= (18 + 21) / (21 + 39) = 39/60 = 0.65
191. Example…
• The following data are the results of electrocardiograms (ECGs)
and radionuclide angiocardiograms (RAs) for 19 patients with
post-traumatic myocardial contusions. A “+” indicates abnormal
results and a “-” indicates normal results.
• 1. Calculate the probability that both the ECG and the RA are abnormal
• 2. Calculate the probability that either the ECG or the RA is
abnormal
193. Example
Solutions
1. P(ECG abnormal and RA abnormal) = 7/19 = 0.37
2. P(ECG abnormal or RA abnormal) = P(ECG abnormal) + P(RA
abnormal) - P(both ECG and RA abnormal)
= 17/19 + 9/19 - 7/19 = 19/19 = 1
• NB: We cannot calculate the above probability by simply adding the
number of patients with abnormal ECGs to the number with
abnormal RAs, i.e., (17 + 9)/19 = 1.37, which exceeds 1.
• The problem is that the 7 patients whose ECGs and RAs are
both abnormal are counted twice.
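The addition rule in this solution can be verified from the counts quoted above (19 patients, 17 abnormal ECGs, 9 abnormal RAs, 7 abnormal on both). The sketch below is illustrative; the variable names are not from the original data table.

```python
from fractions import Fraction

n_patients = 19
n_ecg_abnormal = 17
n_ra_abnormal = 9
n_both_abnormal = 7

p_ecg = Fraction(n_ecg_abnormal, n_patients)
p_ra = Fraction(n_ra_abnormal, n_patients)
p_both = Fraction(n_both_abnormal, n_patients)

# Addition rule: subtract the overlap so it is not counted twice
p_either = p_ecg + p_ra - p_both

# Naive sum without the correction is not a valid probability
naive_sum = p_ecg + p_ra

print(f"P(both)   = {p_both} = {float(p_both):.2f}")
print(f"P(either) = {p_either}")
print(f"Naive sum = {naive_sum} = {float(naive_sum):.2f} (greater than 1)")
```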