BIOSTATISTICS

School of pharmacy
   (COMH 607)



1.RESEARCH METHODS




1.1.Introduction to Research
What is Research?
• A scientific study to seek hidden knowledge
• A scientific study to answer a question
• A scientific study of causes and effects
• A scientific attempt towards new discoveries
• A systematic method of inquiry
• A logical attempt to find answers to problems
• A systematic approach to a (medical) problem
Statistical Concept of Research
• Research is a systematic collection, analysis
  and interpretation of data in order to solve a
  research question
• It is classified as:
  – Basic research: necessary to generate new
    knowledge and technologies.
  – Applied research: necessary to identify priority
    problems and to design and evaluate policies and
    programs for optimal health care and delivery.


1.2. Types of Epidemiological Design
A. Descriptive studies
• Mainly concerned with the distribution of diseases with
  respect to time, place and person.
• Useful for health managers to allocate resources and to
  plan effective prevention programmes.
• Useful for generating epidemiological hypotheses, an
  important first step in the search for disease
  determinants or risk factors.
• Can use routinely collected information that is
  readily available in many places, so descriptive studies
  are generally less expensive and less time-consuming than
  analytic studies.
• It is the most common type of
  epidemiological design strategy in medical
  literature.
• There are three main types:
   – Correlational
   – Case report or case series
   – Cross-sectional



A.1. Correlational or Ecological
• Uses data from entire populations to compare disease
  frequencies, either between different groups during the same
  period of time or in the same population at different
  points in time.

• Does not provide individual data, rather presents
  average exposure level in the community.

• Causation cannot be ascertained.

• Correlation coefficient is the measure of association in
  correlational studies. It is important to note that
  positive association does not necessarily imply a valid
  statistical association.
Eg.
• Hypertension rates and average per capita salt
  consumption compared between two communities.
• Average per capita fat consumption and breast cancer
  rates compared between two communities.
• Comparing the incidence of dental caries in relation to the
  fluoride content of the water among towns in the Rift
  Valley.
• Mortality from CHD in relation to per capita cigarette
  sales among the regions of Ethiopia.
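
The correlation coefficient used in such comparisons can be sketched as follows. This is only an illustration: the function name pearson_r and the community-level figures are hypothetical and are not taken from any study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical community-level averages (made up for illustration only).
salt_g_per_day = [6.0, 8.0, 10.0, 12.0]       # average per capita salt intake
hypertension_pct = [10.0, 14.0, 15.0, 21.0]   # hypertension rate in each community

r = pearson_r(salt_g_per_day, hypertension_pct)
print(f"r = {r:.2f}")
```

Each data point is a whole community, not a person, so a high r here says nothing about whether the individuals who eat more salt are the ones with hypertension.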


• Strength: Can be done quickly and
  inexpensively, often using available data.
• Limitation:
   – Inability to link exposure with disease.
   – Lack of ability to control for effects of potential
     confounding factors. There may be other factors that
     are the true cause.
   – It may mask a non-linear relationship between
     exposure and disease. For example, alcohol
     consumption and mortality from CHD have a non-linear
     relationship (the curve is “J” shaped).
A.2. Case Report and Case Series
• Describes the experience of a single patient or a group of
  patients with a similar diagnosis. Has limited value,
  but is occasionally revolutionary.

• E.g. Five young homosexual men with Pneumocystis carinii
  pneumonia (PCP) seen between Oct. 1980 and May 1981 in Los
  Angeles aroused concern among physicians. Later, with further
  follow-up and thorough investigation of this unusual
  occurrence of the disease, the diagnosis of AIDS
  was established for the first time.


• Strength:
  – very useful for hypothesis generation.
• Limitations:
  – The report is based on a single patient or a few patients,
    so the finding could have happened just by coincidence.
  – Lack of an appropriate comparison group.




A.3. Cross-Sectional Studies (Surveys)
• Information about the status of an individual with
  respect to the presence or absence of exposure
  and disease is assessed at the same point in time.
  Easy to do; many surveys are like this.

• For factors that remain unaltered over time, such
  as sex, race or blood group, the cross-sectional
  survey can provide evidence of a valid statistical
  association.

• Useful for raising the question of the presence of
  an association rather than for testing a hypothesis.
B. ANALYTIC STUDIES
• Focuses on the determinants of a disease by
  testing the hypothesis formulated from
  descriptive studies, with the ultimate goal of
  judging whether a particular exposure causes or
  prevents disease.
• Broadly classified into two types: observational and
  interventional studies.
  – Both types use “controls”. The use of controls is the
    main distinguishing feature of analytic studies.


B.1. Observational studies
• Information is obtained by observation of events.
   No intervention is done. Cohort and case-control
   studies are in this category.
i. Cohort
• Subjects are selected by exposure, or determinants
   of interest, and followed to see if they develop
   the disease or outcome of interest.
• E.g. Follow 100 children who received BCG
   vaccination and another 100 who didn’t get BCG
   vaccination and see how many of them get
   tuberculosis.

ii. Case Control
• Subjects are selected with respect to presence or
  absence of disease, or outcome of interest, and
  then inquiries are made about past exposure to
  the factor(s) of interest.

• E.g. Take people with and without TB, ask them
  if they ever had BCG vaccination.



B.2. Interventional / Experimental
• The researcher does something about the disease or
  exposure and observes the changes.
• The investigator has control over who gets the exposure
  and who does not. The key is that the investigator
  assigns subjects to either group, whether this is done
  randomly or not.
• Always prospective.
• E.g. Assign children randomly to get chloroquine or
  not, and see how many develop symptomatic
  malaria.


Description of common terms
Statistics: The process of scientifically collecting,
  organizing, summarizing and interpreting data, and of
  drawing inferences about a body of data when only part
  of the data are observed.
Biostatistics: A branch of statistics in which the data being
  analyzed are derived from the biological and medical sciences.
Descriptive statistics: A statistical method that is concerned
  with the collection, organization, summarization, and
  analysis of data from a sample of population.
Inferential statistics: A statistical method that is concerned
  with the drawing of inferences/ conclusions about a
  particular population by selecting and measuring a random
  sample from the population.
Population: Is the largest collection of entities/values of
  a random variable for which we have an interest at a
  particular time. Population could be finite or infinite.
  We can take the whole number of students in a given
  class (e.g. 100 students) as a population.
   • Target population: A collection of items that have
     something in common for which we wish to draw
     conclusions at a particular time.
   • Study Population: The specific population from
     which data are collected



Sample: Some part/subset of the population of interest.
  In the above example, if we randomly select 25 students
  from the 100, we call those 25 students a sample of the class.

   Hence, generalizability is a two-stage procedure: we
    want to generalize from the sample to the study
    population and then from the study population to
    the target population.



E.g.: In a study of the prevalence of HIV among orphan children in
Ethiopia, a random sample of orphan children in Lideta Kifle Ketema
was included.

Target population: All orphan children in Ethiopia
Study population: All orphan children in Addis Ababa
Sample: Orphan children in Lideta Kifle Ketema
Statistical inference: It is the procedure by which we reach a
   conclusion about a population on the basis of the information
   contained in a sample that has been drawn from that population.
Parameter: A descriptive measure computed from the data of a
   population; a numerical expression of population measurements.
   E.g. population mean (µ), population variance, population
   standard deviation, etc.
Statistic: A descriptive measure computed from the data of a
   sample.
Statistical data: Information that is systematically collected,
   tabulated and analyzed, and whose results are interpreted to draw
   conclusions.
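
The distinction between a parameter and a statistic can be illustrated with the class of 100 students used earlier. This is a minimal sketch: the marks are randomly generated, hypothetical values, and the names population, mu, sample and x_bar are chosen here only for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: exam marks of all 100 students in a class.
population = [rng.randint(40, 100) for _ in range(100)]

# Parameter: a descriptive measure computed from the whole population.
mu = mean(population)

# Statistic: the same measure computed from a random sample of 25 students.
sample = rng.sample(population, 25)
x_bar = mean(sample)

print(f"population mean (parameter) = {mu:.2f}")
print(f"sample mean (statistic)     = {x_bar:.2f}")
```

Statistical inference is the step of using x_bar, which we can observe, to say something about mu, which we usually cannot.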

• Data: an aggregate of variables obtained as a result of
  measurement or counting.
• Variable: A characteristic that takes on different values in
  different persons, places, or things.
   – Dependent variable (response): the variable(s) we measure
     as an outcome of interest
   – Independent variable (predictor): the variable(s) that
     determine the outcome




Categorical variable: The notion of magnitude is
  absent or implicit.
– Nominal: has distinct levels that have no inherent
  ordering.
      – With only two categories, it is called
        binary or dichotomous. E.g. sex: male or female
      – With more than two categories, it is called
        polytomous. E.g. color
– Ordinal: has levels that follow a distinct
  ordering.
         E.g. severity of pain (mild, moderate, severe)

Quantitative (numeric) variable: A variable that has magnitude; the
  numbers represent actual measurable quantities rather than mere labels.
• Discrete data: restricted to taking only specified
  values, often integers or counts, that differ by fixed
  amounts.
        e.g. Number of new AIDS cases reported during a one-year
             period, number of beds available in a particular
             hospital
• Continuous data: represent measurable quantities but are
  not restricted to taking on certain specific values, i.e.
  fractional values are possible. Can use an interval scale (no
  true zero value) or a ratio scale (begins at zero).
        e.g. weight, cholesterol level, time, temperature



1.3.Sampling Methods

Sampling
• The process of selecting a portion of the population to represent
  the entire population.
• A main concern in sampling:
   – Ensure that the sample represents the population, so that
       • the findings can be generalized.




Advantages of sampling:

• Feasibility: Sampling may be the only feasible method of
  collecting information.
• Reduced cost: Sampling reduces demands on resources such
  as finance, personnel, and material.
• Greater accuracy: Sampling may lead to more accurate data
  collection.
• Sampling error: Precise allowance can be made for sampling
  error
• Greater speed: Data can be collected and summarized more
  quickly



Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of discrimination within the
  population.
• Sampling may be inadvisable where every unit in the population is
  legally required to have a record.
Errors in sampling
1) Sampling error: Errors introduced due to selection of a sample.
   – They cannot be avoided or totally eliminated.
2) Non-sampling error:
   – Observational error
   – Respondent error
   – Lack of preciseness of definition
   – Errors in editing and tabulation of data
Divisions of Sampling Methods

Two broad divisions:

A. Probability sampling methods

B. Non-probability sampling methods




1.4.1. Probability sampling
• Involves random selection of a sample

• A sample is obtained in a way that ensures every member of the
  population has a known, non-zero probability of being
  included in the sample.

• Involves the selection of a sample from a population, based on
  chance.




• Probability sampling is:
   – more complex,
   – more time-consuming and
   – usually more costly than non-probability
     sampling.

• However, because study samples are randomly selected and
  their probability of inclusion can be calculated,
   – reliable estimates can be produced and
       • inferences can be made about the population.




• There are several different ways in which a probability sample can
  be selected.

• The method chosen depends on a number of factors, such as
   – the available sampling frame,
   – how spread out the population is,
   – how costly it is to survey members of the population




Most common probability sampling methods

 1.   Simple random sampling
 2.   Systematic random sampling
 3.   Stratified random sampling
 4.   Cluster sampling
 5.   Multi-stage sampling




1. Simple random sampling (SRS)
• Involves random selection
• Each member of a population has an equal chance of being
  included in the sample.
• To use the SRS method:
   – Make a numbered list of all the units in the population
   – Each unit should be numbered from 1 to N
          (where N is the size of the population)
   – Select the required number of units at random.




• The randomness of the sample is ensured by:
      • use of “lottery” methods
      • a table of random numbers
      • computer programs

• Example
• Suppose your school has 500 students and you need to
  conduct a short survey on the quality of the food served in the
  cafeteria.
• You decide that a sample of 10 students should be sufficient
  for your purposes.
• In order to get your sample, you assign a number from 1 to
  500 to each student in your school.

• To select the sample, you use a table of randomly generated
  numbers.
• Pick a starting point in the table (a row and column number)
  and look at the random numbers that appear there. In this
  case, since the data run into three digits, the random
  numbers would need to contain three digits as well.
• Ignore 000 and all random numbers greater than 500 because they
  do not correspond to any of the students in the school.
• Remember that the sample is without replacement, so if a
  number recurs, skip over it and use the next random
  number.
• The first 10 different numbers between 001 and 500 make
  up your sample
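
The cafeteria example can be mimicked in code. The sketch below is only an illustration: it replaces the printed random-number table with Python's pseudo-random generator, and the function name sample_from_random_digits and the seed value are assumptions made here. It only works for populations of up to 999 units, because it draws three-digit numbers.

```python
import random

def sample_from_random_digits(population_size, sample_size, seed=None):
    """Draw three-digit random numbers until enough distinct valid ones are found."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < sample_size:
        number = rng.randint(0, 999)        # one three-digit entry, 000 to 999
        if number < 1 or number > population_size:
            continue                        # no student carries this number
        if number in chosen:
            continue                        # without replacement: skip repeats
        chosen.append(number)
    return chosen

students = sample_from_random_digits(population_size=500, sample_size=10, seed=42)
print(students)
```

In practice the same result is obtained more directly with random.sample(range(1, 501), 10), which also draws without replacement.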


• SRS has certain limitations:
   – Requires a sampling frame.
   – Difficult if the reference population is dispersed.
   – Minority subgroups of interest may not be selected.




2. Systematic random sampling
• Sometimes called interval sampling; systematic sampling means that
  there is a gap, or interval, between each selected unit in the
  sample.
• The selection is systematic rather than random.
   – Individuals are chosen at regular intervals from the sampling
     frame. Ideally we randomly select a number to tell us where
     to start selecting individuals from the list.
• Important if the reference population is arranged in some order:
   – Order of registration of patients
   – Numerical order of house numbers
   – Students’ registration books
• Individuals are taken at fixed intervals (every kth) based on the
  sampling fraction, e.g. if the sample includes 20%, then every
  fifth.
Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where N is the
     total population size).
2. Determine the sampling interval (K) by dividing the number of
     units in the population by the desired sample size.




Steps…
• During a survey, it is important to work out how many houses
   must be visited to find one study unit, usually by
   doing a pilot study.
• Example: Assume you are doing a study involving children under
   5. There are 1500 households in all, and you have a required
   sample size of 100 children. From a preliminary study you have
   done, there is one child every 2.5 households. Normally, if there
   were a child in every household, you would visit 100 households.
   But because not every household includes a child, you will need
   to visit 100 x 2.5 or 250 households to find the required 100
   children.
• The sampling interval will therefore be 1500/250 = 6, i.e. every 6th
   household.
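
The arithmetic in this example can be written out as a short sketch. The function names households_to_visit and sampling_interval are hypothetical and are used here only for illustration.

```python
import math

def households_to_visit(required_children, households_per_child):
    """Households needed to find the required number of children."""
    return math.ceil(required_children * households_per_child)

def sampling_interval(total_households, households_needed):
    """Sampling interval K: take every K-th household on the list."""
    return total_households // households_needed

visits = households_to_visit(required_children=100, households_per_child=2.5)
k = sampling_interval(total_households=1500, households_needed=visits)
print(visits, k)
```

With the figures from the example, visits is 250 and k is 6.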


3. Select a number between one and K at random. This number is
     called the random start and would be the first number included
     in your sample.

4. Select every Kth unit after that first number

      Note: Systematic sampling should not be used when a
     cyclic repetition is inherent in the sampling frame.




Example

To select a sample of 100 from a population of 400, you would need
     a sampling interval of 400 ÷ 100 = 4.
Therefore, K = 4.
You will need to select one unit out of every four units to end up with
     a total of 100 units in your sample.
Select a number between 1 and 4 from a table of random numbers.
• If you choose 3, the third unit on your frame would be the first
     unit included in your sample;

•     The sample might consist of the following units to make up a
      sample of 100: 3 (the random start), 7, 11, 15, 19...395, 399 (up to
      N, which is 400 in this case).
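
The four steps can be sketched as follows, assuming the population size is an exact multiple of the sample size as in this example. The function name systematic_sample and the seed are illustrative choices, not part of the original material.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start in 1..K, then every K-th unit after it."""
    rng = random.Random(seed)
    k = population_size // sample_size     # step 2: sampling interval K
    start = rng.randint(1, k)              # step 3: random start between 1 and K
    return [start + i * k for i in range(sample_size)]   # step 4

units = systematic_sample(population_size=400, sample_size=100, seed=1)
print(units[:5], "...", units[-2:])
```

If the random start happens to be 3, the list is 3, 7, 11, 15, 19, ..., 395, 399, as in the example.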

The main difference from SRS: with SRS, any combination of 100 units
  would have a chance of making up the sample, while with systematic
  sampling, there are only four possible samples.
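
To get a feel for the size of this difference, the two counts can be computed directly. This is a small illustrative calculation using Python's standard math.comb; the variable names are chosen here.

```python
import math

# Number of distinct samples of 100 units that SRS could produce from 400 units.
srs_samples = math.comb(400, 100)

# Systematic sampling with K = 4: one possible sample per random start (1, 2, 3 or 4).
systematic_samples = 400 // 100

print(f"SRS: a number with {len(str(srs_samples))} digits")
print(f"Systematic: {systematic_samples}")
```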




Advantages

• Systematic sampling is usually less time-consuming and easier to
  perform than SRS.
• It provides a good approximation to SRS (i.e. it has comparably
  high precision).
• Unlike SRS, systematic sampling can be conducted without a
  sampling frame, so it is useful when a sampling frame is not
  readily available.
   – E.g. In patients attending a health center, where it is not
     possible to predict in advance who will be attending




Disadvantage


• If there is any sort of cyclic pattern in the ordering of the
  subjects, which coincides with the sampling interval, the sample
  will not be representative of the population.
   – May result in systematic error




3. Stratified random sampling

• It is done when the population is known to be heterogeneous with
  regard to some factors, and those factors are used for stratification.
• Using stratified sampling, the population is divided into
  homogeneous, mutually exclusive groups called strata.
   – A population can be stratified by any variable that is available for all units prior
     to sampling (e.g., age, sex, province of residence, income, etc.).
• A separate sample is taken independently from each stratum.
• Any of the sampling methods mentioned in this section (and others
  that exist) can be used to sample within each stratum.
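
A minimal sketch of stratified sampling, using simple random sampling within each stratum. The strata, the unit numbers and the per-stratum sample sizes below are hypothetical, and the function name stratified_sample is an illustrative choice.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw an independent simple random sample from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}

# Hypothetical frame: units 1-60 form one stratum, units 61-100 the other.
strata = {"male": list(range(1, 61)), "female": list(range(61, 101))}
picked = stratified_sample(strata, sizes={"male": 6, "female": 4}, seed=7)
print(picked)
```

Every stratum contributes units to the sample, which is what distinguishes this design from selecting whole groups at random.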




Why do we need to create strata?
• Stratification can make the sampling strategy more efficient.
• A larger sample is required to get a more accurate estimation if a
  characteristic varies greatly from one unit to the other.
• For example, if every person in a population had the same salary, then
  a sample of one individual would be enough to get a precise estimate
  of the average salary.
• This is the idea behind the efficiency gain obtained with stratification.
   – If you create strata within which units share similar characteristics
      (e.g., income) and are considerably different from units in other
      strata (e.g., occupation, type of dwelling) then you would only need
      a small sample from each stratum to get a precise estimate of total
      income for that stratum.


– Then you could combine these estimates to get a precise
      estimate of total income for the whole population.
• If you use a SRS approach in the whole population without
  stratification, the sample would need to be larger than the total
  of all stratum samples to get an estimate with the same level of
  precision.




• Stratified sampling ensures an adequate sample size for sub-
  groups in the population of interest.

• When a population is stratified, each stratum becomes an
  independent population and you will need to decide the sample
  size for each stratum.




• Equal allocation:
   – Allocate an equal sample size to each stratum
• Proportionate allocation:
   $$ n_j = \frac{n}{N} N_j, \quad j = 1, 2, \ldots, k $$
   where k is the number of strata and
   – n_j is the sample size of the jth stratum
   – N_j is the population size of the jth stratum
   – n = n_1 + n_2 + ... + n_k is the total sample size
   – N = N_1 + N_2 + ... + N_k is the total population size
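
Proportionate allocation can be sketched in code as follows. The rounding rule used here (largest remainders, so that the stratum samples add up to n exactly) is an assumption added for the sketch and is not specified in the formula; the stratum sizes are hypothetical.

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Allocate total_sample across strata in proportion to stratum size."""
    N = sum(stratum_sizes)
    exact = [total_sample * Nj / N for Nj in stratum_sizes]   # n_j = (n / N) * N_j
    alloc = [int(e) for e in exact]                           # round down first
    leftover = total_sample - sum(alloc)
    # Assumption: hand remaining units to the strata with the largest fractions.
    order = sorted(range(len(exact)), key=lambda j: exact[j] - alloc[j], reverse=True)
    for j in order[:leftover]:
        alloc[j] += 1
    return alloc

# Hypothetical strata of 500, 300 and 200 units; total sample n = 100.
allocation = proportionate_allocation([500, 300, 200], total_sample=100)
print(allocation)
```

For these figures the exact values are whole numbers, so no rounding is needed and the allocation is 50, 30 and 20.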




4. Cluster sampling

 • Sometimes it is too expensive to spread a sample across the
   population as a whole.
 • Travel costs can become expensive if interviewers have to
   survey people from one end of the country to the other.
 • To reduce costs, researchers may choose a cluster sampling
   technique
 • The clusters should be similar to one another, unlike in
   stratified sampling, where the strata differ from one another




Steps in cluster sampling

• Cluster sampling divides the population into groups or clusters.
• A number of clusters are selected randomly to represent the total
  population, and then all units within selected clusters are
  included in the sample.
• No units from non-selected clusters are included in the sample—
  they are represented by those from selected clusters.
• This differs from stratified sampling, where some units are
  selected from each group.
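
The steps above can be sketched as follows. The school, its sections and the student labels are hypothetical, and the function name cluster_sample is an illustrative choice.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    selected = rng.sample(sorted(clusters), n_clusters)
    sample = [unit for name in selected for unit in clusters[name]]
    return selected, sample

# Hypothetical school: section name -> students in that section.
school = {
    "Section A": ["A1", "A2", "A3"],
    "Section B": ["B1", "B2"],
    "Section C": ["C1", "C2", "C3", "C4"],
    "Section D": ["D1", "D2", "D3"],
}
selected_sections, students = cluster_sample(school, n_clusters=2, seed=3)
print(selected_sections, len(students))
```

The final sample size depends on which sections are drawn, because the sections do not all have the same number of students.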




Example
• In a school based study, we assume students of the same school are
  homogeneous.

• We can randomly select sections and include all students of the
  selected sections only.




• As mentioned, cost reduction is a reason for using cluster
  sampling.

• It creates 'pockets' of sampled units instead of spreading the
  sample over the whole territory.

• Another reason is that sometimes a list of all units in the
  population is not available, while a list of all clusters is either
  available or easy to create.




• In most cases, the main drawback is a loss of efficiency when
  compared with SRS.

• It is usually better to survey a large number of small clusters instead
  of a small number of large clusters.
   – This is because neighboring units tend to be more alike, resulting
      in a sample that does not represent the whole spectrum of
      opinions or situations present in the overall population.




• Another drawback to cluster sampling is that you do not have total
  control over the final sample size.
• Not all schools have the same number of (say Grade 11)
  students, and city blocks do not all have the same number of
  households. Because you must interview every student or household
  in the selected clusters, the final sample size may be larger or
  smaller than you expected.




5. Multi-stage sampling


• Similar to the cluster sampling, except that it involves picking a
  sample from within each chosen cluster, rather than including all
  units in the cluster.
• This type of sampling requires at least two stages.




• In the first stage, large groups or clusters are identified and
  selected. These clusters contain more population units than are
  needed for the final sample.

• In the second stage, population units are picked from within the
  selected clusters (using any of the possible probability sampling
  methods) for a final sample.
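
A two-stage design can be sketched as follows, using simple random sampling at both stages. The villages, household labels and sample sizes are hypothetical, and the function name two_stage_sample is an illustrative choice.

```python
import random

def two_stage_sample(clusters, n_clusters, units_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: pick units within each one."""
    rng = random.Random(seed)
    stage1 = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(units_per_cluster, len(clusters[name])))
        for name in stage1
    }

# Hypothetical frame: 5 villages with 20 households each.
villages = {f"Village {v}": [f"V{v}-H{h}" for h in range(1, 21)] for v in range(1, 6)}
chosen = two_stage_sample(villages, n_clusters=2, units_per_cluster=5, seed=11)
for village, households in chosen.items():
    print(village, households)
```

Only the selected villages need a household list, which is the practical saving over a full sampling frame.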




• If more than two stages are used, the process of choosing
  population units within clusters continues until there is a final
  sample.

• With multi-stage sampling, you still have the benefit of a more
  concentrated sample for cost reduction.

• However, the sample is not as concentrated as in cluster sampling,
  and the required sample size is still bigger than for a simple
  random sample.




• Also, you do not need to have a list of all of the units in the
  population. All you need is a list of clusters and list of the units in
  the selected clusters.

• Admittedly, more information is needed in this type of sample
  than what is required in cluster sampling. However, multi-stage
  sampling still saves a great amount of time and effort by not
  having to create a list of all the units in a population.




1.4.2. Non-probability sampling

• The difference between probability and non-probability
  sampling has to do with a basic assumption about the nature of
  the population under study.

• In probability sampling, every item has a known chance of being
  selected.

• In non-probability sampling, there is an assumption that there is an
  even distribution of a characteristic of interest within the
  population.




• This is what makes the researcher believe that any sample would
  be representative and because of that, results will be accurate.

• For probability sampling, randomness is a feature of the selection
  process, rather than an assumption about the structure of the
  population.




• In non-probability sampling, since elements are chosen
  arbitrarily, there is no way to estimate the probability of any one
  element being included in the sample.

• Also, no assurance is given that each item has a chance of being
  included, making it impossible either to estimate sampling
  variability or to identify possible bias




• Reliability cannot be measured in non-probability sampling; the only
  way to address data quality is to compare some of the survey results
  with available information about the population.

• Still, there is no assurance that the estimates will meet an acceptable
  level of error.

• Researchers are reluctant to use these methods because there is no
  way to measure the precision of the resulting sample.




• Despite these drawbacks, non-probability sampling methods can
  be useful when descriptive comments about the sample itself are
  desired.
• Secondly, they are quick, inexpensive and convenient.
• There are also other circumstances, such as in some research
  settings, when it is unfeasible or impractical to conduct
  probability sampling.




Common types of non-probability sampling

1.    Convenience or haphazard sampling
2.    Volunteer sampling
3.    Judgment sampling
4.    Quota sampling
5.    Snowball sampling technique




1.4. Scales of measurement
• Measurement: the assignment of numbers or names to objects or
  events according to a set of rules.
• Clearly not all measurements are the same.
• Measuring an individual’s weight is qualitatively different from
  measuring their response to some treatment on a three-category
  scale: “improved”, “stable”, “not improved”.
• Measuring scales are different according to the degree of
  precision involved.
• There are four types of scales of measurement.



Scales…

1. Nominal scale: uses names, labels, or symbols to assign each
  measurement to one of a limited number of categories that
  cannot be ordered.
   – Examples: Blood type, sex, race, marital status

2. Ordinal scale: assigns each measurement to one of a limited
   number of categories that are ranked in terms of a graded order.
    – Examples: Patient status, Cancer stages




Scales…

3. Interval scale: assigns each measurement to one of an unlimited
   number of categories that are equally spaced. It has no true zero
   point.
    – Example: Temperature measured on Celsius or Fahrenheit
4. Ratio scale: measurement begins at a true zero point and the
   scale has equal spacing.
    – E.g.: Height, weight, blood pressure
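
Why the lack of a true zero matters can be seen with a small calculation. The sketch below uses the standard Celsius-to-Fahrenheit conversion; the three temperatures are arbitrary illustrative values.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion between the two interval scales."""
    return c * 9 / 5 + 32

cold_c, mild_c, warm_c = 10, 20, 30
cold_f, mild_f, warm_f = (celsius_to_fahrenheit(t) for t in (cold_c, mild_c, warm_c))

# Ratios of values depend on the arbitrary zero point, so they are not meaningful.
ratio_celsius = mild_c / cold_c          # 2.0
ratio_fahrenheit = mild_f / cold_f       # 68 / 50 = 1.36

# Ratios of differences (equal spacing) agree on both scales.
diff_ratio_celsius = (warm_c - mild_c) / (mild_c - cold_c)
diff_ratio_fahrenheit = (warm_f - mild_f) / (mild_f - cold_f)

print(ratio_celsius, ratio_fahrenheit)
print(diff_ratio_celsius, diff_ratio_fahrenheit)
```

So 20 °C is not "twice as hot" as 10 °C, whereas on a ratio scale such as weight, 20 kg is twice 10 kg in any unit.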




1.5.Validity and reliability
Validity and Reliability are two major
  requirements for any measurement.
  – Validity pertains to the correctness of the
    measure; a valid tool measures what it is
    supposed to measure.
  – Reliability pertains to the consistency of the tool
    across different contexts.
• Validity is often described as internal or
  external.

1.6. Sources and methods of data collection and its handling
Sources
Two major sources:

Primary data: data collected by the investigator himself/herself
   for the purpose of a specific inquiry or study.
Such data are original in character and are mostly generated by surveys
   conducted by individuals or research institutions.
First-hand information obtained by the investigator tends to be more
   reliable and accurate, since the investigator can extract the correct
   information by removing doubts, if any, in the minds of the respondents
   regarding certain questions. High response rates might be obtained since
   the answers to various questions are obtained on the spot. It also permits
   explanation of questions concerning difficult subject matter.
                                                                           71
Secondary data
Secondary data: when an investigator uses data that have
  already been collected by others, such data are called "secondary
  data". They are primary data for the agency that collected
  them, and become secondary for anyone else who uses them
  for his/her own purposes.
Secondary data can be obtained from journals, reports of
  different institutions, government publications, and publications of
  professionals and research organizations. These data are less
  expensive and can be collected in a short time.



                                                                 72
Data collection methods
1.Observation
• is a technique that involves systematically selecting, watching and
  recording behaviours of people or other phenomena and aspects
  of the setting in which they occur, for the purpose of getting
  specified information.
• includes all methods from simple visual observations to the use
  of high level machines and measurements, sophisticated
  equipment or facilities, such as radiographic, biochemical, X-ray
  machines, microscope, clinical examinations, and microbiological
  examinations.

                                                                   73
Observation…
• Advantages: Gives relatively more accurate data on behaviour
  and activities
• Disadvantages:
   – Subject to the investigator's or observer's own biases, prejudices,
     desires, etc.
   – Needs more resources and skilled human power when high level
     machines are used.




                                                                  74
2. The Documentary sources
• Include clinical records and other personal records, published
   mortality statistics, census publications, etc.
• Advantages:
a) Documents can provide ready-made information relatively easily
b) The best means of studying past events
• Disadvantages:
a) Problems of reliability and validity (because the information is
   collected by a number of different persons who may have used
   different definitions or methods of obtaining data).
b) Errors may occur when the information is extracted from the
   records.

                                                                  75
3. Interviews and self-administered questionnaire

a) Interviews: may be less or more structured.
A public health worker conducting interviews may be armed with a
  checklist of topics, but may not decide in advance precisely what
  questions he/she will ask.


• This approach is flexible; the content, wording and order of the
  questions are relatively unstructured and vary from interview to
  interview.

                                                                          76
Interviews…

On the other hand, in other situations a more standardized technique may
   be used, the wording and order of the questions being decided in
   advance.

This may take the form of a highly structured interview (interviewing using a
   questionnaire):
• the investigator appoints persons/enumerators, who go to the
   respondents personally with the questionnaire, ask them questions and
   record their replies.
    – This can be done using telephone or face-to-face interviews.

                                                                        77
Interviews…
• Questions may take two general forms: they may be “open
  ended” questions, which the subject answers in his/her own
  words,
• or “closed” questions, which are answered by choosing from a
  number of fixed alternative responses.




                                                            78
Advantages of interviews

• A good interviewer can stimulate and maintain the respondent’s
  interest. This leads to the frank answering of questions.
• If anxiety is aroused (e.g., why am I being asked these
  questions?) , the interviewer can allay it.
An interviewer:
• can repeat questions which are not understood, and give
  standardized explanations where necessary.
• can ask “follow-up” or “probing” questions to clarify a
  response.
• can make observations during the interview, i.e., note is taken
  not only of what the subject says but also of how he/she says it.

                                                              79
b) Self-administered questionnaire


• The respondent reads the questions and fills in the answers by
  himself/herself (sometimes in the presence of an interviewer
  who “stands by” to give assistance if necessary).
• Self-administered questionnaires are simpler and cheaper to
  use.
• They can be administered to many persons simultaneously (e.g. to
  a class of school children).
• They can be sent by post. However, they demand a certain
  level of education on the part of the respondent.

                                                              80
• Quantitative data are commonly collected using structured
  interviews (where standard questionnaires are common and the
  collected data can be processed relatively easily), whereas
• qualitative data are usually collected using unstructured
  interviews.
• Unstructured interviews are undertaken with the help of
  checklists, key informant interviews, focus group discussions, etc.


                                                                   81
Qualitative…
Checklist - is a list of questions prepared ahead of time to facilitate
  the interviews or discussions. It is not an exhaustive one. It helps
  the facilitator not to miss any of the important topics under
  consideration.
Key informant interviews – interviews done with influential
  individuals (such as community elders, priests, etc.).
Focus group discussions – discussions held with a group of
  respondents.
• The group contains 6 to 12 people who are more or less similar
  with respect to level of education, marital status, age, sex, etc.
  (this composition helps each respondent to talk freely without
  being dominated by the others).

                                                                     82
Steps in Questionnaire Design

1. Before beginning to construct the questionnaire, make sure that
  it is the best method of collecting data for your objectives
   – Know beforehand what information is needed and what is
      going to be done with this information

2. While drafting the questions, one has to know why each question is
  asked and what will be done with the information (to prevent
  wasting resources)




                                                                83
Steps in…
3. To get valid and reliable information:
• the wording and sequence of the questions should help
  respondents to recall information and guard against
  forgetfulness
• avoid difficult, time-consuming, embarrassing or too
  personal questions
• the flow of questions should be from simple to complex,
  from general to specific, and from impersonal to personal
• care should be taken to keep respondents' answers confidential
• include a cover letter (if by mail)
• identify respondents by ID (rather than by name)
                                                         84
Data Collection and handling
          Process




                               85
Data collection
A plan for data collection can be made in two steps:
1. Listing the tasks that have to be carried out and who
   should be involved, making a rough estimate of the time needed
   for the different parts of the study, and identifying the most
   appropriate period in which to carry out the research
2. Actually scheduling the different activities that have to
   be carried out each week in a work plan




                                                                86
Why should you develop a plan for data
                collection?
A plan for data collection should be developed so that:
   – you will have a clear overview of what tasks have to be carried out,
     who should perform them, and the duration of these tasks;
   – you can organize both human and material resources for data
     collection in the most efficient way; and
   – you can minimize errors and delays which may result from lack of
     planning (for example, the population not being available or
     data forms being misplaced).




                                                                       87
Data collection process

Stages

• Stage 1: Permission to proceed
   – Obtaining consent from the relevant authorities,
     individuals and the community in which the project
     is to be carried out




                                                          88
Data collection process
Stage 2: Data collection
• Logistics
   – who will collect what,
   – when and
   – with what resources


• Quality control
   – Prepare a field work manual
   – Select your research assistants
   – Train research assistants
   – Supervise data collection
   – Check the data for completeness and accuracy
                                                   89
Data collection process
• How long will it take to collect the data for each
  component of the study?
   – Step 1: Consider the time required to reach the study
     area; to locate the study units; the number of visits
     required per study unit and for follow-up of non-
     respondents
   – Step 2: Calculate the number of interviews that can
     be carried out per person per day
   – Step 3: Calculate the number of days needed to carry
     out the interviews.
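
The three steps above amount to a small piece of arithmetic. The following is a minimal sketch of it; the function name and the figures (400 interviews, 5 interviewers, 8 interviews per person per day) are hypothetical and only illustrate the calculation.

```python
import math


def days_needed(total_interviews, interviewers, interviews_per_person_per_day):
    """Step 3: working days needed to complete all interviews."""
    # Step 2 gives the number of interviews one person can do per day;
    # multiplying by the team size gives the team's daily capacity.
    team_capacity_per_day = interviewers * interviews_per_person_per_day
    # Round up, because a partly used day still has to be scheduled.
    return math.ceil(total_interviews / team_capacity_per_day)


# Hypothetical study: 400 interviews, 5 interviewers, 8 interviews each per day
field_days = days_needed(400, 5, 8)
print(field_days)
```

Travel time, repeat visits and follow-up of non-respondents (Step 1) lower the realistic number of interviews per person per day, so that figure should be estimated conservatively.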

                                                         90
Ensuring data quality
Measures to help ensure good quality of data:
• Prepare a field work manual for the research team
  as a whole
• Select your research assistants, if required, with
  care
• Train research assistants carefully in all topics
  covered in the field work manual as well as in
  interview techniques
• Pre-test research instruments and research
  procedures with the whole research team,
  including research assistants.
                                                  91
Ensuring data quality
• Take care that research assistants are not placed
  under too much stress
• Arrange for on-going supervision of research
  assistants and develop guidelines
  for supervisory tasks.
• Devise methods to assure the quality of data
  collected by all members of the research team.



                                                      92
Data Collection Process
Stage 3: Data handling
• Once the data have been collected and checked for
  completeness and accuracy, a clear procedure should be
  developed for handling and storing them
• Numbering of all questionnaires
• Identify the person responsible for storing data and the
  place where it will be stored
• Decide how data should be stored. Record forms
  should be kept in the sequence in which they have been
  numbered.

                                                         93
Research Assistants
• This includes data collectors, supervisors and,
  possibly, local guides
• Selection – during selection one should consider
  similarity in educational level and, where relevant, sex
  composition
• Training – all research assistants and team
  members should be trained together



                                                 94
Pre-test and pilot study
A pre-test usually refers to a small-scale trial of particular
  research components.

A pilot study is the process of carrying out a preliminary
  study, going through the entire research procedure with a small
  sample.

Why do we carry out a pre-test or pilot study?

A pre-test or pilot study serves as a trial run that allows us
  to identify potential problems in the proposed study.


                                                                    95
Pre-test and pilot study
What aspects of your research methodology can be
   evaluated during pre-testing?
1. Reactions of the respondents to the research
   procedures can be observed in the pre-test – availability
   and willingness
2. The data-collection tools can be pre-tested
3. Sampling procedures can be checked
4. Staffing and activities of the research team can be
   checked, while all are involved in the pre-test
5. Procedures for data processing and analysis can be
   evaluated during the pre-test
6. The proposed work plan and budget for research
   activities can be assessed during the pre-test.
                                                         96
Plan for data processing & analysis
• Data processing and analysis should start in the
  field, with checking for completeness of the data
  and performing quality control checks, while sorting
  the data by instrument used and by group of
  informants
• Data from small samples may even be processed and
  analyzed as soon as they are collected.


                                                 97
Plan for data processing & analysis
• The plan for data processing and analysis must be made
  after careful consideration of the objectives of the study as well as
  of the tools developed to meet the objectives.

• The procedures for the analysis of data collected
  through qualitative and quantitative techniques are quite
  different.
    – For quantitative data the starting point in analysis is usually a
      description of the data for each variable
    – For qualitative data it is more a matter of describing,
      summarizing and interpreting the data obtained for each
      study unit

                                                                      98
Plan for data processing & analysis
• When making a plan for data processing and
  analysis the following issues should be
  considered:
  – Sorting data,
  –  Performing quality-control checks,
  –  Data processing, and
  –  Data analysis.




                                               99
Data processing and analysis
• Sorting data
  – Into groups of different study populations or
    comparison groups
• Quality control checks
  – Check again for completeness and internal
    consistency
  – Missing data: if many items are missing, exclude the questionnaire
  – Inconsistencies: correct them, return the questionnaire for
    clarification, or exclude it



                                                       100
Data processing
• Decide whether to process and analyse the data from
  questionnaires:
   – manually, using data master sheets or manual compilation of
     the questionnaires, or
   – by computer, for example, using a micro-computer and
     existing software or self-written programmes for data
     analysis.
• Data processing in both cases involves:
      • categorising the data,
      • coding, and
      • summarising the data in data master sheets, manual compilation
        without master sheets, or
      • data entry and verification by computer.

                                                                         101
2.Descriptive statistics

  (Data summarization)




                           102
2.Data summarization(Descriptive statistics)
2.1.Describing variables
The methods of describing variables differ depending on the
  type of data:
• Categorical or numerical
Sometimes we transform numeric data into categorical data, e.g.
  age into age groups,
   – when a lesser degree of detail is required.
• This is achieved by dividing the range of values that the
 numeric variable takes into intervals.
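
As an illustration of turning a numeric variable into a categorical one, the sketch below groups ages into 10-year intervals. The function name, the interval width and the list of ages are assumptions chosen for the example, not part of the course material.

```python
def age_group(age, width=10):
    """Return a label such as '20-29' for the interval that contains `age`."""
    lower = (age // width) * width          # lower limit of the interval
    return f"{lower}-{lower + width - 1}"   # upper limit is inclusive


ages = [23, 37, 41, 29, 58, 35]  # hypothetical ages in completed years
groups = [age_group(a) for a in ages]
print(groups)
```

Wider intervals give a more compact summary but lose more detail, which is the trade-off described above.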



                                                                103
Describing…

Categorical variables
• Table of frequency distributions
   – Frequency
   – Relative frequency
   – Cumulative frequencies
• Charts
   – Bar charts
   – Pie charts




                                     104
In summary,
• There are three ways we can summarize and present data:
• Tabular representation - summarizing data by making a table of
  the data called frequency distributions.
• Graphical representation of data - we can make a graph of the
  data.
• Numerical representation of data - we can use a single number to
  represent many numbers.
   – Measures of central tendency.
   – Measures of variability.




                                                                106
2.2. Frequency Distribution
• A frequency distribution shows the number of observations falling
  into each of several ranges of values.
• There are four different types of frequency distribution:
   – Simple frequency distribution (or it can be just called a
      frequency distribution).
   – Cumulative frequency distribution.
   – Grouped frequency distribution.
   – Cumulative grouped frequency distribution.
• They are portrayed as frequency tables, histograms, or polygons
• Can show either the actual number of observations falling in each
  range or the percentage of observations. In the latter instance, the
  distribution is called a relative frequency distribution
                                                                  107
Simple frequency distribution
Consider the following set of data which are the high
temperatures recorded for 30 consecutive days. We wish
to summarize this data by creating a frequency
distribution of the temperatures.

   Data Set - High Temperatures for 30 Days
   50      45       49     50       43
   49        50       49        45    49
   47        47       44        51    51
   44        47       46        50    44
   51        49       43        43    49
   45        46       45        51    46



                                                         108
Simple frequency distribution…
To create a frequency distribution from these
data, proceed as follows:
1. Identify the highest and lowest values in the
   data set. For our temperatures the highest
   temperature is 51 and the lowest temperature is
   43.
2. Create a column with the title of the variable we
   are using, in this case temperature. Enter the
   highest score at the top, and include all values
   within the range from the highest score to the
   lowest score.


                                                       109
Simple frequency…

3. Create a tally column to keep track of the scores as you
   enter them into the frequency distribution. Once the
   frequency distribution is completed you can omit this
   column
4. Create a frequency column and record in it the frequency of each
   value, as shown in the tally column.
5. At the bottom of the frequency column record the total
   frequency for the distribution, preceded by N =
6. Enter the name of the frequency distribution at the top
   of the table.
                                                         110
Simple frequency…
If we applied these steps to the temperature data above
we would have the following frequency distribution
  Frequency Distribution for High Temperatures
  Temperature      Tally     Frequency
  51               ////      4
  50               ////      4
  49               //// /    6
  48                         0
  47               ///       3
  46               ///       3
  45               ////      4
  44               ///       3
  43               ///       3
                   N =       30
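
The tallying steps can also be sketched in a few lines of Python. This is only an illustration using the standard library's `Counter`; the variable names are our own.

```python
from collections import Counter

# High temperatures for 30 consecutive days (the data set above)
temps = [
    50, 45, 49, 50, 43,
    49, 50, 49, 45, 49,
    47, 47, 44, 51, 51,
    44, 47, 46, 50, 44,
    51, 49, 43, 43, 49,
    45, 46, 45, 51, 46,
]

counts = Counter(temps)  # plays the role of the tally column

# List every value from highest to lowest, including values that never occur
rows = [(t, counts[t]) for t in range(max(temps), min(temps) - 1, -1)]

print(f"{'Temperature':<12}{'Frequency':>9}")
for value, freq in rows:
    print(f"{value:<12}{freq:>9}")
print(f"{'N =':<12}{sum(freq for _, freq in rows):>9}")
```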

                                                          111
Cumulative frequency distribution
To create a cumulative frequency distribution:
• Create a frequency distribution
• Add a column entitled cumulative frequency
• The cumulative frequency for each score is the sum of the
  frequencies of all scores up to and including that
  score
• The highest cumulative frequency should equal N
  (the total of the frequency column)



                                                         112
Cumulative frequency…
Cumulative Frequency Distribution for High Temperatures
Temperature   Tally    Frequency   Cumulative Frequency
51            ////     4           30
50            ////     4           26
49            //// /   6           22
48                     0           16
47            ///      3           16
46            ///      3           13
45            ////     4           10
44            ///      3           6
43            ///      3           3
              N =      30
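
The cumulative column is a running total of the frequency column, starting from the lowest temperature. A minimal sketch, with variable names of our own choosing:

```python
from itertools import accumulate

# Values and frequencies from the table above, listed from lowest to highest
values = [43, 44, 45, 46, 47, 48, 49, 50, 51]
freqs = [3, 3, 4, 3, 3, 0, 6, 4, 4]

# Running total: each entry is the sum of all frequencies up to that value
cum_freqs = list(accumulate(freqs))

for value, freq, cum in zip(values, freqs, cum_freqs):
    print(f"{value:<6}{freq:>4}{cum:>6}")

# The highest cumulative frequency should equal N
assert cum_freqs[-1] == sum(freqs)
```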


                                                           113
Grouped frequency distribution
To create a grouped frequency distribution:
• select an interval size so that you have 7-20 class intervals
    (the number of classes can also be chosen using Sturges' rule)
• create a class interval column and list each of the class
  intervals
• all intervals must be the same size, they must not overlap, and
  there must be no gaps within the range of class intervals
• create a tally column (optional)
• create a midpoint column for interval midpoints
• create a frequency column
• enter N = some value at the bottom of the frequency
  column
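
Sturges' rule is commonly stated as k = 1 + 3.322 log10(n), where n is the number of observations and k the suggested number of classes. The sketch below applies it to n = 50 whole-number observations that are assumed, for illustration, to range from 39 to 59; the function names are hypothetical.

```python
import math


def sturges_classes(n):
    """Suggested number of classes for n observations (rounded up)."""
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(lowest, highest, k):
    """Equal interval size for whole-number data, counting both end values."""
    return math.ceil((highest - lowest + 1) / k)


k = sturges_classes(50)        # number of class intervals
width = class_width(39, 59, k)  # size of each interval
print(k, width)
```

The rule is only a guide: the resulting number of classes is usually adjusted so that the class limits are convenient to read.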

                                                             114
Grouped frequency for temperature data
(N = 50 here, so this table summarizes a different, larger set of
temperature readings than the 30-day data above)

Grouped Frequency Distribution for High Temperatures
Class Interval   Tally          Interval Midpoint   Frequency
57-59            //////         58                  6
54-56            ///////        55                  7
51-53            ///////////    52                  11
48-50            /////////      49                  9
45-47            ///////        46                  7
42-44            //////         43                  6
39-41            ////           40                  4
                                N =                 50



                                                       115
Cumulative grouped frequency distribution
We just add a cumulative frequency column to the grouped
frequency distribution and we have a cumulative grouped
frequency distribution, as shown below.

Cumulative Grouped Frequency Distribution for High Temperatures
Class Interval   Tally          Interval Midpoint   Frequency   Cumulative Frequency
57-59            //////         58                  6           50
54-56            ///////        55                  7           44
51-53            ///////////    52                  11          37
48-50            /////////      49                  9           26
45-47            ///////        46                  7           17
42-44            //////         43                  6           10
39-41            ////           40                  4           4
                                N =                 50




                                                                        116
Relative Frequency
• Sometimes it is useful to compute the proportion, or percentage, of
   observations in each category.
• The relative frequency of a particular category is the proportion (fraction)
   of observations that fall into that category.
• The cumulative relative frequency (or cumulative proportion) is the sum
   of the relative frequencies of all categories up to and including a
   particular category.
    – It is the relative frequency of items less than or equal to the upper class
       limit of each class.
• Cumulative frequencies can be used for quantitative data and for
   categorical (qualitative) data, but only if the latter are ordinal.
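
As a worked illustration, the relative and cumulative relative frequencies of the grouped temperature table (N = 50) can be computed as follows. This is a sketch with variable names of our own; the classes are listed from lowest to highest.

```python
from itertools import accumulate

classes = ["39-41", "42-44", "45-47", "48-50", "51-53", "54-56", "57-59"]
freqs = [4, 6, 7, 9, 11, 7, 6]
n = sum(freqs)

# Relative frequency: the fraction of observations in each class
rel_freqs = [f / n for f in freqs]

# Cumulative relative frequency: fraction of observations at or below
# the upper limit of each class
cum_rel_freqs = [c / n for c in accumulate(freqs)]

for label, rel, cum in zip(classes, rel_freqs, cum_rel_freqs):
    print(f"{label:<8}{rel:>6.2f}{cum:>6.2f}")
```

Multiplying the proportions by 100 gives the same information as percentages.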

                                                                           117
Characteristics and guidelines of table
                   construction

Characteristics
• A table must be self-explanatory

• The title should describe the content of the table and should answer
  the questions what, where and when the data were collected
• Percentages across the categories should add up to 100

• Foot notes should be placed at the bottom of the table



                                                                118
Guidelines

• The shape and size of the table should provide the required
  number of rows and columns to accommodate all the data
• If a quantity is zero, it should be entered as zero; leaving a
  blank space or putting a dash in place of zero is confusing and
  undesirable
• If two or more figures are the same, ditto marks should not
  be used in place of the original numerals
• If any figure in a table has to be singled out for a particular
  purpose, it should be marked with an asterisk

                                                                119
2.3. Diagrammatic Representation
2.3.1. Importance of diagrammatic representation:

1.Diagrams have greater attraction than mere figures. They give
   delight to the eye, add a spark of interest and as such catch the
   attention as much as the figures dispel it.

2.They help in deriving the required information in less time and
   without any mental strain.

3.They are remembered more easily than mere figures, because
   the impression left by a diagram is of a lasting nature.

4.They facilitate comparison

                                                                   120
Importance….

Well designed graphs can be an incredibly powerful means of
  communicating a great deal of information using visual
  techniques

When graphs are poorly designed, they not only fail to convey
 your message effectively but often mislead and confuse.




                                                                    121
2.3.2.Types
                       1. Bar graph
•Bar diagram is the easiest and most adaptable general
  purpose chart.
•Though this type of chart can be used for any type of series,
  it is especially satisfactory for nominal and ordinal data.
•The categories are represented on the base line (X-axis) at
  regular intervals and the corresponding frequencies
  or relative frequencies are represented on the Y-axis (ordinate)
  in the case of a vertical bar diagram, and vice versa in the case
  of a horizontal bar diagram.


                                                             122
Method of constructing bar graph
•All bars drawn in any single study should be of the same width
•The different bars should be separated by equal distances
•All the bars should rest on the same line called the base
•It is better to construct a diagram on a graph paper

Types of bar graph
• 1. Simple bar graph: a one-dimensional diagram in which each
  bar represents the whole of a magnitude. The height/length of
  each bar indicates the frequency of the category it represents.
Example: Construct a bar graph for the following data



                                                                  123
Table__, Distribution of pediatric patients in X hospital ward by type of
                     admitting diagnosis, Jan 2000

   Diagnosis                Number of patients        Relative freq (%)
   Pneumonia                487                       48.7
   Malaria                  200                       20.0
   Cardiac problems         168                       16.8
   Malnutrition             80                        8.0
   Others                   65                        6.5
   Total                    1000                      100
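
A rough text version of the simple bar graph for this table can be produced as below. This is only a sketch: the scale of one `#` per 2 percentage points is an arbitrary choice made for the example.

```python
# Number of patients by admitting diagnosis (from the table above)
diagnoses = {
    "Pneumonia": 487,
    "Malaria": 200,
    "Cardiac problems": 168,
    "Malnutrition": 80,
    "Others": 65,
}
total = sum(diagnoses.values())

# Relative frequency (%) of each diagnosis
rel_freq = {name: 100 * count / total for name, count in diagnoses.items()}

# All bars start on the same base line and use the same scale
for name, pct in rel_freq.items():
    bar = "#" * round(pct / 2)  # one '#' per 2 percentage points
    print(f"{name:<18}{bar} {pct:.1f}%")
```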




                                                                            124
2. Sub-divided (component) bar graph

• It is also called a segmented bar graph. If a given magnitude can be
  split up into subdivisions, or if there are different quantities
  forming the subdivisions of the totals, simple bars may be
  subdivided in the ratio of the various subdivisions to exhibit the
  relationship of the parts to the whole.

• The order in which the components are shown in a "bar" is
  followed in all bars used in the diagram.




                                                                  126
3. Multiple bar graph

Multiple Bar diagrams can be used to represent the
 relationships among more than two variables.
The following figure shows the relationship
 between children’s reports of breathlessness and
 cigarette smoking by themselves and their
 parents.



                                                128
3. Multiple bar graph…

• We can quickly see from the graph that the prevalence of the
  symptom increases both with the child's own smoking and with that of
  their parents.




                                                                    130
2. Pie chart
A pie chart shows the relative frequency for each category by dividing
   a circle into sectors, the angles of which are proportional to the
   relative frequencies.

Steps to construct a pie chart
• Construct a frequency table
• Change the frequencies into percentages (P)
• Change the percentages into degrees, where: degrees =
   (P / 100) × 360°
• Draw a circle and divide it accordingly


                                                                   131
2. Pie chart…

  Example: Distribution of deaths for females in England and Wales, 1989.

   Cause of death                        Number of deaths
   Circulatory system (C)                100,000
   Neoplasm (N)                          70,000
   Respiratory system (R)                30,000
   Injury & poisoning (I)                6,000
   Digestive system (D)                  10,000
   Others (O)                            20,000
   Total                                 236,000
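
For this example, the percentage and sector angle of each cause of death can be worked out as in the sketch below. The angle of a sector is its share of the total multiplied by 360 degrees; the variable names are our own.

```python
# Number of deaths by cause (from the table above)
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}
total = sum(deaths.values())

# Percentage of all deaths, and the matching sector angle in degrees
percentages = {cause: 100 * n / total for cause, n in deaths.items()}
angles = {cause: pct / 100 * 360 for cause, pct in percentages.items()}

for cause in deaths:
    print(f"{cause:<20}{percentages[cause]:>6.1f}%{angles[cause]:>8.1f} degrees")
```

The angles add up to 360 degrees, which is a useful check before drawing the circle.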




                                                                            132
3.Histogram
Histograms are frequency distributions with continuous class
  intervals that have been turned into graphs.
To construct a histogram, we draw the interval boundaries on a
  horizontal line and the frequencies on a vertical line.
Non-overlapping intervals that cover all of the data values must be
  used.
Bars are then drawn over the intervals in such a way that the areas
  of the bars are all proportional in the same way to their interval
  frequencies.



                                                                  134
Example: Distribution of the RBC cholinesterase values
(µmol/min/ml) obtained from 35 workers exposed to pesticides
      RBC cholinesterase (µmol/min/ml)   Frequency, n (%)   Cumulative frequency (%)
      5.95-7.95                          1 (2.9)            2.9
      7.95-9.95                          8 (22.9)           25.8
      9.95-11.95                         14 (40)            65.8
      11.95-13.95                        9 (25.7)           91.5
      13.95-15.95                        2 (5.7)            97.2
      15.95-17.95                        1 (2.9)            100
      Total                              35 (100)

      Source: Knapp RG, Miller MC III: Clinical Epidemiology and biostatistics
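
The columns of this table, and the class mid-points used to label a histogram, can be recomputed from the class boundaries and frequencies. This is a minimal sketch with variable names of our own.

```python
from itertools import accumulate

# Class boundaries and frequencies from the table above
boundaries = [5.95, 7.95, 9.95, 11.95, 13.95, 15.95, 17.95]
freqs = [1, 8, 14, 9, 2, 1]
n = sum(freqs)

# Mid-point of each class: the average of its two boundaries
midpoints = [round((lo + hi) / 2, 2) for lo, hi in zip(boundaries, boundaries[1:])]

# Percentage and cumulative percentage of workers in each class
percents = [round(100 * f / n, 1) for f in freqs]
cum_percents = [round(100 * c / n, 1) for c in accumulate(freqs)]

for mid, f, pct, cum in zip(midpoints, freqs, percents, cum_percents):
    print(f"{mid:>6}{f:>4}{pct:>7}{cum:>7}")
```

Cumulative percentages computed from the cumulative counts can differ by 0.1 from those obtained by adding the rounded percentages, as the table does (for example 25.7 rather than 25.8).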




                                                                                        135
3.Histogram…

[Figure: Histogram of the RBC cholinesterase values of 35 pesticide exposed workers. Horizontal axis: RBC cholinesterase (µmol/min/ml), class mid-points 6.95 to 16.95; vertical axis: number of pesticide exposed workers, 0 to 16.]

                                                                                                            136
4.Frequency polygon
A frequency distribution can also be portrayed graphically
    by means of a frequency polygon.
•To draw a frequency polygon we connect the mid-points of the tops of
    the bars of the histogram with straight lines.
•It can also be drawn without erecting rectangles, as follows:

   – Mark the horizontal scale with the numerical values of the
     mid-points of the intervals.
   – Erect an ordinate on the mid-point of each interval; the length (altitude)
     of the ordinate represents the frequency of the class on whose
     mid-point it is erected.
   – Join the tops of the ordinates and extend the connecting line at both
     ends to the horizontal scale.

                                                                       137
4.Frequency polygon…




                       138
5.Cumulative frequency polygon (ogive curve)
Sometimes it may become necessary to know the number of items
  whose values are more or less than a certain amount.
•We may, for example, be interested in knowing the number of
  patients whose weight is less than 50 kg or more than, say, 60 kg.
•To get this information it is necessary to change the form of the
  frequency distribution from a ‘simple’ to a ‘cumulative’
  distribution.
•The ogive curve turns a cumulative frequency distribution into a
  graph.



                                                                139
5.Cumulative frequency polygon (ogive curve)…
Example: Heart rate of patients admitted to Hospital B, 2000
  Heart rate          No. of patients   Cumulative freq.,      Cumulative freq.,
  (beats/min)                           less than method       greater than method
  54.5-59.5                    1                 1                        54
  59.5-64.5                    5                 6                        53
  64.5-69.5                    3                 9                        48
  69.5-74.5                    5                 14                       45
  74.5-79.5                    11                25                       40
  79.5-84.5                    16                41                       29
  84.5-89.5                    5                 46                       13
  89.5-94.5                    5                 51                       8
  94.5-99.5                    2                 53                       3
  99.5-104.5                   1                 54                       1
  Total                        54
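
The two cumulative columns of the heart-rate table can be rebuilt from the class frequencies alone. The following is a minimal Python sketch; the variable names are illustrative and not part of the original slides.

```python
from itertools import accumulate

# Class frequencies from the heart-rate table, lowest class first.
freqs = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

# "Less than" method: running total starting from the lowest class.
less_than = list(accumulate(freqs))

# "Greater than" method: running total starting from the highest class.
greater_than = list(accumulate(reversed(freqs)))[::-1]

print(less_than)
print(greater_than)
```

Plotting `less_than` against the upper class boundaries, or `greater_than` against the lower class boundaries, would give the two ogive curves.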


                                                                                           140
5.Cumulative frequency polygon (ogive curve)
                    …




                                           141
6.Box-and-whisker plot
It is another way to display information when the objective is to
   illustrate certain location in the distribution.
A box is drawn with the top of the box at the third quartile and the
   bottom at the first quartile.
The location of the midpoint of the distribution is indicated with a
   horizontal line in the box.
Finally, straight lines or whiskers are drawn from the center of the
   top of the box to the largest observation and from the center of
   the bottom of the box to the smallest observation.
It is useful when one of the characteristics is qualitative and the other is
   quantitative.



                                                                   142
Eg: percentage super saturation of bile by sex of patients
                        Men                              Women
         Subject          Age   %Super        Subject        Age   %Super
                                saturation                         saturation
                   1      23            40              1    40    65
                   2      31            86              2    33    86
                   3      58            11              3    49    76
                   4      25            86              4    44    89
                   5      63            106             5    63    142
                   6      43            66              6    27    58
                   7      67            123             7    23    98
                   8      48            90              8    56    146
                   9      29            112             9    41    80
                   10     26            52              10   30    66
                   11     64            88              11   38    52
                   12     55            137             12   23    35
                   13     31            88              13   35    55
                   14     20            80              14   50    127
                   15     23            65              15   47    77
                   16     43            79              16   36    91
                   17     27            87              17   74    128
                   18     63            56              18   53    75
                   19     59            110             19   41    82
                   20     53            106             20   25    89
                   21     66            110             21   57    84
                   22     48            78              22   42    116
                   23     27            80              23   49    73
                   24     32            47              24   60    87
                   25     62            74              25   23    76
                   26     36            58              26   48    107
                   27     29            88              27   44    84
                   28     27            73              28   37    120
                   29     65            118             29   57    123
                   30     42            67
                   31     60            57
                                                                                143
Box-and-whisker plot…




                        144
Box-and-whisker plot
• The graphs indicate the similarity of the distribution
  between the percentage saturation of bile in men and
  women.

•Again, we see that percentage saturation of bile is a bit more
  spread out among women with range 35 to 146 but we see
  also that the mid-points of the distributions are almost the
  same and that most of the spread in values in women
  occurs in the upper half of the distribution.




                                                            145
7.Scatter plot
Most studies in medicine involve measuring more than one
  characteristic, and graphs displaying the relationship between
  two characteristics are common in the literature.

• To illustrate the relationship between two characteristics when
   both are quantitative variables we use bivariate plots (also called
   scatter plots or scatter diagrams).
A scatter diagram is constructed by drawing X-and Y-axes.
 •Each observation is represented by a point or dot(•).
•In the same study on percentage saturation of bile, information
   was collected on the age of each patient to see whether a
   relationship existed between the two measures, the following
   plot was displayed.
                                                                    146
7.Scatter plot…




The graph suggests the possibility of a positive relationship
between age and percentage saturation of bile in women.

                                                                147
8.Line graph
In this type of graph, we have two variables under consideration
   like that of scatter diagram.
•A variable is taken along X-axis and the other along Y-axis.
•The points are plotted and joined by line segments in order.
•These graphs depict the trend or variability occurring in the data.
•Sometimes two or more graphs are drawn on the same graph
   paper taking the same scale so that the plotted graphs are
   comparable.
Example:
The following graph shows level of zidovudine(AZT) in the blood
   of AIDS patients at several times after administration of the
   drug, with normal fat absorption and with fat mal absorption.


                                                                  148
Response to administration of zidovudine in two
  groups of AIDS patients in hospital X, 1999.




                                                  149
Data Summarization (Numeric
        Summary)




                              150
Measures of central tendency

On the scale of values of a variable there is a certain stage at which
  the largest number of items tend to cluster.

Since this stage is usually in the centre of distribution, the tendency
   of the statistical data to get concentrated at certain values is
   called “central tendency”

The various methods of determining the actual value at which the
  data tends to concentrate are called measures of central tendency.



                                                                     151
Measures of central tendency…
The most important objective of calculating measure of central
  tendency is to determine a single figure which may be used
  to represent a whole series involving magnitude of the same
  variable.

In that sense it is an even more compact description of the
   statistical data than the frequency distribution.
•Since a measure of central tendency represents the entire data,
   it facilitates comparison within one group or between
   groups of data.



                                                              152
Measures of central tendency…
Characteristics of a good measure of central tendency
A measure of central tendency is good or satisfactory if it
   possesses the following characteristics.
1.It should be based on all the observations
2.It should not be affected by the extreme values
3.It should be as close to the maximum number of values as
   possible
4.It should have a definite value
5.It should not be subjected to complicated and tedious
   calculations
6.It should be capable of further algebraic treatment
7.It should be stable with regard to sampling

                                                         153
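Arithmetic mean ($\bar{x}$)
The most familiar measure of central tendency (MCT) is the arithmetic
     mean (AM). It is also popularly known as the average.
a) Ungrouped data
If x1, x2, ..., xn are n observed values,
Then:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$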




                                                          154
Arithmetic mean…
b) Grouped data. In calculating the mean from grouped data, we
   assume that all values falling into a particular class interval are
   located at the mid-point of the interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

       where, k = the number of class intervals
       m_i = the mid-point of the ith class interval
       f_i = the frequency of the ith class interval
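
As an illustration of the grouped-data mean, the sketch below applies the mid-point assumption to the RBC cholinesterase frequency table shown earlier in these slides. It is a minimal Python sketch; the function name `grouped_mean` is a hypothetical helper, not something defined in the course.

```python
# Class boundaries and frequencies from the RBC cholinesterase table.
boundaries = [(5.95, 7.95), (7.95, 9.95), (9.95, 11.95),
              (11.95, 13.95), (13.95, 15.95), (15.95, 17.95)]
freqs = [1, 8, 14, 9, 2, 1]

def grouped_mean(boundaries, freqs):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    # Every value in a class is assumed to sit at the class mid-point.
    midpoints = [(lo + hi) / 2 for lo, hi in boundaries]
    total = sum(m * f for m, f in zip(midpoints, freqs))
    return total / sum(freqs)

mean = grouped_mean(boundaries, freqs)
print(round(mean, 2))  # 11.29
```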

                                                                     155
Arithmetic mean…
           Example.




Mean = 2630/100 = 26.3



                         156
Arithmetic mean…
• The arithmetic mean possesses the following properties.
• Uniqueness: For given set of data there is one and only one
  arithmetic mean.
• Simplicity: The arithmetic mean is easily understood and
  easy to compute.
• Center of gravity: Algebraic sum of the deviations of the
  given values from their arithmetic mean is always zero.
• Sensitivity: The arithmetic mean possesses all the
  characteristics of a good measure of central tendency except
  No. 2: it is greatly affected by extreme values.
• In the case of grouped data, if any class interval is open, the
  arithmetic mean cannot be calculated.
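
The centre-of-gravity and sensitivity properties can be checked numerically. The following is a minimal Python sketch; the five values are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical observations, not taken from the course data.
data = [2, 4, 6, 8, 10]
m = mean(data)

# Centre of gravity: the deviations from the mean sum to zero.
deviation_sum = sum(x - m for x in data)

# Sensitivity: replacing the largest value with an extreme one shifts the mean.
with_outlier = [2, 4, 6, 8, 100]
m_outlier = mean(with_outlier)

print(m, deviation_sum)   # 6 0
print(m_outlier)          # 24
```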


                                                            157
The Median ($\tilde{x}$)
• a) Ungrouped data
•The median of a finite set of values is that value which divides the
   set of values into two equal parts such that the number of values
   greater than the median is equal to the number of values less
   than the median.
•If the number of values is odd, the median will be the middle value
   when all values have been arranged in order of magnitude.
•When the number of observations is even, there is no single
   middle observation but two middle observations.
     •In this case the median is taken to be the mean of these two
       middle observations, when all observations have been
       arranged in order of magnitude.


                                                                   158
The Median…

b) Grouped data
• In calculating the median from grouped data, we assume that the
  values within a class-interval are evenly distributed through the
  interval.
• The first step is to locate the class interval in which the median lies.
  We use the following procedure.
• Find n/2 and locate the first class interval whose cumulative
  frequency is at least n/2.
• To find a unique median value, use the following interpolation
  formula.



                                                                     159
Median…

$$\text{Median} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

Where, Lm = lower true class boundary of the interval containing
the median
Fc = cumulative frequency of the interval just before (listed just above)
the median class interval
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations




                                                                 160
Median…..
                     Example




n/2 = 75/2 = 37.5
Median class interval = 35-44
Lm=34.5 ,Fc= 35, W = 10, n = 75,fm=22
•Median = 34.5 + (37.5-35)/22 x 10 = 35.64
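
The worked example above can be reproduced with a short function. This is a minimal Python sketch using the symbols defined for the interpolation formula (Lm, Fc, fm, W, n); the function name `grouped_median` is hypothetical.

```python
def grouped_median(L_m, F_c, f_m, W, n):
    """Interpolated median for grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    return L_m + ((n / 2 - F_c) / f_m) * W

# Values from the example: median class 35-44, n = 75.
median = grouped_median(L_m=34.5, F_c=35, f_m=22, W=10, n=75)
print(round(median, 2))  # 35.64
```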


                                             161
Properties of the median
• There is only one median for a given set of data
• The median is easy to calculate
• Median is a positional average and hence it is not drastically
  affected by extreme values
• Median can be calculated even in the case of open end
  intervals
• It is not a good representative of data if the number of
  items is small




                                                              162
Mode ($\hat{x}$)
a) Ungrouped data
•It is a value which occurs most frequently in a set of values.
•If all the values are different there is no mode, on the other hand, a
   set of values may have more than one mode.

b) Grouped data
• In designating the mode of grouped data, we usually refer to the
   modal class, where the modal class is the class interval with the
   highest frequency.
• If a single value for the mode of grouped data must be specified,
   it is taken as the mid point of the modal class interval.



                                                                      163
Properties of mode

•   It is not affected by extreme values
•   It can be calculated for distributions with open end classes
•   Often its value is not unique
•   The main drawback of mode is that often it does not exist




                                                                   164
MEASURES OF POSITIONS
                       Quartiles
• Divide the distribution into four equal parts. The 25th percentile
   demarcates the first quartile (Q1),
• the median or 50th percentile demarcates the second quartile (Q2),
• the 75th percentile demarcates the third quartile (Q3),
• and the 100th percentile demarcates the fourth quartile (Q4), which is
   the maximum observation.
 
Q1 is the ¼(n+1)th measurement, i.e., 25% of all the ranked observations
   are less than Q1.
Q2 is the 2/4(n+1)th = ((n+1)/2)th measurement, i.e., 50% of all ranked
   observations are less than Q2. The rank of Q2 is twice the rank of Q1.
Q3 is the ¾(n+1)th measurement, so its rank is three times the rank of Q1. It
   indicates that 75% of all the ranked observations are less than Q3.
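
The rank positions given by the (n+1) rule can be illustrated as follows. This is a minimal Python sketch on a hypothetical sample of 11 values, chosen so that every rank is a whole number; how to handle fractional ranks is not specified in the slides and is left out here.

```python
def quartile_ranks(n):
    """Ranks of Q1, Q2 and Q3 under the (n+1) rule: k/4 * (n+1), k = 1, 2, 3."""
    return [k * (n + 1) / 4 for k in (1, 2, 3)]

# Hypothetical observations, sorted into ascending order.
values = sorted([12, 15, 11, 19, 14, 18, 13, 17, 16, 21, 20])

r1, r2, r3 = quartile_ranks(len(values))
# Ranks are 1-based, Python lists are 0-based.
q1, q2, q3 = (values[int(r) - 1] for r in (r1, r2, r3))

print(r1, r2, r3)   # 3.0 6.0 9.0
print(q1, q2, q3)   # 13 16 19
```

Note that it is the ranks (3, 6, 9) that stand in the ratio 1:2:3, not the quartile values themselves.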
 
                                                                        165
Percentile
• Percentiles divide the ordered data into 100 equal parts.
• The largest value in a set of data has 100% of the observations at or
  below it. When we consider it in this way, we call it the 100th
  percentile.
• From this same perspective, the median, which has 50% of the
  observations at or below it, is the 50th percentile.
• The pth percentile of a distribution is the value such that p percent
  of the observations are less than or equal to it.
The pth percentile value depends on whether np/100 is an integer or
  not:
• If np/100 is not an integer, it is the (k+1)th ordered sample point
  (counting from the smallest), where k is the largest integer less than np/100.
• If np/100 is an integer, it is the average of the (np/100)th and
  (np/100+1)th ordered observations.

                                                                    166
Percentiles…
Example: The following data are a sample of birth weights (grams) of live births
   at a hospital during a one-week period.
  3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759, 3609, 2069, 3260,
  3314, 3541, 3649, 3484, 2834, 2841, 3031.
Calculate the 10th and 90th percentiles.
Solution: n = 20; p = 10 and 90. First put the data in ascending order:
 
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265,
    3314, 3323, 3484, 3541, 3609, 3649, 4146.
 
10th percentile: np/100 = 20 x 10/100 = 2, which is an integer. So the 10th
    percentile is the average of the 2nd and 3rd ordered observations:
    (2581 + 2759)/2 = 2670 grams.

90th percentile: np/100 = 20 x 90/100 = 18, which is an integer. So the 90th
   percentile is the average of the 18th and 19th ordered observations:
   (3609 + 3649)/2 = 3629 grams.
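
The np/100 rule can be written as a short function and checked against the birth-weight example. This is a minimal Python sketch; the function name `percentile` is a hypothetical helper, and it deliberately follows the rule in these slides rather than the interpolation methods used by statistical packages.

```python
def percentile(data, p):
    """pth percentile (0 < p < 100) using the np/100 rule from the text."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    ordered = sorted(data)
    k, remainder = divmod(len(ordered) * p, 100)
    k = int(k)
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise take the (k+1)-th ordered value (index k, since lists are 0-based).
    return ordered[k]

weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
           3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

p10 = percentile(weights, 10)
p90 = percentile(weights, 90)
print(p10, p90)  # 2670.0 3629.0
```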

                                                                              167
Percentiles…

• Therefore, we would say that 80 percent of the birth weights
  fall between 2670 g and 3629 g, which gives us an overall
  feel for the spread of the distribution.
 
• The most commonly used percentiles other than the median (50th
  percentile) are the 25th percentile and the 75th percentile.




                                                              168
Measures of variability

• The measure of central tendency alone is not enough to have a
  clear idea about the distribution of the data.
• Moreover, two or more sets may have the same mean and/or
  median but they may be quite different.
• Thus, to have a clear picture of the data, one needs to have a measure
  of dispersion or variability (scatter) amongst the observations
  in the set.




                                                                 169
Range (R)


R = XL - XS,
where
• XL is the largest value and XS is the smallest value.
• Properties
• It is the simplest measure and can be easily understood
• It takes into account only two values which causes it to be a poor
  measure of dispersion




                                                                  170
Interquartile range (IQR)

IQR = Q3 - Q1,
where
Q3 is the third quartile and Q1 is the first quartile.
Example: Suppose the first and third quartiles for weights of girls 12
    months of age are 8.8 kg and 10.2 kg respectively. The
    interquartile range is therefore
IQR = 10.2 kg - 8.8 kg = 1.4 kg,
i.e., 50% of infant girls at 12 months weigh between 8.8 and 10.2
    kg.



                                                                   171
Interquartile range …




                       172
Interquartile…
• Generally, we use interquartile range to describe variability when
  we use the median as the measure of central location. We use the
  standard deviation, which is described in the next section, when
  we use the mean.
Properties
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two specific values
• It is important in selecting cut-off points in the formulation of
  clinical standards
• Since it excludes the lowest and highest 25% values, it is not
  affected by extreme values
• It is not capable of further algebraic treatment
                                                                   173
Quartile deviation (QD)

$$QD = \frac{Q_3 - Q_1}{2}$$

Coefficient of quartile deviation (CQD)

$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

 CQD is a pure number (unit-less) and is useful to
 compare the variability among the middle 50% of
 observations.
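
The quartile-based measures can be computed together. The sketch below is a minimal Python example that assumes the usual definitions QD = (Q3 - Q1)/2 and CQD = (Q3 - Q1)/(Q3 + Q1), and reuses the quartiles from the infant-weight example (8.8 kg and 10.2 kg); the function name is hypothetical.

```python
def quartile_measures(q1, q3):
    """Return (IQR, QD, CQD) from the first and third quartiles."""
    iqr = q3 - q1
    qd = iqr / 2             # quartile deviation (half the IQR)
    cqd = iqr / (q3 + q1)    # coefficient of quartile deviation, unit-less
    return iqr, qd, cqd

iqr, qd, cqd = quartile_measures(q1=8.8, q3=10.2)
print(round(iqr, 2), round(qd, 2), round(cqd, 4))  # 1.4 0.7 0.0737
```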

                                                            174
Mean deviation (MD)
•Mean deviation is the average of the absolute deviations taken
  from a central value, generally the mean or median.
•Consider a set of n observations x1, x2, ..., xn.
Then,

$$MD = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - A \rvert$$

    where A is a central value (arithmetic mean or median).




                                                               175
Mean deviation …
Properties
• MD removes one main objection to the earlier measures, in that it
  involves every value
• It is not affected much by extreme values
• Its main drawback is that algebraic negative signs of the
  deviations are ignored which is mathematically unsound
• MD is minimum when the deviations are taken from median.
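
The last property can be checked on a small example. This is a minimal Python sketch; the data are hypothetical and `mean_deviation` is an illustrative helper name.

```python
from statistics import mean, median

def mean_deviation(values, centre):
    """Average absolute deviation of the values from a chosen central value."""
    return sum(abs(x - centre) for x in values) / len(values)

data = [3, 5, 8, 10, 24]  # hypothetical observations

md_about_mean = mean_deviation(data, mean(data))      # centre = 10
md_about_median = mean_deviation(data, median(data))  # centre = 8

print(md_about_mean, md_about_median)  # 5.6 5.2
```

Here the deviation about the median (5.2) is smaller than the deviation about the mean (5.6), in line with the property stated above.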




                                                              176
The Variance (σ², S²)

• The main objection of mean deviation, that the negative signs are
  ignored, is removed by taking the square of the deviations from
  the mean.

• The variance is the average of the squares of the deviations taken
  from the mean.




                                                                  177
Variance…
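a) Ungrouped data
Let X1, X2, ..., XN be the measurements on N population units; then the
population variance is:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$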




                                                                      178
Variance…

The sample variance of the set x1, x2, ..., xn of n observations is:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$




                                                                       179
Variance…
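b) Grouped data

$$S^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}$$

where m_i is the mid-point and f_i the frequency of the ith class interval, and k is the number of class intervals.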




                             180
Variance…

Properties
• The main demerit of variance is that its unit is the square of the
  unit of measurement of the variate values
• The variance gives more weight to the extreme values than
  to those near the mean, because the
  differences are squared.
• The drawbacks of variance are overcome by the standard
  deviation.




                                                                   181
Standard deviation (σ, S)
It is the positive square root of the variance:

$$\sigma = \sqrt{\sigma^2}, \qquad S = \sqrt{S^2}$$




 Properties

 •Standard deviation is considered to be the best measure of
 dispersion and is used widely because of the properties of the
 theoretical normal curve.
 •There is, however, one difficulty with it. If the units of
 measurement of the variables in two series are not the same, then
 their variability cannot be compared by comparing the values of
 the standard deviations.
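
The difference between the population and sample variance (divisor N versus n - 1) and the link to the standard deviation can be seen in a short computation. This is a minimal Python sketch on hypothetical data; the hand-written functions are checked against Python's standard `statistics` module.

```python
import math
import statistics

def population_variance(values):
    """Average squared deviation from the mean (divisor N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [4, 8, 6, 5, 7]  # hypothetical observations, mean = 6

pop_var = population_variance(data)   # 10 / 5 = 2.0
samp_var = sample_variance(data)      # 10 / 4 = 2.5
samp_sd = math.sqrt(samp_var)         # positive square root of the variance

print(pop_var, samp_var, round(samp_sd, 3))
```

The standard deviation is back in the original units of measurement, which is why it is usually reported instead of the variance.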
                                                                182
Coefficient of variation
• When we desire to compare the variability in two sets of
  data, the standard deviation, which measures the absolute
  variation, may lead to false results.

• The coefficient of variation gives relative variation & is the
  best measure used to compare the variability in two sets of
  data. Do not rely on SD alone to compare variability between groups
  measured in different units or with very different means.

• CV = standard deviation / mean
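
The point about comparing variables measured in different units can be illustrated as follows. This is a minimal Python sketch; the summary statistics are hypothetical and the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Relative variation: standard deviation divided by the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean

# Hypothetical summary statistics for one group of patients.
cv_weight = coefficient_of_variation(sd=6.0, mean=60.0)     # weight in kg
cv_height = coefficient_of_variation(sd=8.25, mean=165.0)   # height in cm

print(round(cv_weight, 3), round(cv_height, 3))  # 0.1 0.05
```

Although the SD of height (8.25 cm) is numerically larger than the SD of weight (6 kg), the relative variation of weight is twice that of height in this made-up example.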




                                                                   183
4.Basic Probability and probability
                    distributions
• Probability is a mathematical technique for predicting
  outcomes. It predicts how likely it is that specific events will
  occur.
• An understanding of probability is fundamental for quantifying
   the uncertainty that is inherent in the decision-making process
• Probability theory also allows us to draw conclusions about a
   population of patients based on known information about a
   sample of patients drawn from that population.


                                                                     184
Basic Probability…
• Mutually exclusive events: Events that cannot occur together
   – For example, event A=“Male” and B=“Pregnant” are two
      mutually exclusive events (as no males can be pregnant).
• Independent events: The presence or absence of one does not
  alter the chance of the other being present.
   – one event happens regardless of the other, and its outcome is
      not related to the other.
• Probability: If an event can occur in N mutually exclusive and
  equally likely ways, and if m of these possess a characteristic E,
  the probability of the occurrence of E is P(E) = m/N.



                                                                  185
4.1.Properties of probability
1.A probability value must lie between 0 and 1, 0≤P(E)≤1.
    A probability can never be more than 1.0, nor can it be negative.
• A value of 0 means the event cannot occur
• A value of 1 means the event definitely will occur
• A value of 0.5 means that the probability that the event will
  occur is the same as the probability that it will not occur.
• Probability is measured on a scale from 0 to 1.0, as shown in
  the following figure of the probability scale.




                                                                        186
Properties…




Fig.___
                        187
Properties…
2. The sum of the probabilities of all mutually exclusive and exhaustive
   outcomes is equal to 1.
    P(E1) + P(E2) + .... + P(En) = 1
3. For any two events A and B,
    P(A or B) = P(A) + P(B) -P(A and B)
    (Addition rule)
    For two mutually exclusive events A and B,
    P(A or B ) = P(A) + P(B).
4. For any two independent events A and B
    – P(A and B) = P(A) P(B).
    (Multiplication rule)

                                                                   188
Properties…
• To calculate the probability of both event A and event B happening
  when the events are independent, multiply their probabilities. For
  example, if you have two identical packs of cards (pack A and pack B),
  what is the probability of drawing the ace of spades from both packs?
• Formula: P(A) x P(B)
   P(pack A) = 1 card, from a pack of 52 cards = 1/52 = 0.0192
   P(pack B) = 1 card, from a pack of 52 cards = 1/52 = 0.0192
   P(A) x P(B) = 0.0192 x 0.0192 = 0.00037


5. If A’ is the complementary event of the event A,
     Then, P(A’) = 1 -P(A).
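
The multiplication, addition and complement rules can be checked with exact fractions, using the two-pack card example. This is a minimal Python sketch; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(1, 52)   # ace of spades drawn from pack A
p_b = Fraction(1, 52)   # ace of spades drawn from pack B

# Multiplication rule for independent events: P(A and B) = P(A) P(B).
p_both = p_a * p_b

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_either = p_a + p_b - p_both

# Complement rule: P(A') = 1 - P(A).
p_not_a = 1 - p_a

print(p_both, round(float(p_both), 5))  # 1/2704 0.00037
print(p_either, p_not_a)                # 103/2704 51/52
```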


                                                                 189
Example
• A study investigated the effect of prolonged exposure to bright
   light on retina damage in premature infants. Eighteen of 21
   premature infants exposed to bright light developed retinopathy,
   while 21 of 39 premature infants exposed to reduced light levels
   developed retinopathy. For this sample, the probability of
   developing retinopathy is:
P(Retinopathy) = No. of infants with retinopathy / Total No. of infants
= (18 + 21) / (21 + 39) = 39/60 = 0.65



                                                                 190
Example…

• The following data are the results of electrocardiograms (ECGs)
  and radionuclide angiocardiograms(RAs) for 19 patients with
  post-traumatic myocardial contusions. A “+”indicates abnormal
  results and a “-”indicates normal results.
• 1. Calculate the probability that both the ECG and the RA are abnormal
• 2. Calculate the probability that either the ECG or the RA is
  abnormal




                                                               191
Example




          192
Example
Solutions
1.P(ECG abnormal and RA abnormal) = 7/19 = 0.37
2.P(ECG abnormal or RA abnormal) = P(ECG abnormal) + P(RA
   abnormal) – P(Both ECG and RA abnormal)
             = 17/19 + 9/19 – 7/19 = 19/19 = 1
• NB: We cannot calculate the above probability by simply adding the
   number of patients with abnormal ECGs to the number with
   abnormal RAs, i.e. (17+9)/19 = 1.37, which exceeds 1.
• The problem is that the 7 patients whose ECGs and RAs are
   both abnormal are counted twice
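
The double-counting problem can be made explicit with the counts quoted in the solution (17 abnormal ECGs, 9 abnormal RAs, 7 with both, out of 19 patients). This is a minimal Python sketch; the variable names are illustrative.

```python
from fractions import Fraction

n = 19               # patients
ecg_abnormal = 17
ra_abnormal = 9
both_abnormal = 7

p_both = Fraction(both_abnormal, n)

# Addition rule: subtract the overlap so it is counted only once.
p_either = Fraction(ecg_abnormal, n) + Fraction(ra_abnormal, n) - p_both

# Naive sum: counts the 7 patients with both results abnormal twice.
naive = Fraction(ecg_abnormal + ra_abnormal, n)

print(round(float(p_both), 2), p_either, round(float(naive), 2))  # 0.37 1 1.37
```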




                                                            193
 
Epidemological studies
Epidemological studies Epidemological studies
Epidemological studies bhuvanesh4668
 
2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdf
2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdf2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdf
2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdfAshesh1986
 
Epidemiology introduction
Epidemiology introductionEpidemiology introduction
Epidemiology introductionPapiya Mazumdar
 
1_Intro to Research.pdf
1_Intro to Research.pdf1_Intro to Research.pdf
1_Intro to Research.pdfMharCastro
 
EPIDEMIOLOGY ppt 2.pptx
EPIDEMIOLOGY ppt 2.pptxEPIDEMIOLOGY ppt 2.pptx
EPIDEMIOLOGY ppt 2.pptxshilpas275123
 

Semelhante a Bi ostat for pharmacy.ppt2 (20)

Atoma Research Methodology presentation .pdf
Atoma Research Methodology presentation .pdfAtoma Research Methodology presentation .pdf
Atoma Research Methodology presentation .pdf
 
Research Methadology.pptx
Research Methadology.pptxResearch Methadology.pptx
Research Methadology.pptx
 
STUDY DESIGN in health and medical research .pptx
STUDY DESIGN in health and medical research .pptxSTUDY DESIGN in health and medical research .pptx
STUDY DESIGN in health and medical research .pptx
 
Epidemiology an introduction
Epidemiology an introductionEpidemiology an introduction
Epidemiology an introduction
 
2010-Epidemiology (Dr. Sameem) basics and priciples.ppt
2010-Epidemiology (Dr. Sameem) basics and priciples.ppt2010-Epidemiology (Dr. Sameem) basics and priciples.ppt
2010-Epidemiology (Dr. Sameem) basics and priciples.ppt
 
Epidemiological study designs
Epidemiological study designsEpidemiological study designs
Epidemiological study designs
 
introduction.pptx
introduction.pptxintroduction.pptx
introduction.pptx
 
Epidemiological Study Designs by zafar sir.pptx
Epidemiological Study Designs by zafar sir.pptxEpidemiological Study Designs by zafar sir.pptx
Epidemiological Study Designs by zafar sir.pptx
 
4 Epidemiological Study Designs 1.pdf
4 Epidemiological Study Designs 1.pdf4 Epidemiological Study Designs 1.pdf
4 Epidemiological Study Designs 1.pdf
 
Cohort study
Cohort studyCohort study
Cohort study
 
Epidemiology methods, approaches and tools of measurement
Epidemiology methods, approaches and tools of measurement Epidemiology methods, approaches and tools of measurement
Epidemiology methods, approaches and tools of measurement
 
Analytic upto surviellance
Analytic upto surviellanceAnalytic upto surviellance
Analytic upto surviellance
 
ANALYTICAL STUDIES.pptx
ANALYTICAL STUDIES.pptxANALYTICAL STUDIES.pptx
ANALYTICAL STUDIES.pptx
 
Study types
Study typesStudy types
Study types
 
1 Introduction to Biostatistics.pdf
1 Introduction to Biostatistics.pdf1 Introduction to Biostatistics.pdf
1 Introduction to Biostatistics.pdf
 
Epidemological studies
Epidemological studies Epidemological studies
Epidemological studies
 
2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdf
2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdf2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdf
2. Unit 3 Part II - (c) Cross-sectional & longitudinal study 2022.9.19.pdf
 
Epidemiology introduction
Epidemiology introductionEpidemiology introduction
Epidemiology introduction
 
1_Intro to Research.pdf
1_Intro to Research.pdf1_Intro to Research.pdf
1_Intro to Research.pdf
 
EPIDEMIOLOGY ppt 2.pptx
EPIDEMIOLOGY ppt 2.pptxEPIDEMIOLOGY ppt 2.pptx
EPIDEMIOLOGY ppt 2.pptx
 

Bi ostat for pharmacy.ppt2

  • 3. 1.1.Introduction to Research What is Research? • A scientific study to seek hidden knowledge • A scientific study to answer a question • A scientific study of causes and effects • A scientific attempt towards new discoveries • A systematic method of inquiry • A logical attempt to find answers to problems • A systematic approach to a (medical) problem 3
  • 4. Statistical Concept of Research • Research is a systematic collection, analysis and interpretation of data in order to solve a research question • It is classified as: – Basic research: necessary to generate new knowledge and technologies. – Applied research: necessary to identify priority problems and to design and evaluate policies and programs for optimal health care and delivery. 4
  • 5. 1.2. Types of Epidemiological Design A. Descriptive studies • Mainly concerned with the distribution of diseases with respect to time, place and person. • Useful for health managers to allocate resources and to plan effective prevention programmes. • Useful for generating epidemiological hypotheses, an important first step in the search for disease determinants or risk factors. • Can use routinely collected information that is readily available in many places, so descriptive studies are generally less expensive and less time-consuming than analytic studies. 5
  • 6. • It is the most common type of epidemiological design strategy in the medical literature. • There are three main types: – Correlational – Case report or case series – Cross-sectional 6
  • 7. A.1. Correlational or Ecological • Uses data from entire populations to compare disease frequencies – between different groups during the same period of time, or in the same population at different points in time. • Does not provide individual data; rather, it presents the average exposure level in the community. • Cause cannot be ascertained. • The correlation coefficient is the measure of association in correlational studies. It is important to note that a positive association does not necessarily imply a valid statistical association. 7
  • 8. E.g. • Hypertension rates and average per capita salt consumption compared between two communities. • Average per capita fat consumption and breast cancer rates compared between two communities. • Comparing the incidence of dental caries in relation to the fluoride content of the water among towns in the Rift Valley. • Mortality from CHD in relation to per capita cigarette sales among the regions of Ethiopia. 8
  • 9. • Strength: Can be done quickly and inexpensively, often using available data. • Limitations: – Inability to link exposure with disease in individuals. – Inability to control for the effects of potential confounding factors; other factors may be the true cause. – It may mask a non-linear relationship between exposure and disease. For example, alcohol consumption and mortality from CHD have a non-linear relationship (the curve is “J” shaped). 9
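The correlation coefficient named above as the measure of association in correlational studies can be sketched as follows. This is a minimal illustration of the Pearson coefficient; the function name `pearson_r` and the community-level figures are hypothetical and are not taken from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical community-level data (one value per community, not per person)
salt_g_per_day = [6.0, 8.0, 10.0, 12.0]
hypertension_rate_pct = [12.0, 15.0, 19.0, 22.0]
r = pearson_r(salt_g_per_day, hypertension_rate_pct)
print(round(r, 3))
```

Because each data point is a community average, a high value of r here says nothing about whether the individuals who eat more salt are the ones with hypertension, which is the limitation listed above.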
  • 10. A.2. Case Report and Case Series • Describes the experience of a single patient or a group of patients with a similar diagnosis. Has limited value, but is occasionally revolutionary. • E.g. Five young homosexual men with PCP seen between Oct. 1980 and May 1981 in Los Angeles aroused concern among physicians. Later, with further follow-up and thorough investigation of the strange occurrence of the disease, the diagnosis of AIDS was established for the first time. 10
  • 11. • Strength: – Very useful for hypothesis generation. • Limitations: – The report is based on a single patient or a few patients, so the finding could have happened just by coincidence. – Lack of an appropriate comparison group. 11
  • 12. A.3. Cross-Sectional Studies (Surveys) • Information about the status of an individual with respect to the presence or absence of exposure and disease is assessed at the same point in time. Easy to do; many surveys are like this. • For factors that remain unaltered over time, such as sex, race or blood group, the cross-sectional survey can provide evidence of a valid statistical association. • Useful for raising the question of the presence of an association rather than for testing a hypothesis. 12
  • 13. B. ANALYTIC STUDIES • Focuses on the determinants of a disease by testing the hypothesis formulated from descriptive studies, with the ultimate goal of judging whether a particular exposure causes or prevents disease. • Broadly classified into two – observational and interventional studies. – Both types use “controls”. The use of controls is the main distinguishing feature of analytic studies. 13
  • 14. B.1. Observational studies • Information is obtained by observation of events. No intervention is done. Cohort and case-control studies are in this category. i. Cohort • Subjects are selected by exposure, or determinants of interest, and followed to see if they develop the disease or outcome of interest. • E.g. Follow 100 children who received BCG vaccination and another 100 who didn’t get BCG vaccination and see how many of them get tuberculosis. 14
  • 15. • ii. Case Control • Subjects are selected with respect to presence or absence of disease, or outcome of interest, and then inquiries are made about past exposure to the factor(s) of interest. • E.g. Take people with and without TB, ask them if they ever had BCG vaccination. 15
  • 16. B.2. Interventional / Experimental • The researcher does something about the disease or exposure and observes the changes. • The investigator has control over who gets the exposure and who doesn’t. The key is that the investigator assigns subjects to either group, whether it is done randomly or not. • Always prospective. • E.g. Assign children randomly to get chloroquine or not, and see how many develop symptomatic malaria. 16
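Random assignment of subjects to two groups, as in the chloroquine example above, can be sketched as follows. The function name `randomize_two_arms`, the participant numbers and the seed are assumptions made for illustration; real trials usually use more elaborate schemes such as block or stratified randomization.

```python
import random

def randomize_two_arms(participant_ids, seed=None):
    """Shuffle participants and split them into two equal-sized arms."""
    rng = random.Random(seed)      # seed only makes the illustration repeatable
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical study: 20 children numbered 1..20
arms = randomize_two_arms(range(1, 21), seed=42)
print(len(arms["treatment"]), len(arms["control"]))
```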
  • 17. Description of common terms Statistics: The process of scientifically collecting, organizing, summarizing and interpreting data, and of drawing inferences about a body of data when only part of the data is observed. Biostatistics: A branch of statistics in which the data being analyzed are derived from the biological and medical sciences. Descriptive statistics: A statistical method that is concerned with the collection, organization, summarization, and analysis of data from a sample of a population. Inferential statistics: A statistical method that is concerned with drawing inferences/conclusions about a particular population by selecting and measuring a random sample from the population. 17
  • 18. Population: Is the largest collection of entities/values of a random variable for which we have an interest at a particular time. Population could be finite or infinite. We can take the whole number of students in a given class (e.g. 100 students) as a population. • Target population: A collection of items that have something in common for which we wish to draw conclusions at a particular time. • Study Population: The specific population from which data are collected 18
  • 19. Sample: It is some part/subset of the population of interest. In the above example, if we randomly select 25 students from the 100, we call those 25 students a sample of the class. Hence, generalizability is a two-stage procedure: we want to generalize from the sample to the study population and then from the study population to the target population. 19
  • 20. E.g. In a study of the prevalence of HIV among orphan children in Ethiopia, a random sample of orphan children in Lideta Kifle Ketema was included. Target population: All orphan children in Ethiopia. Study population: All orphan children in Addis Ababa. Sample: Orphan children in Lideta Kifle Ketema. 20
  • 21. Statistical inference: It is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample that has been drawn from that population. Parameter: A descriptive measure computed from the data of a population; a numerical expression of population measurements, e.g. population mean (µ), population variance, population standard deviation, etc. Statistic: A descriptive measure computed from the data of a sample. Statistical data: Information that is systematically collected, tabulated and analyzed, and whose results are interpreted to draw conclusions. 21
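The distinction between a parameter and a statistic can be illustrated with the class of 100 students and the sample of 25 mentioned earlier. The values below are hypothetical stand-ins (the numbers 1 to 100), and the seed is arbitrary.

```python
import random
import statistics

# Hypothetical population: one measurement per student in a class of 100
population = list(range(1, 101))

# Parameter: computed from every unit in the population
mu = statistics.mean(population)

# Statistic: computed from a random sample of 25 units
rng = random.Random(7)             # arbitrary seed, for repeatability only
sample = rng.sample(population, 25)
x_bar = statistics.mean(sample)

print("population mean (parameter):", mu)
print("sample mean (statistic):", x_bar)
```

The sample mean changes from one random sample to the next, while the population mean stays fixed; statistical inference uses the former to draw conclusions about the latter.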
  • 22. • Data: aggregate of variables as a result of measurement or counting. • Variable: A characteristic that takes on different values in different persons, places, or things. – Dependent variable (response): the variable(s) we measure as an outcome of interest. – Independent variable (predictor): the variable(s) that determine the outcome. 22
  • 23. Categorical variable: The notion of magnitude is absent or implicit. – Nominal: have distinct levels that have no inherent ordering. – When there are only two categories, they are called binary or dichotomous. E.g. sex: male or female. – When there are more than two categories, they are called polytomous, e.g. color. – Ordinal: have levels that do follow a distinct ordering. E.g. severity of pain (mild, moderate, severe). 23
  • 24. Quantitative (numeric) variable: Variable that has magnitude. • Discrete data: when numbers represent actual measurable quantities rather than mere labels. Discrete data are restricted to taking only specified values, often integers or counts, that differ by fixed amounts. E.g. number of new AIDS cases reported during a one-year period, number of beds available in a particular hospital. • Continuous data: represent measurable quantities but are not restricted to taking on certain specific values, i.e. fractional values are possible. Can use an interval scale (no true zero value) or a ratio scale (begins at zero). – E.g. weight, cholesterol level, time, temperature. 24
  • 25. 1.3.Sampling Methods Sampling • The process of selecting a portion of the population to represent the entire population. • A main concern in sampling: – Ensure that the sample represents the population, and • The findings can be generalized. 25
  • 26. Advantages of sampling: • Feasibility: Sampling may be the only feasible method of collecting information. • Reduced cost: Sampling reduces demands on resources such as finance, personnel, and material. • Greater accuracy: Sampling may lead to more accurate data collection. • Sampling error: Precise allowance can be made for sampling error. • Greater speed: Data can be collected and summarized more quickly. 26
  • 27. Disadvantages of sampling: • There is always a sampling error. • Sampling may create a feeling of discrimination within the population. • Sampling may be inadvisable where every unit in the population is legally required to have a record. Errors in sampling 1) Sampling error: Errors introduced due to selection of a sample. – They cannot be avoided or totally eliminated. 2) Non-sampling error: - Observational error - Respondent error - Lack of preciseness of definition - Errors in editing and tabulation of data 27
  • 28. Divisions of Sampling Methods Two broad divisions: A. Probability sampling methods B. Non-probability sampling methods 28
  • 29. 1.4.1. Probability sampling • Involves random selection of a sample. • A sample is obtained in a way that ensures every member of the population has a known, nonzero probability of being included in the sample. • Involves the selection of a sample from a population, based on chance. 29
  • 30. • Probability sampling is: – more complex, – more time-consuming and – usually more costly than non-probability sampling. • However, because study samples are randomly selected and their probability of inclusion can be calculated, – reliable estimates can be produced and • inferences can be made about the population. 30
  • 31. • There are several different ways in which a probability sample can be selected. • The method chosen depends on a number of factors, such as – the available sampling frame, – how spread out the population is, – how costly it is to survey members of the population 31
  • 32. Most common probability sampling methods 1. Simple random sampling 2. Systematic random sampling 3. Stratified random sampling 4. Cluster sampling 5. Multi-stage sampling 32
  • 33. 1. Simple random sampling(SRS) • Involves random selection • Each member of a population has an equal chance of being included in the sample. • To use a SRS method: – Make a numbered list of all the units in the population – Each unit should be numbered from 1 to N (where N is the size of the population) – Select the required number. 33
  • 34. • The randomness of the sample is ensured by: – use of “lottery” methods – a table of random numbers – computer programs • Example • Suppose your school has 500 students and you need to conduct a short survey on the quality of the food served in the cafeteria. • You decide that a sample of 10 students should be sufficient for your purposes. • In order to get your sample, you assign a number from 1 to 500 to each student in your school. 34
  • 35. • To select the sample, you use a table of randomly generated numbers. • Pick a starting point in the table (a row and column number) and look at the random numbers that appear there. In this case, since the data run into three digits, the random numbers would need to contain three digits as well. • Ignore all random numbers greater than 500 because they do not correspond to any of the students in the school. • Remember that the sample is drawn without replacement, so if a number recurs, skip over it and use the next random number. • The first 10 different numbers between 001 and 500 make up your sample. 35
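The cafeteria example can also be carried out with a computer program rather than a random number table. The sketch below assumes Python's standard `random` module; the function name `simple_random_sample` and the seed are illustrative choices, not part of the original example.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw unit numbers from 1..N without replacement, each equally likely."""
    rng = random.Random(seed)                  # seed only for repeatability
    units = range(1, population_size + 1)      # students numbered 1 to N
    return sorted(rng.sample(units, sample_size))

# 500 students, sample of 10, as in the example above
sample = simple_random_sample(500, 10, seed=1)
print(sample)
```

`random.sample` never returns the same unit twice, which corresponds to skipping repeated numbers when reading a random number table.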
  • 36. • SRS has certain limitations: – Requires a sampling frame. – Difficult if the reference population is dispersed. – Minority subgroups of interest may not be selected. 36
  • 37. 2. Systematic random sampling • Sometimes called interval sampling, systematic sampling means that there is a gap, or interval, between each selected unit in the sample. • The selection is systematic rather than random. – Individuals are chosen at regular intervals from the sampling frame. Ideally we randomly select a number to tell us where to start selecting individuals from the list. • Important if the reference population is arranged in some order: – Order of registration of patients – Numerical order of house numbers – Students’ registration books – Taking individuals at fixed intervals (every kth) based on the sampling fraction, e.g. if the sample includes 20%, then every fifth. 37
  • 38. Steps in systematic random sampling 1. Number the units on your frame from 1 to N (where N is the total population size). 2. Determine the sampling interval (K) by dividing the number of units in the population by the desired sample size. 38
  • 39. Steps… • To find one study unit during a survey, it is important to figure out how many houses must be visited, usually by doing a pilot study. • Example: Assume you are doing a study involving children under 5. There are 1500 households in all, and you have a required sample size of 100 children. From a preliminary study you have done, there is one child every 2.5 households. Normally, if there were a child in every household, you would visit 100 households. But because not every household includes a child, you will need to visit 100 x 2.5 or 250 households to find the required 100 children. • The sampling interval will therefore be 1500/250, or every 6th household. 39
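The arithmetic in the household example can be written out as a short calculation. The helper names below are hypothetical; the figures (1500 households, 100 children, one child per 2.5 households) are those given in the example.

```python
import math

def households_to_visit(required_units, households_per_unit):
    """Households needed to find the required number of study units."""
    return math.ceil(required_units * households_per_unit)

def sampling_interval(total_households, households_needed):
    """Sampling interval K: population size divided by the number to visit."""
    return total_households // households_needed

visits = households_to_visit(100, 2.5)   # 100 children x 2.5 households each
k = sampling_interval(1500, visits)      # 1500 / 250
print(visits, k)                         # prints: 250 6
```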
  • 40. 3. Select a number between one and K at random. This number is called the random start and would be the first number included in your sample. 4. Select every Kth unit after that first number Note: Systematic sampling should not be used when a cyclic repetition is inherent in the sampling frame. 40
  • 41. Example To select a sample of 100 from a population of 400, you would need a sampling interval of 400 ÷ 100 = 4. Therefore, K = 4. You will need to select one unit out of every four units to end up with a total of 100 units in your sample. Select a number between 1 and 4 from a table of random numbers. • If you choose 3, the third unit on your frame would be the first unit included in your sample; • The sample might consist of the following units to make up a sample of 100: 3 (the random start), 7, 11, 15, 19...395, 399 (up to N, which is 400 in this case). 41
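The steps of systematic sampling can be sketched as below, assuming the population size is an exact multiple of the sample size as in the example (400 and 100). The function name `systematic_sample` and its `start` parameter are illustrative assumptions.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every Kth unit from a frame numbered 1..N after a random start."""
    k = population_size // sample_size          # sampling interval K
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start in 1..K
    return [start + i * k for i in range(sample_size)]

# Reproduce the example: N = 400, n = 100, random start = 3
units = systematic_sample(400, 100, start=3)
print(units[:5], units[-2:])   # prints: [3, 7, 11, 15, 19] [395, 399]
```

Since the random start can only take the values 1 to K, only K different samples are possible.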
  • 42. The main difference is that with SRS any combination of 100 units would have a chance of making up the sample, while with systematic sampling there are only four possible samples. 42
  • 43. Advantages • Systematic sampling is usually less time-consuming and easier to perform than SRS. • It provides a good approximation to SRS (i.e. it has high precision). • Unlike SRS, systematic sampling can be conducted without a sampling frame, so it is useful when a sampling frame is not readily available. – E.g. patients attending a health center, where it is not possible to predict in advance who will be attending. 43
  • 44. Disadvantage • If there is any sort of cyclic pattern in the ordering of the subjects, which coincides with the sampling interval, the sample will not be representative of the population. – May result in systematic error 44
  • 45. 3. Stratified random sampling • It is done when the population is known to have heterogeneity with regard to some factors and those factors are used for stratification • Using stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata, and – A population can be stratified by any variable that is available for all units prior to sampling (e.g., age, sex, province of residence, income, etc.). • A separate sample is taken independently from each stratum. • Any of the sampling methods mentioned in this section (and others that exist) can be used to sample within each stratum. 45
  • 46. Why do we need to create strata? • Stratification can make the sampling strategy more efficient. • A larger sample is required to get a more accurate estimate if a characteristic varies greatly from one unit to the other. • For example, if every person in a population had the same salary, then a sample of one individual would be enough to get a precise estimate of the average salary. • This is the idea behind the efficiency gain obtained with stratification. – If you create strata within which units share similar characteristics (e.g., income) and are considerably different from units in other strata (e.g., occupation, type of dwelling), then you would only need a small sample from each stratum to get a precise estimate of total income for that stratum. 46
  • 47. – Then you could combine these estimates to get a precise estimate of total income for the whole population. • If you use a SRS approach in the whole population without stratification, the sample would need to be larger than the total of all stratum samples to get an estimate with the same level of precision. 47
  • 48. • Stratified sampling ensures an adequate sample size for sub- groups in the population of interest. • When a population is stratified, each stratum becomes an independent population and you will need to decide the sample size for each stratum. 48
  • 49. • Equal allocation: – Allocate equal sample size to each stratum • Proportionate allocation: $n_j = \frac{n}{N} N_j$, $j = 1, 2, \ldots, k$, where $k$ is the number of strata and – $n_j$ is the sample size of the $j$th stratum – $N_j$ is the population size of the $j$th stratum – $n = n_1 + n_2 + \cdots + n_k$ is the total sample size – $N = N_1 + N_2 + \cdots + N_k$ is the total population size 49
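Proportionate allocation gives each stratum a share of the total sample equal to its share of the population, n_j = (n / N) × N_j. A minimal sketch follows; the function name and the stratum sizes are hypothetical.

```python
def proportional_allocation(stratum_sizes, total_sample_size):
    """Return n_j = (n / N) * N_j for each stratum, rounded to whole units."""
    population_size = sum(stratum_sizes.values())          # N
    return {
        name: round(total_sample_size * size / population_size)
        for name, size in stratum_sizes.items()
    }

# Hypothetical strata with N = 1000 and a total sample of n = 100
strata = {"A": 500, "B": 300, "C": 200}
allocation = proportional_allocation(strata, 100)
print(allocation)   # prints: {'A': 50, 'B': 30, 'C': 20}
```

When the stratum sizes do not divide evenly, the rounded values may not add up exactly to n, and one or two strata then need to be adjusted by hand.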
  • 50. 4. Cluster sampling • Sometimes it is too expensive to spread a sample across the population as a whole. • Travel costs can become expensive if interviewers have to survey people from one end of the country to the other. • To reduce costs, researchers may choose a cluster sampling technique. • The clusters should be similar to one another (homogeneous between clusters), unlike stratified sampling, where the strata differ from one another. 50
  • 51. Steps in cluster sampling • Cluster sampling divides the population into groups or clusters. • A number of clusters are selected randomly to represent the total population, and then all units within selected clusters are included in the sample. • No units from non-selected clusters are included in the sample— they are represented by those from selected clusters. • This differs from stratified sampling, where some units are selected from each group. 51
  • 52. Example • In a school-based study, we assume students of the same school are homogeneous. • We can randomly select sections and include all students of the selected sections only. 52
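The school example can be sketched as follows: sections are the clusters, a few are chosen at random, and every student in a chosen section enters the sample. The section names, class sizes and function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # sample cluster names
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical school with four sections of 30 students each
sections = {
    "Section A": range(1, 31),
    "Section B": range(31, 61),
    "Section C": range(61, 91),
    "Section D": range(91, 121),
}
selected = cluster_sample(sections, 2, seed=3)
print({name: len(students) for name, students in selected.items()})
```

Students in sections that were not selected never appear in the sample, which is what distinguishes this from stratified sampling.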
  • 53. • As mentioned, cost reduction is a reason for using cluster sampling. • It creates 'pockets' of sampled units instead of spreading the sample over the whole territory. • Another reason is that sometimes a list of all units in the population is not available, while a list of all clusters is either available or easy to create. 53
  • 54. • In most cases, the main drawback is a loss of efficiency when compared with SRS. • It is usually better to survey a large number of small clusters instead of a small number of large clusters. – This is because neighboring units tend to be more alike, resulting in a sample that does not represent the whole spectrum of opinions or situations present in the overall population. 54
  • 55. • Another drawback to cluster sampling is that you do not have total control over the final sample size. • Because schools do not all have the same number of (say, Grade 11) students and city blocks do not all have the same number of households, and because you must interview every student or household in each selected cluster, the final sample size may be larger or smaller than you expected. 55
  • 56. 5. Multi-stage sampling • Similar to the cluster sampling, except that it involves picking a sample from within each chosen cluster, rather than including all units in the cluster. • This type of sampling requires at least two stages. 56
  • 57. • In the first stage, large groups or clusters are identified and selected. These clusters contain more population units than are needed for the final sample. • In the second stage, population units are picked from within the selected clusters (using any of the possible probability sampling methods) for a final sample. 57
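The two stages described above can be sketched as below, using simple random sampling at both stages. The cluster names, sizes and function name are hypothetical, and any other probability method could be substituted at the second stage.

```python
import random

def two_stage_sample(clusters, n_clusters, units_per_cluster, seed=None):
    """Stage 1: sample clusters. Stage 2: sample units within each cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)        # first stage
    result = {}
    for name in chosen:
        units = list(clusters[name])
        take = min(units_per_cluster, len(units))            # guard small clusters
        result[name] = sorted(rng.sample(units, take))       # second stage
    return result

# Hypothetical frame: three schools with 40 students each
schools = {
    "School 1": range(1, 41),
    "School 2": range(41, 81),
    "School 3": range(81, 121),
}
picked = two_stage_sample(schools, 2, 5, seed=11)
print(picked)
```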
  • 58. • If more than two stages are used, the process of choosing population units within clusters continues until there is a final sample. • With multi-stage sampling, you still have the benefit of a more concentrated sample for cost reduction. • However, the sample is not as concentrated as in cluster sampling, and the required sample size is still larger than for a simple random sample. 58
  • 59. • Also, you do not need to have a list of all of the units in the population. All you need is a list of clusters and list of the units in the selected clusters. • Admittedly, more information is needed in this type of sample than what is required in cluster sampling. However, multi-stage sampling still saves a great amount of time and effort by not having to create a list of all the units in a population. 59
  • 60. 1.4.2. Non-probability sampling • The difference between probability and non-probability sampling has to do with a basic assumption about the nature of the population under study. • In probability sampling, every item has a known chance of being selected. • In non-probability sampling, there is an assumption that there is an even distribution of a characteristic of interest within the population. 60
  • 61. • This is what makes the researcher believe that any sample would be representative and that, because of this, the results will be accurate. • For probability sampling, randomness is a feature of the selection process, rather than an assumption about the structure of the population. 61
  • 62. • In non-probability sampling, since elements are chosen arbitrarily, there is no way to estimate the probability of any one element being included in the sample. • Also, no assurance is given that each item has a chance of being included, making it impossible either to estimate sampling variability or to identify possible bias 62
  • 63. • Reliability cannot be measured in non-probability sampling; the only way to address data quality is to compare some of the survey results with available information about the population. • Still, there is no assurance that the estimates will meet an acceptable level of error. • Researchers are reluctant to use these methods because there is no way to measure the precision of the resulting sample. 63
  • 64. • Despite these drawbacks, non-probability sampling methods can be useful when descriptive comments about the sample itself are desired. • Secondly, they are quick, inexpensive and convenient. • There are also other circumstances, such as in some research settings, when it is unfeasible or impractical to conduct probability sampling. 64
  • 65. Common types of non-probability sampling 1. Convenience or haphazard sampling 2. Volunteer sampling 3. Judgment sampling 4. Quota sampling 5. Snowball sampling technique 65
  • 66. 1.4. Scales of measurement • Measurement: the assignment of numbers or names to objects or events according to a set of rules. • Clearly, not all measurements are the same. • Measuring an individual’s weight is qualitatively different from measuring their response to some treatment on a three-category scale: “improved”, “stable”, “not improved”. • Measuring scales differ according to the degree of precision involved. • There are four types of scales of measurement. 66
  • 67. Scales… 1. Nominal scale: uses names, labels, or symbols to assign each measurement to one of a limited number of categories that cannot be ordered. – Examples: Blood type, sex, race, marital status 2. Ordinal scale: assigns each measurement to one of a limited number of categories that are ranked in terms of a graded order. – Examples: Patient status, Cancer stages 67
  • 68. Scales… 3. Interval scale: assigns each measurement to one of an unlimited number of categories that are equally spaced. It has no true zero point. – Example: Temperature measured in Celsius or Fahrenheit 4. Ratio scale: measurement begins at a true zero point and the scale has equal spacing. – E.g. Height, weight, blood pressure 68
  • 70. 1.5.Validity and reliability Validity and Reliability are two major requirements for any measurement. – Validity pertains to the correctness of the measure; a valid tool measures what it is supposed to measure. – Reliability pertains to the consistency of the tool across different contexts. • Validity is often described as internal or external. 70
  • 71. 1.6. Sources and methods of data collection and its handling Sources: there are two major sources. Primary sources are data collected by the investigator himself/herself for the purpose of a specific inquiry or study. Such data are original in character and are mostly generated by surveys conducted by individuals or research institutions. First-hand information obtained by the investigator is more reliable and accurate, since the investigator can extract the correct information by removing any doubts in the minds of the respondents regarding certain questions. High response rates might be obtained, since the answers to the various questions are obtained on the spot. It also permits explanation of questions concerning difficult subject matter.
  • 72. Secondary data Secondary Data: When an investigator uses data, which have already been collected by others, such data are called "Secondary Data". Such data are primary data for the agency that collected them, and become secondary for someone else who uses these data for his/her own purposes. The secondary data can be obtained from journals, reports of different institutions, government publications, publications of professionals and research organizations. These data are less expensive and can be collected in a short time. 72
  • 73. Data collection methods 1. Observation • is a technique that involves systematically selecting, watching and recording the behaviour of people or other phenomena, and aspects of the setting in which they occur, for the purpose of getting specified information. • includes all methods from simple visual observation to the use of sophisticated equipment or facilities, such as radiographic and biochemical equipment, X-ray machines, microscopes, clinical examinations, and microbiological examinations.
  • 74. Observation… • Advantages: gives relatively more accurate data on behaviour and activities. • Disadvantages: the investigator's or observer's own biases, prejudices and desires may influence the data. • It needs more resources and skilled human power when sophisticated equipment is used.
  • 75. 2. The Documentary sources • Include clinical records and other personal records, published mortality statistics, census publications, etc. • Advantages: a) Documents can provide ready-made information relatively easily b) The best means of studying past events • Disadvantages: a) Problems of reliability and validity (because the information is collected by a number of different persons who may have used different definitions or methods of obtaining data). b) There is a possibility that errors may occur when the information is extracted from the records . 75
  • 76. 3. Interviews and self-administered questionnaire a) Interviews: may be less or more structured. A public health worker conducting interviews may be armed with a checklist of topics, but may not decide in advance precisely what questions he/she will ask. • This approach is flexible; the content, wording and order of the questions are relatively unstructured. – the content, wording and order of the questions vary from interview to interview. 76
  • 77. Interviews… On the other hand, in other situations a more standardized technique may be used, the wording and order of the questions being decided in advance. This may take the form of a highly structured interview(interviewing using questionnaire), • the investigator appoints persons/enumerators, who go to the respondents personally with the questionnaire, ask them questions and record their replies. – This can be done using telephone or face-to-face interviews. 77
  • 78. Interviews… • Questions may take two general forms: they may be “open ended” questions, which the subject answers in his/her own words, • or “closed” questions, which are answered by choosing from a number of fixed alternative responses. 78
  • 79. Advantage of interview • A good interviewer can stimulate and maintain the respondent’s interest. This leads to the frank answering of questions. • If anxiety is aroused (e.g., why am I being asked these questions?) , the interviewer can allay it. An interviewer: • can repeat questions which are not understood, and give standardized explanations where necessary. • can ask “follow-up” or “probing” questions to clarify a response. • can make observations during the interview; • i.e., note is taken not only of what the subject says but also how he/she says it. 79
  • 80. b. self-administered questionnaire • The respondent reads the questions and fills in the answers by himself/herself (sometimes in the presence of an interviewer who “stands by” to give assistance if necessary). • The use of self-administered questionnaires is simpler and cheaper; • can be administered to many persons simultaneously (e.g. to a class of school children). • They can be sent by post. However, they demand a certain level of education on the part of the respondent. 80
  • 81. • Quantitative data are commonly collected using structured interviews (where standard questionnaires are common and the collected data can be processed relatively easily), whereas • qualitative data are usually collected using unstructured interviews. • Unstructured interviews are undertaken with the help of checklists, key informant interviews, focus group discussions, etc.
  • 82. Qualitative… Checklist - is a list of questions prepared ahead of time to facilitate the interviews or discussions. It is not an exhaustive one. It helps the facilitator not to miss any of the important topics under consideration. Key informant interviews – interviews done with influential individuals (such as community elders, priests, etc.). Focus group discussions – discussions made with a group of respondents. • The group contains 6 to 12 people who are more or less similar with respect to level of education, marital status, age, sex, etc. (this composition helps each respondent to talk freely without being dominated by the other). 82
  • 83. Steps in Questionnaire Design 1. Before beginning to construct the questionnaire, make sure that a questionnaire is the best method of collecting data for your objectives – Know beforehand what information is needed and what is going to be done with this information 2. While drafting the questions, one has to know why each question is asked and what will be done with the information (to prevent wasting resources)
  • 84. Steps in… 3. To get valid and reliable information: • the wording and sequence of the questions should help respondents recall information and guard against forgetfulness • avoid difficult, time-consuming, embarrassing or overly personal questions • the flow of questions should run from simple to complex, from general to specific, and from impersonal to personal • take care to protect the respondent's confidentiality: – include a cover letter (if by mail) – identify respondents by ID (rather than by name)
  • 85. Data Collection and handling Process 85
  • 86. Data collection A plan for data collection can be made in two steps: 1. Listing the tasks that have to be carried out and who should be involved, making a rough estimate of the time needed for the different parts of the study, and identifying the most appropriate period in which to carry out the research 2. Actually scheduling the different activities that have to be carried out each week in a work plan 86
  • 87. Why should you develop a plan for data collection? A plan for data collection should be developed so that: – you will have a clear overview of what tasks have to be carried out, who should perform them, and the duration of these tasks; – you can organize both human and material resources for data collection in the most efficient way; and – you can minimize errors and delays which may result from lack of planning (for example, the population not being available or data forms being misplaced). 87
  • 88. Data collection process Stages • Stage 1: Permission to proceed – Obtaining consent from the relevant authorities, individuals and the community in which the project is to be carried out 88
  • 89. Data collection process Stage 2: Data collection • Logistics – who will collect what, – when and – with what resources • Quality control – Prepare a field work manual – Select your research assistants – Train research assistants – Supervision – Checked for completeness and accuracy 89
  • 90. Data collection process • How long will it take to collect the data for each component of the study? – Step 1: Consider the time required to reach the study area; to locate the study units; the number of visits required per study unit and for follow-up of non- respondents – Step 2: Calculate the number of interviews that can be carried out per person per day – Step 3: Calculate the number of days needed to carry out the interviews. 90
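The three steps above amount to a small piece of arithmetic. A minimal sketch follows; the function name `days_needed` and every figure in the usage line are hypothetical illustrations, not values from the slides.

```python
import math


def days_needed(n_interviews, n_interviewers, per_person_per_day):
    """Whole days of field work needed to complete all interviews."""
    # Step 2: interviews the whole team can complete in one day
    team_per_day = n_interviewers * per_person_per_day
    # Step 3: round up, because a partly used day is still a field day
    return math.ceil(n_interviews / team_per_day)


# Hypothetical study: 400 interviews, 5 interviewers, 6 interviews each per day
field_days = days_needed(n_interviews=400, n_interviewers=5, per_person_per_day=6)
print(field_days)
```

Travel time, repeat visits and follow-up of non-respondents (Step 1) would lower the realistic number of interviews per person per day, so that input should be estimated conservatively.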
  • 91. Ensuring data quality Measures to help ensure good quality of data:  Prepare a field work manual for the research team as a whole  Select your research assistants, if required, with care  Train research assistants carefully in all topics covered in the field work manual as well as in interview techniques  Pre-test research instruments and research procedures with the whole research team, including research assistants. 91
  • 92. Ensuring data quality  Take care that research assistants are not placed under too much stress  Arrange for on-going supervision of research assistants and guidelines should be developed for supervisory tasks.  Devise methods to assure the quality of data collected by all members of the research team. 92
  • 93. Data Collection Process Stage 3: Data handling • Once the data have been collected and checked for completeness and accuracy, a clear procedure should be developed for handling and storing them • Numbering of all questionnaires • Identify the person responsible for storing data and the place where it will be stored • Decide how data should be stored. Record forms should be kept in the sequence in which they have been numbered. 93
  • 94. Research Assistants • This includes – data collectors, supervisors and may be local guides • Selection – during selection one should consider similarities in educational level and may be sex composition • Training – all research assistants and team members should be trained together 94
  • 95. Pre-test and pilot study A pre-test usually refers to a small-scale trial of particular research components. A pilot study is the process of carrying out a preliminary study, going through the entire research procedure with a small sample. Why do we carry out a pre-test or pilot study? A pre-test or pilot study serves as a trial run that allows us to identify potential problems in the proposed study. 95
  • 96. Pre-test and pilot study What aspects of your research methodology can be evaluated during pre-testing? 1. Reactions of the respondents to the research procedures can be observed in the pre-test – availability and willingness 2. The data-collection tools can be pre-tested 3. Sampling procedures can be checked 4. Staffing and activities of the research team can be checked, while all are involved in the pre-test 5. Procedures for data processing and analysis can be evaluated during the pre-test 6. The proposed work plan and budget for research activities can be assessed during the pre-test. 96
  • 97. Plan for data processing & analysis • Data processing and analysis should start in the field, with checking for completeness of the data and • Performing quality control checks, while sorting the data by instrument used and by group of informants • Data of small samples may even be processed and analyzed as soon as it is collected. 97
  • 98. Plan for data processing & analysis • The plan for data processing and analysis must be made after careful consideration of the objectives of the study as well as of the tools developed to meet the objectives. • The procedures for the analysis of data collected through qualitative and quantitative techniques are quite different. – For quantitative data the starting point in analysis is usually a description of the data for each variable – For qualitative data it is more a matter of describing, summarizing and interpreting the data obtained for each study unit 98
  • 99. Plan for data processing & analysis • When making a plan for data processing and analysis the following issues should be considered: – Sorting data, –  Performing quality-control checks, –  Data processing, and –  Data analysis. 99
  • 100. Data processing and analysis • Sorting data – Into groups of different study populations or comparison groups • Quality control checks – Check again for completeness and internal consistency – Missing data - if many exclude the questionnaire – Inconsistency - correct, return or exclude 100
  • 101. Data processing • Decide whether to process and analyse the data from questionnaires: – manually, using data master sheets or manual compilation of the questionnaires, or – by computer, for example, using a micro-computer and existing software or self-written programmes for data analysis. • Data processing in both cases involves: • categorising the data, • coding, and • summarising the data in data master sheets, manual compilation without master sheets, or • data entry and verification by computer. 101
  • 102. 2.Descriptive statistics (Data summarization) 102
  • 103. 2. Data summarization (Descriptive statistics) 2.1. Describing variables The methods of describing variables differ depending on the type of data: categorical or numerical. Sometimes we transform numeric data into categorical data (e.g. age) – when a lesser degree of detail is required. • This is achieved by dividing the range of values that the numeric variable takes into intervals.
  • 104. Describing… Categorical variables • Table of frequency distributions – Frequency – Relative frequency – Cumulative frequencies • Charts – Bar charts – Pie charts 104
  • 106. In summary, • There are three ways we can summarize and present data: • Tabular representation - summarizing data by making a table of the data called frequency distributions. • Graphical representation of data - we can make a graph of the data. • Numerical representation of data - we can use a single number to represent many numbers. – Measures of central tendency. – Measures of variability. 106
  • 107. 2.2. Frequency Distribution • A frequency distribution shows the number of observations falling into each of several ranges of values. • There are four different types of frequency distribution: – Simple frequency distribution (or just a frequency distribution). – Cumulative frequency distribution. – Grouped frequency distribution. – Cumulative grouped frequency distribution. • Frequency distributions are portrayed as frequency tables, histograms, or polygons. • They can show either the actual number of observations falling in each range or the percentage of observations. In the latter instance, the distribution is called a relative frequency distribution.
  • 108. Simple frequency distribution Consider the following set of data which are the high temperatures recorded for 30 consecutive days. We wish to summarize this data by creating a frequency distribution of the temperatures. Data Set - High Temperatures for 30 Days 50 45 49 50 43 49 50 49 45 49 47 47 44 51 51 44 47 46 50 44 51 49 43 43 49 45 46 45 51 46 108
  • 109. Simple frequency distribution… To create a frequency distribution from this data, proceed as follows: 1. Identify the highest and lowest values in the data set. For our temperatures the highest temperature is 51 and the lowest temperature is 43. 2. Create a column with the title of the variable we are using, in this case temperature. Enter the highest score at the top, and include all values within the range from the highest score to the lowest score.
  • 110. Simple frequency… 3. Create a tally column to keep track of the scores as you enter them into the frequency distribution. Once the frequency distribution is completed you can omit this column. 4. Create a frequency column and record the frequency of each value, as shown in the tally column. 5. At the bottom of the frequency column record the total frequency for the distribution, preceded by N =. 6. Enter the name of the frequency distribution at the top of the table.
  • 111. Simple frequency… If we apply these steps to the temperature data above, we obtain the following frequency distribution.
    Frequency Distribution for High Temperatures
    Temperature   Tally     Frequency
    51            ////      4
    50            ////      4
    49            //// /    6
    48                      0
    47            ///       3
    46            ///       3
    45            ////      4
    44            ///       3
    43            ///       3
                            N = 30
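The tallying steps can also be sketched in a few lines of Python. This is only an illustration using the 30 temperatures listed on the earlier slide; the variable names are assumptions made for the example.

```python
from collections import Counter

# High temperatures for 30 consecutive days (from the data set slide)
temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

counts = Counter(temps)  # plays the role of the tally column

# List every value from the highest to the lowest score,
# including values that were never observed (frequency 0).
freq_table = [(value, counts.get(value, 0))
              for value in range(max(temps), min(temps) - 1, -1)]

for value, freq in freq_table:
    print(f"{value:>11}  {'/' * freq:<6}  {freq:>2}")
print(f"N = {sum(freq for _, freq in freq_table)}")
```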
  • 112. Cumulative frequency distribution To create a cumulative frequency distribution: • Create a frequency distribution • Add a column entitled cumulative frequency • The cumulative frequency for each score is the sum of the frequencies of all scores up to and including that score • The highest cumulative frequency should equal N (the total of the frequency column)
  • 113. Cumulative frequency…
    Cumulative Frequency Distribution for High Temperatures
    Temperature   Tally     Frequency   Cumulative Frequency
    51            ////      4           30
    50            ////      4           26
    49            //// /    6           22
    48                      0           16
    47            ///       3           16
    46            ///       3           13
    45            ////      4           10
    44            ///       3           6
    43            ///       3           3
                            N = 30
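As a rough sketch, the cumulative column is a running total of the frequency column taken from the lowest score upward. The lists below simply restate the temperature frequencies; the variable names are illustrative.

```python
from itertools import accumulate

# Temperature values from lowest to highest, with their frequencies
values = [43, 44, 45, 46, 47, 48, 49, 50, 51]
freqs = [3, 3, 4, 3, 3, 0, 6, 4, 4]

# Running total: frequency up to and including each score
cum_freqs = list(accumulate(freqs))

for value, freq, cum in zip(values, freqs, cum_freqs):
    print(f"{value}  {freq}  {cum}")
```

The last cumulative frequency equals N, which is a quick check that no observation was lost while tallying.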
  • 114. Grouped frequency distribution To create a grouped frequency distribution: • select an interval size so that you have 7-20 class intervals (the number of classes can also be chosen using Sturges' rule) • create a class interval column and list each of the class intervals • each interval must be the same size, intervals must not overlap, and there may be no gaps within the range of class intervals • create a tally column (optional) • create a midpoint column for interval midpoints • create a frequency column • enter N = some value at the bottom of the frequency column
  • 115. Grouped frequency for the temperature data
    Grouped Frequency Distribution for High Temperatures
    Class Interval   Tally         Interval Midpoint   Frequency
    57-59            //////        58                  6
    54-56            ///////       55                  7
    51-53            ///////////   52                  11
    48-50            /////////     49                  9
    45-47            ///////       46                  7
    42-44            //////        43                  6
    39-41            ////          40                  4
                                                       N = 50
  • 116. Cumulative grouped frequency distribution We just add a cumulative frequency column to the grouped frequency distribution and we have a cumulative grouped frequency distribution, as shown below.
    Cumulative Grouped Frequency Distribution for High Temperatures
    Class Interval   Tally         Interval Midpoint   Frequency   Cumulative Frequency
    57-59            //////        58                  6           50
    54-56            ///////       55                  7           44
    51-53            ///////////   52                  11          37
    48-50            /////////     49                  9           26
    45-47            ///////       46                  7           17
    42-44            //////        43                  6           10
    39-41            ////          40                  4           4
                                                       N = 50
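A hedged sketch of how the class intervals in such a table could be chosen. It assumes the common form of Sturges' rule, k = 1 + 3.322 log10(n), rounded up, and takes n = 50 with lowest and highest class limits 39 and 59 from the grouped table; the function name is illustrative and the raw observations are not available, so only the intervals and midpoints are built.

```python
import math


def sturges_classes(n):
    """Number of classes suggested by Sturges' rule, rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


n, lowest, highest = 50, 39, 59               # read off the grouped table
k = sturges_classes(n)                        # suggested number of classes
width = math.ceil((highest - lowest) / k)     # common class width

# Equal-width, non-overlapping intervals with no gaps (integer data assumed)
intervals = [(lo, lo + width - 1) for lo in range(lowest, highest + 1, width)]
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

print(k, width)
print(intervals)
print(midpoints)
```

With these inputs the rule suggests 7 classes of width 3, which agrees with the seven intervals 39-41 through 57-59 in the table.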
  • 117. Relative Frequency • Sometimes it is useful to compute the proportion, or percentage, of observations in each category. • The relative frequency of a particular category is the proportion (fraction) of observations that fall into that category. • The cumulative frequency (or cumulative proportion) of a category is the sum of the frequencies (or proportions) of all categories up to and including that category. – The cumulative relative frequency is the relative frequency of items less than or equal to the upper class limit of each class. • Cumulative frequencies can be computed for quantitative data and for categorical (qualitative) data, but only if the latter are ordinal.
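A minimal sketch of these definitions, reusing the temperature frequencies from the earlier slides. The variable names are illustrative only.

```python
from itertools import accumulate

values = [43, 44, 45, 46, 47, 48, 49, 50, 51]
freqs = [3, 3, 4, 3, 3, 0, 6, 4, 4]
n = sum(freqs)

# Relative frequency: fraction of all observations in each category
rel_freqs = [f / n for f in freqs]

# Cumulative relative frequency: fraction of observations <= each value
cum_rel_freqs = [c / n for c in accumulate(freqs)]

for value, rel, cum in zip(values, rel_freqs, cum_rel_freqs):
    print(f"{value}  {rel:.3f}  {cum:.3f}")
```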
  • 118. Characteristics and guidelines of table construction Characteristics • The table must be self-explanatory • The title should describe the content of the table and answer the questions what, where and when the data were collected • Percentages in each category should add up to 100 • Footnotes should be placed at the bottom of the table
  • 119. Guidelines • The shape and size of the table should provide the number of rows and columns required to accommodate the whole data • If a quantity is zero, it should be entered as zero; leaving a blank space or putting a dash in place of zero is confusing and undesirable • If two or more figures are the same, ditto marks should not be used in place of the original numerals • If any figure in a table has to be singled out for a particular purpose, it should be marked with an asterisk
  • 120. 2.3. Diagrammatic Representation 2.3.1. Importance of diagrammatic representation: 1.Diagrams have greater attraction than mere figures. They give delight to the eye, add a spark of interest and as such catch the attention as much as the figures dispel it. 2.They help in deriving the required information in less time and without any mental strain. 3.They have great memorizing value than mere figures. This is so because the impression left by the diagram is of a lasting nature. 4.They facilitate comparison 120
  • 121. Importance…. Well designed graphs can be an incredibly powerful means of communicating a great deal of information using visual techniques When graphs are poorly designed, they not only do not effectively convey your message, they often mislead and confuse. 121
  • 122. 2.3.2. Types 1. Bar graph • The bar diagram is the easiest and most adaptable general-purpose chart. • Though this type of chart can be used for any type of series, it is especially satisfactory for nominal and ordinal data. • The categories are represented on the base line (X-axis) at regular intervals and the corresponding frequencies or relative frequencies are represented on the Y-axis (ordinate) in the case of a vertical bar diagram, and vice versa in the case of a horizontal bar diagram.
  • 123. Method of constructing bar graph •All bars drawn in any single study should be of the same width •The different bars should be separated by equal distances •All the bars should rest on the same line called the base •It is better to construct a diagram on a graph paper Types of bar graph • 1.Simple bar graph: It is one-dimensional diagram in which the bar represents the whole of the magnitude. The height/length of each bar indicates the frequency of the figure represented. Example: Construct a bar graph for the following data 123
  • 124. Table __. Distribution of pediatric patients in X hospital ward by type of admitting diagnosis, Jan 2000
    Diagnosis          Number of patients   Relative freq. (%)
    Pneumonia          487                  48.7
    Malaria            200                  20.0
    Cardiac problems   168                  16.8
    Malnutrition       80                   8.0
    Others             65                   6.5
    Total              1000                 100
  • 125. 1. Simple bar graph… . 125
  • 126. 2.Sub-divided (component) bar graph   • It is also called segmented bar graph. If a given magnitude can be split up into subdivisions, or if there are different quantities forming the subdivisions of the totals, simple bars may be subdivided in the ratio of the various subdivisions to exhibit the relationship of the parts to the whole. • The order in which the components are shown in a "bar" is followed in all bars used in the diagram. 126
  • 128. 3. Multiple bar graph Multiple Bar diagrams can be used to represent the relationships among more than two variables. The following figure shows the relationship between children’s reports of breathlessness and cigarette smoking by themselves and their parents. 128
  • 129. 3. Multiple bar graph… 129
  • 130. 3. Multiple bar graph… • We can quickly see from the graph that the prevalence of the symptom increases both with the child's smoking and with that of their parents.
  • 131. 2. Pie chart A pie chart shows the relative frequency for each category by dividing a circle into sectors, the angles of which are proportional to the relative frequencies. Steps to construct a pie chart:  Construct a frequency table  Change each frequency into a percentage (P)  Change the percentages into degrees, where degrees = (P / 100) × 360°  Draw a circle and divide it accordingly
  • 132. 2. Pie chart… Example: Distribution of death for females, in England and Wales, 1989.
    Cause of death            Number of deaths
    Circulatory system (C)    100,000
    Neoplasm (N)              70,000
    Respiratory system (R)    30,000
    Injury & poisoning (I)    6,000
    Digestive system (D)      10,000
    Others (O)                20,000
    Total                     236,000
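The steps for constructing a pie chart can be sketched with the death counts from this example. Each sector angle is the category's share of the total multiplied by 360 degrees; the dictionary keys and variable names are illustrative.

```python
# Deaths among females, England and Wales, 1989 (from the example table)
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())

# Percentage of the total, then the sector angle in degrees
percentages = {cause: 100 * n / total for cause, n in deaths.items()}
angles = {cause: pct / 100 * 360 for cause, pct in percentages.items()}

for cause in deaths:
    print(f"{cause:<20} {percentages[cause]:5.1f}%  {angles[cause]:6.1f} degrees")
```

The angles should always add up to 360 degrees, which is a convenient check before drawing.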
  • 134. 3.Histogram Histograms are frequency distributions with continuous class interval that have been turned into graphs. To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line. Non-overlapping intervals that cover all of the data values must be used. Bars are then drawn over the intervals in such a way that the areas of the bars are all proportional in the same way to their interval frequencies. 134
  • 135. Example: Distribution of the RBC cholinesterase values (µmol/min/ml) obtained from 35 workers exposed to pesticides
    RBC cholinesterase (µmol/min/ml)   Frequency, n (%)   Cumulative frequency (%)
    5.95-7.95                          1 (2.9)            2.9
    7.95-9.95                          8 (22.9)           25.8
    9.95-11.95                         14 (40)            65.8
    11.95-13.95                        9 (25.7)           91.5
    13.95-15.95                        2 (5.7)            97.2
    15.95-17.95                        1 (2.9)            100
    Total                              35 (100)
    Source: Knapp RG, Miller MC III: Clinical Epidemiology and biostatistics
  • 136. 3. Histogram… [Figure: Histogram of the RBC cholinesterase values of 35 pesticide-exposed workers. X-axis: RBC cholinesterase (µmol/min/ml), class midpoints 6.95 to 16.95; Y-axis: number of pesticide-exposed workers.]
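As an illustration, the percentage and cumulative percentage columns can be recomputed directly from the class frequencies. In this sketch the cumulative percentages are derived from cumulative counts, so a few of them differ by 0.1 from the tabulated values, which appear to have been obtained by adding already rounded percentages.

```python
from itertools import accumulate

# Class frequencies for the six RBC cholinesterase intervals
freqs = [1, 8, 14, 9, 2, 1]
n = sum(freqs)

# Percentage of workers in each class
pct = [round(100 * f / n, 1) for f in freqs]

# Cumulative percentage computed from cumulative counts
cum_pct = [round(100 * c / n, 1) for c in accumulate(freqs)]

print(pct)
print(cum_pct)
```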
  • 137. 4.Frequency polygon A frequency distribution can be portrayed graphically in yet another way by means of a frequency polygon. •To draw a frequency polygon we connect the mid-point of the tops of the cells of the histogram by a straight line. •It can be also drawn without erecting rectangles as follows: The scale should be marked in the numerical values of the mid-points of intervals. Erect ordinates on the mid-point of the interval-the length or altitude of an ordinate representing the frequency of the class on whose mid-point it is erected. Join the tops of the ordinates and extend the connecting line to the scale of sizes. 137
  • 139. 5. Cumulative frequency polygon (ogive curve) Sometimes it may become necessary to know the number of items whose values are more or less than a certain amount. • We may, for example, be interested in knowing the number of patients whose weight is less than 50 kg or more than, say, 60 kg. • To get this information it is necessary to change the form of the frequency distribution from a ‘simple’ to a ‘cumulative’ distribution. • An ogive curve turns a cumulative frequency distribution into a graph.
  • 140. 5. Cumulative frequency polygon (ogive curve)… Example: Heart rate of patients admitted to Hospital B, 2000
    Heart rate (beats/min)   No. of patients   Cumulative freq. (less-than method)   Cumulative freq. (greater-than method)
    54.5-59.5                1                 1                                     54
    59.5-64.5                5                 6                                     53
    64.5-69.5                3                 9                                     48
    69.5-74.5                5                 14                                    45
    74.5-79.5                11                25                                    40
    79.5-84.5                16                41                                    29
    84.5-89.5                5                 46                                    13
    89.5-94.5                5                 51                                    8
    94.5-99.5                2                 53                                    3
    99.5-104.5               1                 54                                    1
    Total                    54
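A short sketch of the two cumulative columns used to draw ogives, built from the heart-rate frequencies in this example. The "less than" column is a running total from the first class downward; the "greater than" column is a running total from the last class upward. Variable names are illustrative.

```python
from itertools import accumulate

# Number of patients in each heart-rate class, lowest class first
freqs = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

# "Less than" method: patients below the upper boundary of each class
less_than = list(accumulate(freqs))

# "Greater than" method: patients above the lower boundary of each class
greater_than = list(accumulate(reversed(freqs)))[::-1]

print(less_than)
print(greater_than)
```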
  • 141. 5.Cumulative frequency polygon (ogive curve) … 141
  • 142. 6. Box-and-whisker plot It is another way to display information when the objective is to illustrate certain locations in the distribution. A box is drawn with the top of the box at the third quartile and the bottom at the first quartile. The location of the midpoint of the distribution is indicated with a horizontal line in the box. Finally, straight lines, or whiskers, are drawn from the center of the top of the box to the largest observation and from the center of the bottom of the box to the smallest observation. It is useful when one of the characteristics is qualitative and the other is quantitative.
  • 143. Eg: percentage supersaturation of bile by sex of patients
              Men                                Women
    Subject   Age   % Supersaturation  Subject   Age   % Supersaturation
    1         23    40                 1         40    65
    2         31    86                 2         33    86
    3         58    11                 3         49    76
    4         25    86                 4         44    89
    5         63    106                5         63    142
    6         43    66                 6         27    58
    7         67    123                7         23    98
    8         48    90                 8         56    146
    9         29    112                9         41    80
    10        26    52                 10        30    66
    11        64    88                 11        38    52
    12        55    137                12        23    35
    13        31    88                 13        35    55
    14        20    80                 14        50    127
    15        23    65                 15        47    77
    16        43    79                 16        36    91
    17        27    87                 17        74    128
    18        63    56                 18        53    75
    19        59    110                19        41    82
    20        53    106                20        25    89
    21        66    110                21        57    84
    22        48    78                 22        42    116
    23        27    80                 23        49    73
    24        32    47                 24        60    87
    25        62    74                 25        23    76
    26        36    58                 26        48    107
    27        29    88                 27        44    84
    28        27    73                 28        37    120
    29        65    118                29        57    123
    30        42    67
    31        60    57
  • 145. Box-and-whisker plot • The graphs indicate that the distributions of percentage saturation of bile in men and women are similar. • Again, we see that percentage saturation of bile is a bit more spread out among women, with a range of 35 to 146, but we also see that the mid-points of the distributions are almost the same and that most of the spread in values in women occurs in the upper half of the distribution.
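The five values that define a box-and-whisker plot (smallest observation, first quartile, median, third quartile, largest observation) can be sketched as follows. The data are the bile supersaturation percentages from the preceding table, with rows 4 and 5, which are garbled in the extracted slide, read as an assumption. Quartile conventions differ between textbooks and software; this sketch uses the "inclusive" method of Python's `statistics.quantiles`, so the box edges may differ slightly from a hand-drawn plot.

```python
import statistics

# % supersaturation of bile (rows 4 and 5 read from a garbled table row)
men = [40, 86, 11, 86, 106, 66, 123, 90, 112, 52, 88, 137, 88, 80, 65, 79,
       87, 56, 110, 106, 110, 78, 80, 47, 74, 58, 88, 73, 118, 67, 57]
women = [65, 86, 76, 89, 142, 58, 98, 146, 80, 66, 52, 35, 55, 127, 77,
         91, 128, 75, 82, 89, 84, 116, 73, 87, 76, 107, 84, 120, 123]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a box plot."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


men_summary = five_number_summary(men)
women_summary = five_number_summary(women)
print("Men:  ", men_summary)
print("Women:", women_summary)
```

Under this reading the medians are 80 for men and 84 for women, in line with the remark that the mid-points of the two distributions are almost the same.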
  • 146. 7.Scatter plot Most studies in medicine involve measuring more than one characteristic, and graphs displaying the relationship between two characteristics are common in the literature. • To illustrate the relationship between two characteristics when both are quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams). A scatter diagram is constructed by drawing X-and Y-axes. •Each observation is represented by a point or dot(•). •In the same study on percentage saturation of bile, information was collected on the age of each patient to see whether a relationship existed between the two measures, the following plot was displayed. 146
  • 147. 7.Scatter plot… The graph suggests the possibility of a positive relationship between age and percentage saturation of bile in women. 147
  • 148. 8. Line graph In this type of graph, we have two variables under consideration, as in a scatter diagram. • One variable is taken along the X-axis and the other along the Y-axis. • The points are plotted and joined by line segments in order. • These graphs depict the trend or variability occurring in the data. • Sometimes two or more graphs are drawn on the same graph paper using the same scale so that the plotted graphs are comparable. Example: The following graph shows the level of zidovudine (AZT) in the blood of AIDS patients at several times after administration of the drug, with normal fat absorption and with fat malabsorption.
  • 149. Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999. 149
  • 151. Measures of central tendency On the scale of values of a variable there is a certain stage at which the largest number of items tend to cluster. Since this stage is usually in the centre of distribution, the tendency of the statistical data to get concentrated at certain values is called “central tendency” The various methods of determining the actual value at which the data tends to concentrate are called measures of central tendency. 151
  • 152. Measures of central tendency… The most important objective of calculating a measure of central tendency is to determine a single figure which may be used to represent a whole series involving magnitudes of the same variable. In that sense it is an even more compact description of the statistical data than the frequency distribution. • Since a measure of central tendency represents the entire data, it facilitates comparison within one group or between groups of data.
  • 153. Measures of central tendency… Characteristics of a good measure of central tendency A measure of central tendency is good or satisfactory if it possesses the following characteristics. 1.It should be based on all the observations 2.It should not be affected by the extreme values 3.It should be as close to the maximum number of values as possible 4.It should have a definite value 5.It should not be subjected to complicated and tedious calculations 6.It should be capable of further algebraic treatment 7.It should be stable with regard to sampling 153
  • 154. Arithmetic mean ($\bar{x}$) The most familiar measure of central tendency (MCT) is the arithmetic mean (AM). It is also popularly known as the average. a) Ungrouped data: if $x_1, x_2, \ldots, x_n$ are $n$ observed values, then $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • 155. Arithmetic mean… b) Grouped data: in calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows: $\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$ where $k$ = the number of class intervals, $m_i$ = the mid-point of the $i$th class interval, and $f_i$ = the frequency of the $i$th class interval
  • 156. Arithmetic mean… Example. Mean = 2630/100 = 26.3 156
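Both versions of the arithmetic mean can be sketched in Python. The function names are illustrative; the ungrouped example uses the 30 daily temperatures and the grouped example uses the midpoints and frequencies of the grouped temperature table (N = 50) from the earlier slides, not the data behind the 2630/100 example, whose table is not reproduced here.

```python
def arithmetic_mean(values):
    """Mean of ungrouped data: sum of the observations divided by n."""
    return sum(values) / len(values)


def grouped_mean(midpoints, freqs):
    """Mean of grouped data: every value is assumed to sit at its class midpoint."""
    return sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)


temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]
ungrouped = arithmetic_mean(temps)

midpoints = [40, 43, 46, 49, 52, 55, 58]
freqs = [4, 6, 7, 9, 11, 7, 6]
grouped = grouped_mean(midpoints, freqs)

print(round(ungrouped, 2), round(grouped, 2))
```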
  • 157. Arithmetic mean… The arithmetic mean possesses the following properties. • Uniqueness: For a given set of data there is one and only one arithmetic mean. • Simplicity: The arithmetic mean is easily understood and easy to compute. • Center of gravity: The algebraic sum of the deviations of the given values from their arithmetic mean is always zero. • Sensitivity: The arithmetic mean possesses all the characteristics of a good measure of central tendency listed earlier except No. 2: it is greatly affected by extreme values. • In the case of grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated.
  • 158. The Median ($\tilde{x}$) a) Ungrouped data • The median of a finite set of values is the value that divides the set into two equal parts, such that the number of values greater than the median is equal to the number of values less than the median. • If the number of values is odd, the median is the middle value when all values have been arranged in order of magnitude. • When the number of observations is even, there is no single middle observation but two middle observations. • In this case the median is taken to be the mean of these two middle observations, when all observations have been arranged in order of magnitude.
  • 159. The Median… b) Grouped data • In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval. • The first step is to locate the class interval that contains the median. We use the following procedure. • Find n/2 and identify the first class interval whose cumulative frequency is at least n/2. • To find a unique median value, use the interpolation formula on the next slide.
  • 160. Median… Median = $L_m + \left(\frac{n/2 - F_c}{f_m}\right) W$ where $L_m$ = lower true class boundary of the interval containing the median, $F_c$ = cumulative frequency of the interval just before the median class interval (the row above it in the frequency table), $f_m$ = frequency of the interval containing the median, W = class interval width, and n = total number of observations.
  • 161. Median… Example: n/2 = 75/2 = 37.5; median class interval = 35-44; $L_m$ = 34.5, $F_c$ = 35, W = 10, n = 75, $f_m$ = 22. • Median = 34.5 + ((37.5 − 35)/22) × 10 = 35.64
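The interpolation step for the grouped median can be checked with a short Python sketch. It uses the values from the example above; the function and argument names are chosen here for illustration only.

```python
def grouped_median(lower_boundary, cum_freq_before, class_freq, width, n):
    """Median of grouped data by linear interpolation inside the median class.

    lower_boundary  : lower true class boundary of the median class (L_m)
    cum_freq_before : cumulative frequency of the preceding class (F_c)
    class_freq      : frequency of the median class (f_m)
    width           : class interval width (W)
    n               : total number of observations
    """
    if class_freq <= 0:
        raise ValueError("the median class must contain observations")
    return lower_boundary + (n / 2 - cum_freq_before) / class_freq * width


# Values from the slide example: L_m = 34.5, F_c = 35, f_m = 22, W = 10, n = 75.
median = grouped_median(34.5, 35, 22, 10, 75)
print(round(median, 2))  # 35.64
```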
  • 162. Properties of the median • There is only one median for a given set of data. • The median is easy to calculate. • The median is a positional average and hence is not drastically affected by extreme values. • The median can be calculated even when the distribution has open-ended intervals. • It is not a good representative of the data if the number of items is small.
  • 163. Mode ($\hat{x}$) a) Ungrouped data • The mode is the value that occurs most frequently in a set of values. • If all the values are different there is no mode; on the other hand, a set of values may have more than one mode. b) Grouped data • In designating the mode of grouped data, we usually refer to the modal class, which is the class interval with the highest frequency. • If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
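The ungrouped-data rules for the mode (possibly no mode, possibly several) can be sketched in Python. The helper `modes` and the sample lists are hypothetical illustrations, not part of the lecture.

```python
from collections import Counter


def modes(values):
    """Return all modes of ungrouped data as a sorted list.

    Follows the slide's convention: if every value occurs exactly once
    (and there is more than one distinct value), there is no mode.
    """
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # all values differ: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))     # one mode
print(modes([1, 1, 2, 2, 3]))  # two modes (bimodal)
print(modes([1, 2, 3]))        # no mode
```

Note that Python's built-in `statistics.multimode` would return every value for the last list, so the no-mode convention is handled explicitly here.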
  • 164. Properties of mode • It is not affected by extreme values. • It can be calculated for distributions with open-ended classes. • Often its value is not unique. • The main drawback of the mode is that it often does not exist.
  • 165. MEASURES OF POSITION Quartiles • Quartiles divide the ranked distribution into four equal parts. • The 25th percentile demarcates the first quartile (Q1), • the median or 50th percentile demarcates the second quartile (Q2), • the 75th percentile demarcates the third quartile (Q3), • and the 100th percentile demarcates the fourth quartile (Q4), which is the maximum observation. Q1 is the ¼(n+1)th ranked measurement, i.e. 25% of all the ranked observations are less than Q1. Q2 is the 2/4(n+1)th = ((n+1)/2)th ranked measurement, i.e. 50% of all ranked observations are less than Q2; its position (not its value) is twice that of Q1. Q3 is the ¾(n+1)th ranked observation, at three times the position of Q1; 75% of all the ranked observations are less than Q3.
  • 166. Percentile • Percentiles divide the ranked data into 100 equal parts. • The largest value in a set of data has 100% of the observations at or below it; viewed this way, it is the 100th percentile. • From this same perspective, the median, which has 50% of the observations at or below it, is the 50th percentile. • The pth percentile of a distribution is the value such that p percent of the observations are less than or equal to it. How it is computed depends on whether np/100 is an integer: if np/100 is not an integer, it is the (k+1)th observation in ascending order, where k is the largest integer less than np/100; if np/100 is an integer, it is the average of the (np/100)th and (np/100 + 1)th observations in ascending order.
  • 167. Percentiles… Example: The following data are a sample of birth weights (grams) of live births at a hospital during a one-week period. 3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759, 3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031. Calculate the 10th and 90th percentiles. Solution: n = 20; p = 10 and 90. First put the data in ascending order: 2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146. For the 10th percentile, np/100 = 20 × 10/100 = 2, which is an integer, so the 10th percentile is the average of the 2nd and 3rd ordered observations: (2581 + 2759)/2 = 2670 grams. For the 90th percentile, np/100 = 20 × 90/100 = 18, which is an integer, so the 90th percentile is the average of the 18th and 19th ordered observations: (3609 + 3649)/2 = 3629 grams.
  • 168. Percentiles… • Therefore, we would say that about 80 percent of the birth weights fall between 2670 g and 3629 g, which gives us an overall feel for the spread of the distribution. • The most commonly used percentiles other than the median (50th percentile) are the 25th percentile and the 75th percentile.
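The percentile rule and the birth-weight example can be reproduced with a short Python sketch. The function name `percentile` is chosen here, and the sketch assumes p is a whole number strictly between 0 and 100. Statistical packages often use other interpolation rules, so their results can differ slightly from this one.

```python
def percentile(data, p):
    """pth percentile using the np/100 rule described in the slides.

    Assumes p is an integer with 0 < p < 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)
    n = len(xs)
    k, remainder = divmod(n * p, 100)
    if remainder == 0:
        # np/100 is an integer: average the kth and (k+1)th ordered values.
        return (xs[k - 1] + xs[k]) / 2
    # np/100 is not an integer: take the (k+1)th ordered value,
    # where k is the largest integer less than np/100.
    return xs[k]


birth_weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
                 3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

p10 = percentile(birth_weights, 10)
p90 = percentile(birth_weights, 90)
print(p10, p90)  # 2670.0 3629.0
```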
  • 169. Measures of variability • A measure of central tendency alone is not enough to give a clear idea of the distribution of the data. • Two or more data sets may have the same mean and/or median and still be quite different. • Thus, to have a clear picture of the data, one also needs a measure of dispersion or variability (scatter) among the observations in the set.
  • 170. Range (R) R = $X_L - X_S$, where $X_L$ is the largest value and $X_S$ is the smallest value. Properties • It is the simplest measure and is easily understood. • It takes into account only two values, which makes it a poor measure of dispersion.
  • 171. Interquartile range (IQR) IQR = Q3 − Q1, where Q3 is the third quartile and Q1 is the first quartile. Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg respectively. The interquartile range is therefore IQR = 10.2 kg − 8.8 kg = 1.4 kg, i.e. 50% of infant girls at 12 months weigh between 8.8 and 10.2 kg.
  • 173. Interquartile… • Generally, we use the interquartile range to describe variability when we use the median as the measure of central location. We use the standard deviation, which is described in the next section, when we use the mean. Properties • It is a simple and versatile measure. • It encloses the central 50% of the observations. • It is not based on all observations but only on two specific values. • It is important in selecting cut-off points in the formulation of clinical standards. • Since it excludes the lowest and highest 25% of values, it is not affected by extreme values. • It is not capable of further algebraic treatment.
  • 174. Quartile deviation (QD) $QD = \frac{Q_3 - Q_1}{2}$ Coefficient of quartile deviation (CQD) $CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$ CQD is a pure number (unitless) and is useful for comparing the variability among the middle 50% of observations.
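As a rough illustration, the quartile-based measures can be computed for the infant-weight quartiles quoted earlier (Q1 = 8.8 kg, Q3 = 10.2 kg). The sketch assumes the usual textbook definitions QD = (Q3 − Q1)/2 and CQD = (Q3 − Q1)/(Q3 + Q1); the function name is hypothetical.

```python
def quartile_measures(q1, q3):
    """Return (IQR, quartile deviation, coefficient of quartile deviation)."""
    if q3 < q1:
        raise ValueError("Q3 must not be smaller than Q1")
    iqr = q3 - q1
    qd = iqr / 2           # half the interquartile range, same unit as the data
    cqd = iqr / (q3 + q1)  # unitless, so comparable across variables
    return iqr, qd, cqd


iqr, qd, cqd = quartile_measures(8.8, 10.2)
print(round(iqr, 2), round(qd, 2), round(cqd, 4))
```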
  • 175. Mean deviation (MD) • Mean deviation is the average of the absolute deviations taken from a central value, generally the mean or median. • Consider a set of n observations $x_1, x_2, \ldots, x_n$. Then $MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - A|$, where A is a central value (arithmetic mean or median).
  • 176. Mean deviation… Properties • MD overcomes one main objection to the earlier measures: it uses every value. • It is not affected much by extreme values. • Its main drawback is that the algebraic negative signs of the deviations are ignored, which is mathematically unsound. • MD is at its minimum when the deviations are taken from the median.
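A small Python sketch of the mean deviation, using a made-up data set with one extreme value, illustrates the last property (MD is smallest when taken about the median). The data and function name are assumptions for illustration only.

```python
import statistics


def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen central value."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(abs(x - center) for x in values) / len(values)


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]

md_about_mean = mean_deviation(data, statistics.mean(data))
md_about_median = mean_deviation(data, statistics.median(data))
print(md_about_mean, md_about_median)
```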
  • 177. The Variance ($\sigma^2$, $S^2$) • The main objection to the mean deviation, that the negative signs are ignored, is removed by squaring the deviations from the mean. • The variance is the average of the squared deviations taken from the mean.
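  • 178. Variance… a) Ungrouped data Let $X_1, X_2, \ldots, X_N$ be the measurements on N population units; then the population variance is $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$, where $\mu$ is the population mean.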
  • 179. Variance… The sample variance of the set $x_1, x_2, \ldots, x_n$ of n observations is: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
  • 181. Variance… Properties • The main demerit of the variance is that its unit is the square of the unit of measurement of the variable. • The variance gives more weight to extreme values than to those near the mean, because the differences are squared. • These drawbacks of the variance are overcome by the standard deviation.
  • 182. Standard deviation ($\sigma$, S) It is the positive square root of the variance. Properties • The standard deviation is considered to be the best measure of dispersion and is widely used because of the properties of the theoretical normal curve. • There is, however, one difficulty with it: if the units of measurement of the variables in two series are not the same, their variability cannot be compared by comparing the values of the standard deviation.
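The variance and standard deviation can be sketched in Python as below. The data set is hypothetical, and the sketch assumes the standard definitions (divisor N for a population, n − 1 for a sample); the results are cross-checked against Python's `statistics` module.

```python
import math
import statistics


def population_variance(values):
    """Average squared deviation from the mean (divisor N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)
samp_var = sample_variance(data)
samp_sd = math.sqrt(samp_var)  # standard deviation: positive square root
print(pop_var, round(samp_var, 3), round(samp_sd, 3))
```

The standard deviation is in the same unit as the data, which is why it is usually reported instead of the variance.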
  • 183. Coefficient of variation • When we want to compare the variability in two sets of data, the standard deviation, which measures absolute variation, may lead to false results. • The coefficient of variation (CV) gives relative variation and is the best measure for comparing the variability in two sets of data. Do not rely on the SD alone to compare variability between groups. • CV = standard deviation / mean (commonly multiplied by 100 and expressed as a percentage)
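The point about absolute versus relative variation can be illustrated with a short Python sketch. The weights are hypothetical, and the CV is returned as a plain ratio (multiply by 100 for a percentage). The same measurements recorded in kilograms and in grams have very different standard deviations but the same CV.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (a unitless ratio)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical weights, recorded once in kg and once in g.
weights_kg = [10, 20, 30]
weights_g = [10000, 20000, 30000]

cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)
print(statistics.stdev(weights_kg), statistics.stdev(weights_g))  # differ by a factor of 1000
print(cv_kg, cv_g)  # identical relative variation
```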
  • 184. 4. Basic Probability and probability distributions • Probability is a mathematical technique for predicting outcomes. It predicts how likely it is that specific events will occur. • An understanding of probability is fundamental for quantifying the uncertainty that is inherent in the decision-making process. • Probability theory also allows us to draw conclusions about a population of patients based on known information about a sample of patients drawn from that population.
  • 185. Basic Probability… • Mutually exclusive events: events that cannot occur together. For example, event A = “Male” and event B = “Pregnant” are two mutually exclusive events (as no males can be pregnant). • Independent events: the presence or absence of one does not alter the chance of the other being present. One event happens regardless of the other, and its outcome is not related to the other. • Probability: if an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a characteristic E, the probability of the occurrence of E is P(E) = m/N.
  • 186. 4.1. Properties of probability 1. A probability value must lie between 0 and 1: 0 ≤ P(E) ≤ 1. A probability can never be more than 1.0, nor can it be negative. • A value of 0 means the event cannot occur. • A value of 1 means the event definitely will occur. • A value of 0.5 means that the probability that the event will occur is the same as the probability that it will not occur. • Probability is measured on a scale from 0 to 1.0, as shown in the following figure of the probability scale.
  • 188. Properties… 2. The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to 1: P(E1) + P(E2) + ... + P(En) = 1. 3. For any two events A and B, P(A or B) = P(A) + P(B) − P(A and B) (addition rule). For two mutually exclusive events A and B, P(A or B) = P(A) + P(B). 4. For any two independent events A and B, P(A and B) = P(A) × P(B) (multiplication rule).
  • 189. Properties… • The multiplication rule gives the probability that two independent events A and B both happen. For example, if you have two identical packs of cards (pack A and pack B), what is the probability of drawing the ace of spades from both packs? • Formula: P(A) × P(B). P(pack A) = 1 card from a pack of 52 cards = 1/52 = 0.0192. P(pack B) = 1 card from a pack of 52 cards = 1/52 = 0.0192. P(A) × P(B) = 0.0192 × 0.0192 = 0.00037. 5. If A′ is the complementary event of the event A, then P(A′) = 1 − P(A).
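The card example and the complement rule can be verified with exact fractions in Python. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Probability of drawing the ace of spades from one full pack of 52 cards.
p_pack_a = Fraction(1, 52)
p_pack_b = Fraction(1, 52)

# Multiplication rule for independent events: P(A and B) = P(A) * P(B).
p_both_aces = p_pack_a * p_pack_b
print(p_both_aces, round(float(p_both_aces), 5))  # 1/2704 0.00037

# Complement rule: P(A') = 1 - P(A).
p_not_ace_a = 1 - p_pack_a
print(p_not_ace_a)  # 51/52
```

Using `Fraction` avoids the small rounding error that comes from multiplying the rounded decimals 0.0192 × 0.0192.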
  • 190. Example • A study investigated the effect of prolonged exposure to bright light on retinal damage in premature infants. Eighteen of 21 premature infants exposed to bright light developed retinopathy, while 21 of 39 premature infants exposed to a reduced light level developed retinopathy. For this sample, the probability of developing retinopathy is: P(Retinopathy) = (No. of infants with retinopathy) / (Total No. of infants) = (18 + 21)/(21 + 39) = 39/60 = 0.65
  • 191. Example… • The following data are the results of electrocardiograms (ECGs) and radionuclide angiocardiograms (RAs) for 19 patients with post-traumatic myocardial contusions. A “+” indicates an abnormal result and a “−” indicates a normal result. • 1. Calculate the probability that both the ECG and the RA are abnormal. • 2. Calculate the probability that either the ECG or the RA is abnormal.
  • 192. Example (table of ECG and RA results for the 19 patients; not reproduced in this transcript)
  • 193. Example Solutions 1. P(ECG abnormal and RA abnormal) = 7/19 = 0.37 2. P(ECG abnormal or RA abnormal) = P(ECG abnormal) + P(RA abnormal) − P(both ECG and RA abnormal) = 17/19 + 9/19 − 7/19 = 19/19 = 1 • NB: We cannot calculate the above probability by simply adding the number of patients with abnormal ECGs to the number with abnormal RAs, i.e. (17 + 9)/19 = 1.37, which exceeds 1. • The problem is that the 7 patients whose ECGs and RAs are both abnormal are counted twice.
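The addition rule in this solution can be checked with a short Python sketch that uses only the counts quoted above (19 patients, 17 abnormal ECGs, 9 abnormal RAs, 7 with both abnormal). The variable names are chosen here for illustration.

```python
from fractions import Fraction

n_patients = 19
ecg_abnormal = 17
ra_abnormal = 9
both_abnormal = 7

p_ecg = Fraction(ecg_abnormal, n_patients)
p_ra = Fraction(ra_abnormal, n_patients)
p_both = Fraction(both_abnormal, n_patients)

# Addition rule: subtract the overlap so it is not counted twice.
p_either = p_ecg + p_ra - p_both
print(p_either)  # 1

# Naive sum without the correction gives an impossible "probability" above 1.
naive_sum = p_ecg + p_ra
print(float(naive_sum))
```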