Adv.-Statistics-2.pptx

APPROPRIATE
STATISTICAL TOOLS IN
SOCIAL RESEARCH

Advanced Statistics
The course deals with parametric and non-parametric
statistics. It covers tests of association such as
Spearman rho, the phi coefficient, the contingency coefficient, and the biserial correlation;
tests of hypotheses about two independent groups such as
the two-independent-samples t-test, Mann-Whitney U, and Wilcoxon W;
tests of hypotheses about three or more independent groups
such as one-way ANOVA, Kruskal-Wallis, and the Jonckheere-Terpstra
test; tests of hypotheses about repeated measures such as the
paired t-test and the sign test; the chi-square test of association; and
statistical power analysis. It includes applications and data
analysis with computations carried out using SPSS.

Objectives of the Course
 Familiarize the students with fundamental topics in Statistics such as
descriptive statistics, inferential statistics, parametric statistics, and non-
parametric statistics
 For the students to be able to analyze and interpret sets of
measurements / data by applying any of the topics included in the
course outline.

COURSE OUTLINE
1. Measurements
2. Sampling
3. Summation Notation
4. Frequency Distribution Table
5. Measure of Central Location
6. One-Sample Tests
7. Two-Sample Tests
8. More than Two-Sample Tests
9. Regression and Correlation
10. Chi-Square Test

Statistics
Statistics (as a discipline) is the scientific method of collecting,
organizing, summarizing, presenting and analyzing data for the
purpose of drawing (valid) conclusion(s) and making (reasonable)
recommendations
Statistics is both a SCIENCE (it follows systematic procedures) and an ART
(it requires judgment in how the procedures are used, e.g. in research).
In the plural sense, statistics refers to a mass of data (wherever there are data, there is Statistics).

There are three kinds of
lies ….
1. LIES!
2. DAMNED LIES!!
and
3. STATISTICS!!!
…. Benjamin Disraeli

Data Gathering
 Objective method. Data are gathered by measurement or
direct observation (e.g. measuring the weight of 1000 heads
of cabbage). These data are classified as primary data.
 Subjective method. Data are provided by respondents (e.g.
data on the amount of rice harvest provided by all farmers in
Nueva Vizcaya). Because they are collected first-hand from the respondents, these data are classified as primary data.
 Use of existing records. Data are gathered from information previously
collected by some persons or institutions (e.g.
rice yield records in Nueva Vizcaya for the past 20 years
obtained from the Bureau of Agricultural Statistics). These
data are classified as secondary data.

Two Phases of Statistics
1. Descriptive statistics. Deals with the methods of collecting,
organizing, summarizing, and presenting data, and with their interpretation.
2. Inferential statistics. Concerned with making generalizations about
a larger set of data (the population) when only a part of it (the sample) is examined.
o Estimation. The objective of estimation is to come up with a
value or a range of values, computed from the sample, that is
taken as an estimate of the corresponding characteristic of the
population from which the sample was drawn.
o Hypothesis Testing. A statistical procedure for deciding whether to
reject or not reject a hypothesis (about population characteristics)
on the basis of a sample.

Levels of Measurements
Measurement is the process of assigning numbers to observations in
such a way that the numbers are amenable to analysis by manipulation
or operation according to certain rules. There are four levels of
measurements: nominal, ordinal, interval, and ratio and each level
determines the appropriate statistical tool or procedure that can be
applied to the set of data in that particular level of measurement (Table
1).

 Nominal level. Values are simple labels or categories or
names without implied ordering or hierarchy in the
labels (e.g. Tax Identification Number or TIN, gender,
civil status, race or color)
 Ordinal level. Values are simply labels with an implied
ordering or hierarchy in the labels. The distance
between two labels, however, is unknown (e.g. sizes of
shirts, job hierarchy, income levels)

 Interval level. Values can be ordered or arranged
according to magnitude or hierarchy; distance between
two values is known; can add / subtract but cannot
multiply / divide; the zero point is arbitrary (e.g.
Intelligence Quotient or IQ, temperature in °F or °C)
 Ratio level. Values have all the properties of the interval
level. In addition, values can be multiplied or divided and
the zero point is fixed (e.g. age, height, area, mass,
length)

Table 1. Four levels of measurement and the statistical tools appropriate for each level
LEVEL | DEFINING RELATIONS | EXAMPLES OF APPROPRIATE STATISTICS | APPROPRIATE STATISTICAL TEST
Nominal | Equivalence | Mode; Frequency; Contingency coefficient | Non-parametric
Ordinal | Equivalence; Hierarchy / order | Median; Percentile; Spearman ρ; Kendall τ; Kendall W | Non-parametric
Interval | Equivalence; Known ratio of two intervals | Mean; Standard deviation; Pearson r; Multiple R | Non-parametric; Parametric
Ratio | Equivalence; Known ratio of two intervals; Known ratio of two scale values | Mean; Standard deviation; Pearson r; Multiple R; Geometric mean; Coefficient of variation | Non-parametric; Parametric

Universe, Population, Sample and Variables
A researcher would like to study the characteristics of poor
households in the province of Nueva Vizcaya as of December 31, 2022.
Included in the study are the following:
 measurements on annual household income
 household head's highest educational attainment
 employment status (employed, unemployed) of the
household head, and
 household size
Because of time constraints to complete the study, the
researcher obtained measurements on 50 randomly selected poor
households in the province.

Definition / specification of terms
Universe: All poor households in the province of Nueva Vizcaya as of December 31, 2022
Variables:
 Annual household income
 Highest educational attainment of household heads
 Employment status of household heads
 Household size
Population: The population for each variable is as follows:
• Annual household income - poor households
• Highest educational attainment - poor household heads
• Employment status - poor household heads
• Household size - poor households
Sample: The 50 randomly selected poor households in the province of Nueva Vizcaya

Sampling  Inferential statistics involves the process of drawing
out inferences or generalization from the sample
which is the basis in the formulation of conclusions
about the population. The accuracy of these inferences
/ conclusions depends to a large extent upon the
representativeness of the sample. A representative
sample exhibits most, if not all the properties of the
population or, in other words, a representative
sample is a miniature of the population. When
drawing a sample from a population, the two basic
questions that must be addressed are:
• What is the size of the sample?, and
• How is each member of the sample selected?

Sample size  The reasons why a sample is considered in any research undertaking are
the following: (a) to save resources (4Ms – man, materials, machineries
and money; time; and effort), (b) smaller volume of data to deal with,
thus making analysis and interpretation easier, and (c) to overcome the
problem of dealing with members of the population which are
inaccessible.
 In determining the sample size, the following must be considered:
• Population size. The bigger the population, the bigger the sample
needed.
• Margin of error (sampling error). This is the amount of error tolerated
because a sample, rather than the whole population, is observed. In
random sampling, how well the sample represents the population is
subject to chance. The smaller the margin of error allowed, the more
members of the population should be selected, and hence the larger
the sample size. (In fact, if no margin of error is allowed, the whole
population should be used.)

Sample Size
The sample size is determined using Slovin's formula,
$$n = \frac{N}{1 + Ne^{2}}$$
where n is the sample size, N is the population size, and e is
the desired margin of error (expressed as a decimal).

Example 1. With a margin of error of 5%, what is the sample
size n for a population of size N = 5000?
$$n = \frac{N}{1 + Ne^{2}} = \frac{5000}{1 + 5000(0.05)^{2}} = 370$$
If e = 1%, then
$$n = \frac{N}{1 + Ne^{2}} = \frac{5000}{1 + 5000(0.01)^{2}} = 3333$$
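Slovin's formula, n = N / (1 + N e^2), is easy to check by machine. The following is a minimal Python sketch, not part of the original slides; the function name `slovin` is an illustrative choice. It reproduces the two sample sizes of Example 1 (N = 5000 at e = 5% and e = 1%).

```python
def slovin(population_size, margin_of_error):
    """Return the unrounded sample size n = N / (1 + N * e**2)."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be a decimal between 0 and 1")
    return population_size / (1 + population_size * margin_of_error ** 2)


# Example 1: N = 5000 at a 5% and a 1% margin of error
for e in (0.05, 0.01):
    print(f"e = {e:.2f}: n = {round(slovin(5000, e))}")
```

The function returns the unrounded value so that the caller decides how to round; Example 1 rounds to the nearest whole number.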

Example 2. During the second semester of SY 2021-2022, the
distribution of enrolment of the College of
Engineering is as follows:
COURSE | MALE | FEMALE | TOTAL
BSAE | 94 | 54 | 148
BSCE | 79 | 27 | 106
TOTAL | 173 | 81 | 254
Using a 5% margin of error, determine the sample size and its allocation employing
Proportional Stratified Random Sampling.

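Solution:
1. If the grand total (population size) and subtotals are not
given, compute each.
2. Compute the sample size n with the desired margin of
error.
$$n = \frac{N}{1 + Ne^{2}} = \frac{254}{1 + 254(0.05)^{2}} = 155$$
3. Compute the number for each subgroup by keeping the subgroup's share of the sample equal to its share of the population.
(a) Male-BSAE
$$\frac{N}{n} = \frac{94}{n_{M\text{-}AE}}; \quad n_{M\text{-}AE} = \frac{n}{N}(94) = \frac{155}{254}(94) = 57$$
(b) Female-BSAE
$$\frac{N}{n} = \frac{54}{n_{F\text{-}AE}}; \quad n_{F\text{-}AE} = \frac{n}{N}(54) = \frac{155}{254}(54) = 33$$
(c) Male-BSCE
$$\frac{N}{n} = \frac{79}{n_{M\text{-}CE}}; \quad n_{M\text{-}CE} = \frac{n}{N}(79) = \frac{155}{254}(79) = 48$$
(d) Female-BSCE
$$\frac{N}{n} = \frac{27}{n_{F\text{-}CE}}; \quad n_{F\text{-}CE} = \frac{n}{N}(27) = \frac{155}{254}(27) = 17$$
(With n = 155 the last quotient is 16.48; using the unrounded n = 155.35 gives 16.51, which rounds to 17, so the four subgroups add up to 57 + 33 + 48 + 17 = 155.)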
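The proportional allocation in Example 2 can be sketched in a few lines of Python. This is an illustration rather than the slides' own procedure: the names `slovin` and `proportional_allocation` are hypothetical, and the sketch assumes each stratum is allocated from the unrounded Slovin sample size and then rounded to the nearest whole number.

```python
def slovin(population_size, margin_of_error):
    """Unrounded Slovin sample size n = N / (1 + N * e**2)."""
    return population_size / (1 + population_size * margin_of_error ** 2)


def proportional_allocation(strata, margin_of_error):
    """Allocate the sample to each stratum in proportion to its size."""
    total = sum(strata.values())
    n_exact = slovin(total, margin_of_error)
    # Each stratum keeps the same share of the sample as of the population.
    return {name: round(size * n_exact / total) for name, size in strata.items()}


# Enrolment figures from Example 2
enrolment = {"Male-BSAE": 94, "Female-BSAE": 54, "Male-BSCE": 79, "Female-BSCE": 27}
allocation = proportional_allocation(enrolment, 0.05)
print(allocation, sum(allocation.values()))
```

Rounding each stratum separately does not guarantee in general that the parts add up to the overall sample size; for these figures they do.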

Cluster Sampling. This method of sampling is convenient to use
when the population is spread over a wide geographic area. In
cluster sampling, groups, not individuals are randomly selected.
Example 3
 The population of all fifth year Agricultural Engineering students
in the country is 600
 For a margin of error of 5%, the desired sample size is 240
 A logical cluster is the Agricultural Engineering institution.
Suppose there are 30 such institutions in the Philippines
with an average population of 20 fifth year agricultural
engineering students each.
 The number of clusters (Agricultural Engineering institutions)
needed is 12 (240/20)
 Therefore, 12 Agricultural Engineering institutions will be
randomly selected from the 30 nationwide.
 All the fifth year students in these 12 institutions will be included
in the sample.

Summation Notation
$$\sum_{k=1}^{n} a_k = a_1 + a_2 + a_3 + \cdots + a_n$$

Terminology
 The Greek letter Σ (capital sigma) indicates a sum and is referred to as the
summation operator.
 k is referred to as the index of summation (or summation
variable).
 $a_k$ is referred to as the k-th term of the sum
 The numbers 1 and n are the lower and upper limits of the
summation, respectively

Example - Evaluate
$$\sum_{k=1}^{4} k^{2}(k-3)$$
Here $a_k = k^{2}(k-3)$.
The lower and upper limits are 1 and 4.
$$\sum_{k=1}^{4} k^{2}(k-3) = 1^{2}(1-3) + 2^{2}(2-3) + 3^{2}(3-3) + 4^{2}(4-3) = -2 - 4 + 0 + 16 = 10$$
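Summation notation maps directly onto a loop over the index. As a hedged illustration (the helper name `summation` is an assumption, not from the slides), the Python sketch below evaluates the sum of k^2 (k - 3) for k = 1 to 4 term by term.

```python
def summation(term, lower, upper):
    """Add term(k) for every integer k from lower to upper, inclusive."""
    return sum(term(k) for k in range(lower, upper + 1))


def a(k):
    """k-th term of the example sum."""
    return k ** 2 * (k - 3)


terms = [a(k) for k in range(1, 5)]  # the four individual terms
total = summation(a, 1, 4)
print(terms, total)
```

The upper limit is inclusive in the mathematical notation, which is why the helper uses `upper + 1` in `range`.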

Basic Ideas
 As with functions, the letter used to denote
the index of summation is immaterial:
$$\sum_{k=1}^{4} k^{2}(k-3) = \sum_{j=1}^{4} j^{2}(j-3) = \sum_{i=1}^{4} i^{2}(i-3) = 10$$
 The index of summation need not start at 1; for example, a sum may run
from i = 3 to 20, or from j = 0 to 5,000, et cetera.

Why Use Summation Notation
Summation notation allows us to write
mathematical expressions compactly.

Properties for Summation
1. $\sum_{k=1}^{n} c\,a_k = c \sum_{k=1}^{n} a_k$
2. $\sum_{k=1}^{n} (a_k + b_k) = \sum_{k=1}^{n} a_k + \sum_{k=1}^{n} b_k$
3. $\sum_{k=1}^{n} (a_k - b_k) = \sum_{k=1}^{n} a_k - \sum_{k=1}^{n} b_k$

Properties (cont'd)
4. $\sum_{k=1}^{n} c = cn$
5. $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$
6. $\sum_{k=1}^{n} k^{2} = \frac{n(n+1)(2n+1)}{6}$

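The closed forms for the sum of a constant, of the first n integers, and of the first n squares can be checked numerically against a brute-force sum. A minimal Python sketch follows; the function name `check_closed_forms` and the constant c = 3 are illustrative assumptions.

```python
def check_closed_forms(n, c=3):
    """Compare brute-force sums with the closed forms for a given n."""
    ks = range(1, n + 1)
    assert sum(c for _ in ks) == c * n                               # sum of a constant
    assert sum(ks) == n * (n + 1) // 2                               # sum of k
    assert sum(k * k for k in ks) == n * (n + 1) * (2 * n + 1) // 6  # sum of k^2
    return True


results = [check_closed_forms(n) for n in (1, 4, 10, 100)]
print(results)
```

A numerical check for a few values of n is not a proof, but it is a quick guard against misremembering a formula.
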
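Exercises
Calculate the sums indicated below:
1. A sum over the index i with upper limit 8
2. A sum over the index m
3. A sum over the index j running from 0 to 4
(Only the indices and limits of Exercises 1 to 3 are legible; the summands are not.)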

More Exercises
Write the sum using summation notation
4. $1 + 2 + 3 + \cdots + 100$
5. $\frac{10}{15} + \frac{11}{17} + \frac{12}{19} + \cdots + \frac{n+10}{2n+15}$

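Still More Exercises
Find the sum
6. A sum over the index k with upper limit 1,000
7. A sum over the index i running from 1 to 99
(Only the indices and limits of Exercises 6 and 7 are legible; the summands are not.)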

The Last of the Exercises
8. Re-index the sum in Exercise 2 to run
from 0 to 2
9. Re-index the sum in Exercise 3 to run
from 1 to 5

SAMPLE PROBLEM
One-Sample Test: z-test
A random sample of 50 rice crop plots was found out to
have a mean projected yield of 106.2 cav/ha. Would this
mean that the mean yield is significantly higher than the
observed average yield of 100 cav/ha with a standard
deviation of 11 cav/ha?

ONE-SAMPLE z-test
Solution to problem
1. Ho: µ = 100 cav/ha
Ha: µ > 100 cav/ha
2. Use z-test (n ≥ 30; population standard deviation σ is known)
3. Use α = 5%
4. Value of the test criterion
$$z_c = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{106.2 - 100}{11/\sqrt{50}} = 3.99$$
where σ = population standard deviation; x̄ = sample mean; µ = population mean; and n
= no. of subjects in the sample
5. ztab = z0.05 = 1.645 (one-tailed)
6. Since zc > ztab, reject Ho
∴ The sample mean yield is significantly higher than the population
mean yield
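The six steps of the one-sample z-test can be traced in a short Python sketch. It is an illustration under stated assumptions, not the slides' SPSS workflow: the function name `one_sample_z` is hypothetical, and the critical value and p-value come from the standard library's `statistics.NormalDist` instead of a printed z-table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Test criterion z = (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))


# Rice yield problem: n = 50, x_bar = 106.2, mu = 100, sigma = 11 (cav/ha)
z_c = one_sample_z(106.2, 100, 11, 50)

alpha = 0.05
z_tab = NormalDist().inv_cdf(1 - alpha)  # one-tailed (upper) critical value
p_value = 1 - NormalDist().cdf(z_c)      # upper-tail probability of z_c
reject_h0 = z_c > z_tab

print(f"z_c = {z_c:.4f}, z_tab = {z_tab:.3f}, p = {p_value:.6f}, reject Ho: {reject_h0}")
```

Comparing `z_c` with `z_tab` and comparing `p_value` with `alpha` are two views of the same decision rule, so they always agree.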

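How to determine the z tabular value (z-tab)
 For α = 0.05 (level of significance: the probability of rejecting a true
null hypothesis)
 For one-tailed test:
Figure: standard normal curve with the acceptance region (area 0.95) to the left of the critical point ztab = z0.05 = 1.645 and the rejection region (area 0.05) to its right; the computed zc = 3.99 falls in the rejection region.
 The confidence level is 1 − α = 1 − 0.05 = 0.95. This cumulative area is the quantity labelled β in the z-table; it is not the power of the test, which is 1 minus the probability of a Type II error.
 Locate the area 0.95 in the body of the z-table and read off the corresponding value of ztab (critical point)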

z | Area under the normal curve, β
1.64 (Y1) | 0.9495 (X1)
z0.05 (Y) | 0.9500 (X)
1.65 (Y2) | 0.9505 (X2)
 Compute for Y (by interpolation)
$$Y = Y_1 + (Y_2 - Y_1)\frac{X - X_1}{X_2 - X_1}$$
$$Y = 1.64 + (1.65 - 1.64)\frac{0.9500 - 0.9495}{0.9505 - 0.9495}$$
$$Y = z_{0.05} = 1.645$$
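The interpolation between the two neighbouring z-table entries can be written as a small Python helper. This is a sketch only; the function name `interpolate` and its argument order are assumptions made for illustration.

```python
def interpolate(x, x1, y1, x2, y2):
    """Linear interpolation: Y = Y1 + (Y2 - Y1) * (X - X1) / (X2 - X1)."""
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)


# Table areas 0.9495 and 0.9505 bracket the target area 0.9500
z_tab = interpolate(0.9500, 0.9495, 1.64, 0.9505, 1.65)
print(round(z_tab, 3))
```

Because the target area lies exactly halfway between the two tabled areas, the result is the midpoint of 1.64 and 1.65.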

 For two-tailed test
 The cumulative area to locate is 1 − α/2 = 1 − 0.05/2 = 0.9750
 Locate the area 0.9750 in the z-table
z | Area under the normal curve, β
1.96 | 0.9750 (exact, no need for interpolation)

Alternatively, the following steps can be used in order to arrive at a
decision:
 Inspect the z-table and locate the tabled area for zc = 3.99
 The highest z listed in the z-table is 3.09, with a corresponding table
value (cumulative area β) of 0.9990
 The upper-tail probability at z = 3.09 is therefore (1 – 0.9990) = 0.001.
Take note that as you move downwards and to the right of the z-
table, the tabled value increases and approaches 1.00; hence
the upper-tail probability for zc = 3.99 (the p-value) is less than
0.001
 Since the p-value is smaller than the level of significance
α = 0.05, reject Ho
 Conclusion: The sample mean yield is significantly higher than the
population mean yield

SAMPLE PROBLEMS
One-Sample Test: z-test
 The average rating of 140 BSEd graduates in the 2015
Licensure Examination for Teachers of University X is 65.40%.
During the same period, the national average rating is 68.56%
with a standard deviation of 12.66%. Is the claim of the
President justified that his University is performing poorly as
compared to other HEIs offering the same program?
 The mean weight of the baggage carried into an airplane by
individual passengers at Tuguegarao City Airport is 19.8 kg.
An airport authority representative takes a random sample of
110 passengers and obtained a mean weight of 18.5 kg with a
standard deviation of 8.5 kg. Test the claim at 1% level of
significance.

SAMPLE PROBLEM
t-test, one-sample test
The average length of time for students to register for summer
classes at a certain college has been 50 minutes with a standard
deviation of 10 minutes. A new registration procedure using modern
computing machines is being tried. If a random sample of 12 students
had an average registration time of 42 minutes with a standard deviation
of 11.9 minutes under the new system, test the hypothesis that the
population mean is now less than 50 minutes, using a level of
significance of (a) 0.05, and (b) 0.01. Assume the population of times to
be normal.

Reminders on using t-test …
When the standard deviation of the sample is substituted
for the standard deviation of the population, the statistic does not
have a normal distribution; it has what is called the t-distribution.
Because there is a different t-distribution for each sample size, it
is not practical to list a separate area-of-the-curve table for each
one. Instead, critical t-values for common alpha levels (0.10,
0.05, 0.01, and so forth) are usually given in a single table for a
range of sample sizes. For very large samples, the t-distribution
approximates the standard normal (z) distribution. In
practice, it is best to use t-distributions any time the
population standard deviation is not known.
Values in the t-table are not actually listed by sample size
but by degrees of freedom (df). The number of degrees of freedom
for a problem involving the t-distribution for sample size n is
simply n – 1 for a one-sample mean problem.

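ONE-SAMPLE t-test
 Solution to problem
1. Ho: µ = 50 minutes
Ha: µ < 50 minutes
2. Use t-test (n < 30; σ is unknown)
3. Use α = 5%
4. Value of test criterion, tc
$$t_c = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{42 - 50}{11.9/\sqrt{12}} = -2.33$$
where s = sample standard deviation and n = sample
size; other terms are as defined earlier
5. Critical region
ttab = t(5%, 11) = 1.796 (one-tailed), so Ho is rejected when tc < −1.796
6. Since tc = −2.33 < −1.796 (that is, |tc| > ttab), reject Ho
∴ At the 5% level, the true (population) mean is less than 50 minutes
(b) For α = 1%, ttab = t(1%, 11) = 2.718 (one-tailed). Since |tc| = 2.33 < 2.718, Ho is not rejected at the 1% level.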
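The one-sample t computation can be sketched in Python as below. This is an illustration, not the slides' method of computation: the function name `one_sample_t` is hypothetical, and the one-sided critical values for df = 11 (1.796 at 5% and 2.718 at 1%) are copied by hand from the Student's t table in these notes.

```python
from math import sqrt


def one_sample_t(sample_mean, hyp_mean, sample_sd, n):
    """Test criterion t = (x_bar - mu) / (s / sqrt(n))."""
    return (sample_mean - hyp_mean) / (sample_sd / sqrt(n))


# Registration-time problem: n = 12, x_bar = 42, s = 11.9, mu = 50 minutes
t_c = one_sample_t(42, 50, 11.9, 12)

# One-sided critical values for df = 12 - 1 = 11, taken from the t-table
T_CRIT_DF11 = {0.05: 1.796, 0.01: 2.718}

# Left-tailed test (Ha: mu < 50): reject Ho when t_c < -t_tab
decisions = {alpha: t_c < -t_tab for alpha, t_tab in T_CRIT_DF11.items()}

print(f"t_c = {t_c:.4f}")
for alpha, reject in decisions.items():
    print(f"alpha = {alpha}: reject Ho = {reject}")
```

The same statistic leads to different decisions at the two levels, which is why the problem asks for both.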

Student’s t-Distribution
df
One Sided 0.2500 0.2000 0.1500 0.1000 0.0500 0.0250 0.0100 0.0050 0.0025 0.0010 0.0005
Two Sided 0.5000 0.4000 0.3000 0.2000 0.1000 0.0500 0.0200 0.0100 0.0050 0.0020 0.0010
1 1.000 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6
2 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 14.09 22.33 31.60
3 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92
4 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610
5 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781
10 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318
13 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140
15 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015
17 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965
18 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922
19 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850
21 0.686 0.859 1.063 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819
22 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23 0.685 0.858 1.060 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.767

df
One Sided 0.2500 0.2000 0.1500 0.1000 0.0500 0.0250 0.0100 0.0050 0.0025 0.0010 0.0005
Two Sided 0.5000 0.4000 0.3000 0.2000 0.1000 0.0500 0.0200 0.0100 0.0050 0.0020 0.0010
24 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25 0.684 0.856 1.058 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725
26 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690
28 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659
30 0.683 0.854 1.055 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646
40 0.681 0.851 1.050 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
50 0.679 0.849 1.047 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496
60 0.679 0.848 1.045 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460
80 0.678 0.846 1.043 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416
100 0.677 0.845 1.042 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390
120 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373
∞ 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291
Student’s t-Distribution (Cont’d)

Other Sample Problems …
1. A Little League baseball coach wants to know if his team is
representative of other teams in scoring runs. Nationally, the average
number of runs scored by a Little League Team in a game is 5.70. He
chooses five games at random in which his team scored 5, 9, 4, 11, and 8
runs. Is it likely that his team’s scores could have come from the
national distribution? Assume an alpha level of 0.05. What is the 95%
confidence interval for runs scored per team per game?
2. A professor wants to know if her introductory statistics class has a good
grasp of basic math. Six students are chosen at random from the class
and given a math proficiency test. The professor wants the class to be
able to score above 70 on the test. The six students get scores of 62, 92,
75, 68, 83, and 95. Can the professor have 90 percent confidence that
the mean score for the class on the test would be above 70? (Note:
Included in PS No. 5)

SAMPLE PROBLEM
z-test, two-sample test
In the recently released results of the Licensure Examination for
Teachers (LET), University A with 125 examinees posted a mean
passing percentage of 80.75 while University B with 110 examinees
posted a mean of 86.45. A check with the Professional Regulation
Commission (PRC) revealed that for that year, the standard deviation
for all takers was 11.65. Is the President of University B 99% confident
that his university Education graduates are significantly better than the
Education graduates of University A?

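2-SAMPLE z-test
 Solution to problem
1. Ho: µA = µB
Ha: µA ≠ µB
2. Use z-test
3. Use α = 1%
4. Value of test criterion
$$z_{comp} = \frac{\bar{x}_A - \bar{x}_B}{\sigma\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}} = \frac{80.75 - 86.45}{11.65\sqrt{\frac{1}{125} + \frac{1}{110}}} = -3.74$$
5. Critical region
Two-tailed at α = 1%: z(0.01/2) = z0.005 = 2.58; reject Ho when |zcomp| > 2.58
6. Since |zcomp| = 3.74 > 2.58, reject Ho. ∴ Because the mean of University B is the higher of the two, the President of University B is
correct in his claim that his Education graduates are significantly better
than those of University A.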
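The two-sample z statistic with a common known σ can be traced in Python. The sketch below is illustrative only: the function name `two_sample_z` is hypothetical, and the two-tailed critical value is obtained from `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_a, mean_b, sigma, n_a, n_b):
    """z = (x_bar_A - x_bar_B) / (sigma * sqrt(1/n_A + 1/n_B))."""
    return (mean_a - mean_b) / (sigma * sqrt(1 / n_a + 1 / n_b))


# LET problem: University A (n = 125, mean 80.75), University B (n = 110, mean 86.45)
z_comp = two_sample_z(80.75, 86.45, 11.65, 125, 110)

alpha = 0.01
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
reject_h0 = abs(z_comp) > z_crit

print(f"z_comp = {z_comp:.4f}, z_crit = {z_crit:.3f}, reject Ho: {reject_h0}")
```

The sign of `z_comp` is negative because University A's mean is subtracted first; the two-tailed decision uses its absolute value.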

A researcher wants to determine whether or not a given drug has any
effect on the scores of human subjects performing a task of ESP
sensitivity. He randomly assigns his subjects to one of two groups.
Nine hundred subjects in group 1 (the experimental group) receive an
oral administration of the drug prior to testing. In contrast, 1000
subjects in group 2 (control group) receive a placebo. The result of the
ESP sensitivity tests are as follows:
 Mean ESP for group 1 is 9.78 with a standard deviation of 4.05
 Mean ESP for group 2 is 15.10 with a standard deviation of 4.28
Is the drug effective? (Note: lower ESP means less sensitive)

SAMPLE PROBLEM
t-test, two-sample test
An English Professor wishes to see if a literature course changes
regional attitudes. The literature class deals with regional problems. A
regional attitude test is given to 12 students at the beginning (score 1)
and end (score 2) of the semester. The scale is from 20 to 100 with high
score indicating a high degree of regional bias. The results are as
follows:
Score 1: 67 78 91 53 48 56 62 47 28 37 46 52
Score 2: 58 69 80 54 32 49 64 40 27 34 39 47
Is there any significant difference between the two sample means?

TWO-SAMPLE t-test
$$t_c = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where:
$s_p^{2}$ = pooled variance
$$s_p^{2} = \frac{(n_1 - 1)s_1^{2} + (n_2 - 1)s_2^{2}}{n_1 + n_2 - 2}$$
$s_1^{2}, s_2^{2}$ = variance of samples 1 & 2, respectively
$n_1, n_2$ = no. of subjects in samples 1 & 2, respectively

 Test of Hypothesis
1. Ho: µ1 = µ2
Ha: µ1 ≠ µ2
2. Test criterion; t-test
3. Level of significance, 5%
4. Computed value of the test criterion
5. Critical region
ttab = t(α/2, n1 + n2 – 2)
6. Conclude

 Solution to Problem on t-test (two sample test)
n1 = 12 n2 = 12
 Test of Hypothesis
1. Ho: µ1 = µ2
Ha: µ1 ≠ µ2
2. Use t-test
3. Use α = 5%
4. Value of the test criterion
$$s_p^{2} = \frac{(12 - 1)(297.90) + (12 - 1)(261.17)}{12 + 12 - 2} = 279.54$$
$$t_c = \frac{55.42 - 49.42}{\sqrt{279.54\left(\frac{1}{12} + \frac{1}{12}\right)}} = 0.88$$
5. Critical region
ttab = t(α/2, 22) = 2.074
6. Since tc < ttab, do not reject Ho
∴ Treating the two sets of scores as independent samples, the course does not show a significant effect
on the regional bias of the literature students
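The solution above treats the two score columns as independent samples and pools their variances. Because both columns come from the same 12 students, a paired t-test (listed in the course description under repeated measures) is a natural cross-check. The Python sketch below computes both statistics; the paired analysis is an addition for comparison and is not the slides' solution, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

score1 = [67, 78, 91, 53, 48, 56, 62, 47, 28, 37, 46, 52]
score2 = [58, 69, 80, 54, 32, 49, 64, 40, 27, 34, 39, 47]


def pooled_t(a, b):
    """Independent-samples t with pooled variance (the slides' approach)."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def paired_t(a, b):
    """Paired t on the per-student differences (added for comparison)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / sqrt(variance(d) / len(d))


t_pooled = pooled_t(score1, score2)  # compare with t(0.025, 22) = 2.074
t_paired = paired_t(score1, score2)  # compare with t(0.025, 11) = 2.201

print(f"pooled t = {t_pooled:.3f}, paired t = {t_paired:.3f}")
```

On these data the two analyses disagree: the pooled statistic stays below its critical value, while the paired statistic exceeds its critical value, so the choice of test matters for the conclusion.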

SAMPLE PROBLEM
F-test
The data below represent the number of hours
of pain relief provided by five different brands of
headache tablets administered to 25 subjects. The
25 subjects were randomly divided into five groups
and each group was treated with a different brand.
Brand of Tablet
A | B | C | D | E
5 | 9 | 3 | 2 | 7
4 | 7 | 5 | 3 | 6
8 | 8 | 2 | 4 | 9
6 | 6 | 3 | 1 | 4
3 | 9 | 7 | 4 | 7

Perform the analysis of variance and test the
hypothesis at the 0.05 level of significance that the
mean number of hours of relief provided by the
tablets is the same for all five brands.
Three or more sample test
 F-test
Source of Variation | Degree of Freedom | Sum of Squares | Mean Squares | Fc | Ftab (5%) | Ftab (1%)
Column | 4 | 79.44 | 19.86 | 6.90** | 2.87 | 4.43
Error | 20 | 57.60 | 2.88 | | |
TOTAL | 24 | 137.04 | | | |
** significant at 1% level

Working Equations
cdf = p – 1; p = no. of columns (groups)
= 5 – 1 = 4
Tdf = pr – 1; r = no. of subjects per group
= (5)(5) – 1
= 24
Edf = Tdf – cdf
= 24 – 4
= 20

Correction Factor, CF = (ΣX)²/(pr) = (132)²/25 = 696.96
TSS = (5² + 4² + … + 7²) – CF = 834 – 696.96
= 137.04
CSS = (26² + 39² + 20² + 14² + 33²)/5 – CF
= 776.40 – 696.96
= 79.44
ESS = TSS – CSS
= 137.04 – 79.44
= 57.60

CMS = CSS/cdf = 79.44/4 = 19.86
EMS = ESS/Edf = 57.60/20 = 2.88
Fc = CMS/EMS = 19.86/2.88
= 6.90
F(5%, 4, 20) = 2.87 = F0.05
F(1%, 4, 20) = 4.43 = F0.01
Fc > F0.01
∴ The mean hours of relief of the 5 brands are significantly different at
the 1% level of significance
Very Clean advertises that its detergent will remove all stains, except
oil-based paint, in any kind of water. Consumer Action is evaluating
this claim. Batches of washing were run in randomly chosen homes
having a particular type of water – hard, moderate, or soft. Each
batch contains an assortment of rags and cloth scraps stained with
food products, grease, and dirt over a 150 square inch area. After
washing, the number of square inches that were still stained was
determined and the following results were obtained:
Observation
Type of Water
Hard Moderate Soft
1
2
3
4
5
6
4
3
9
7
5
6
9
4
3
5
0
2
4
3
At 5% level, should Consumer Action conclude that the type of
water affects the effectiveness of the detergent?
Another Sample Problem: F-test

SAMPLE PROBLEM
Pearson r, Association
Consider the following data taken from three sample barangays in
Iligan City, Lanao del Norte during the NCSO and BAECon
Integrated Survey of Households in the 3rd quarter of 1977 (X - highest
grade completed by household head in years; Y - total family income
for the quarter in pesos).

Household
Number
Highest
Grade, X
Income,
Y
Household
Number
Highest
Grade, X
Income,
Y
1 12 1444 12 8 1440
2 13 1650 13 14 2140
3 13 1200 14 15 3330
4 18 2880 15 8 750
5 8 360 16 10 108
6 10 1965 17 4 150
7 6 744 18 10 240
8 8 2784 19 14 3000
9 10 1940 20 6 400
10 6 2450 21 13 2250
11 6 1290 22 6 100
Is there significant relationship between the family income and the
highest grade obtained by the household head?

PEARSON CORRELATION COEFFICIENT (r)
$$r = \frac{CP_{xy}}{\sqrt{(SS_x)(SS_y)}}$$
where
$$CP_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$ = cross product of x & y
$$SS_x = \sum X^{2} - \frac{(\sum X)^{2}}{n}$$ = sum of squares of x
$$SS_y = \sum Y^{2} - \frac{(\sum Y)^{2}}{n}$$ = sum of squares of y

Value of r Interpretation
0.00 – 0.20 Slight correlation, negligible relationship
0.21 – 0.40 Low correlation, definite but small relationship
0.41 – 0.70 Moderate correlation, substantial relationship
0.71 – 0.90 High correlation, marked relationship
0.91 – 1.00 Very high correlation, very dependable relationship

Sample Problem: Given below are the scores of six
students in Math and Physics
Student | Math Score | Physics Score
1 | 3 | 6
2 | 2 | 4
3 | 4 | 4
4 | 6 | 7
5 | 5 | 5
6 | 1 | 3
Required:
 Correlation coefficient and its interpretation
 Are the two scores significantly related?

Steps on Test of Hypothesis on Correlation Coefficient, r
1. Ho: ρ = 0; (X & Y are statistically independent)
Ha: ρ ≠ 0; (X & Y are statistically dependent)
2. Define the level of significance, α
3. Select the test criterion (t-test)
4. Compute the value of the test criterion
5. Define critical region
ttab = t(α/2, n – 2)
6. Conclude

Solution to the problem on
Pearson r (Association)
n = 22
ΣX = 218; ΣX² = 2444; X̄ = 9.91
ΣY = 32615; ΣY² = 70896117; Ȳ = 1482.50
ΣXY = 373084

$$CP_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 373084 - \frac{(218)(32615)}{22} = 49899$$
$$SS_x = \sum X^{2} - \frac{(\sum X)^{2}}{n} = 2444 - \frac{(218)^{2}}{22} = 283.82$$
$$SS_y = \sum Y^{2} - \frac{(\sum Y)^{2}}{n} = 70896117 - \frac{(32615)^{2}}{22} = 22544379.50$$
$$r = \frac{CP_{xy}}{\sqrt{(SS_x)(SS_y)}} = \frac{49899}{\sqrt{(283.82)(22544379.50)}} = 0.62$$

Test of Hypothesis
1. Ho: ρ = 0
Ha: ρ ≠ 0
2. Use t-test
3. Use α = 0.05
4. Value of the test criterion
$$t_c = \frac{r}{\sqrt{\frac{1 - r^{2}}{n - 2}}} = \frac{0.62}{\sqrt{\frac{1 - 0.62^{2}}{22 - 2}}} = 3.53$$
5. ttab = t(0.05/2, 22 – 2) = 2.086
6. Since tc > ttab, reject Ho
∴ The quarterly family income and the number of years of schooling
of household heads are significantly related (associated)
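The cross-product and sum-of-squares route to Pearson r can be reproduced from the 22 household records. The Python sketch below is illustrative (the function name `pearson_r` is an assumption). It carries r at full precision, so its t statistic is slightly larger than a hand computation that first rounds r to 0.62.

```python
from math import sqrt

# Highest grade completed (X) and quarterly family income in pesos (Y), households 1 to 22
X = [12, 13, 13, 18, 8, 10, 6, 8, 10, 6, 6, 8, 14, 15, 8, 10, 4, 10, 14, 6, 13, 6]
Y = [1444, 1650, 1200, 2880, 360, 1965, 744, 2784, 1940, 2450, 1290,
     1440, 2140, 3330, 750, 108, 150, 240, 3000, 400, 2250, 100]


def pearson_r(x, y):
    """r = CPxy / sqrt(SSx * SSy), using the computational formulas."""
    n = len(x)
    cp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return cp_xy / sqrt(ss_x * ss_y)


n = len(X)
r = pearson_r(X, Y)
t_c = r / sqrt((1 - r ** 2) / (n - 2))  # test criterion with n - 2 degrees of freedom

print(f"r = {r:.4f}, t_c = {t_c:.3f}")  # compare t_c with t(0.025, 20) = 2.086
```

Either way the statistic exceeds the tabled value 2.086, so the decision to reject Ho does not depend on the rounding of r.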

Simple Linear Regression
 Simple regression analysis can be performed between two variables
if the relationship between them is linear for the purpose of
determining their functional relationship in order to predict one on
the basis of the other
 The functional (linear) relationship is of the form:
$$\hat{Y} = a + bX$$
where Ŷ is the predicted value of Y given the value of X, a is the
intercept, and b is the slope of the regression line
 X is the independent variable and is used as the “predictor”. Y, the
variable whose value is to be “predicted”, is called the dependent
variable (also called the predictand or criterion variable).

 The intercept a can be calculated using the expression:
$$a = \frac{(\sum Y)(\sum X^{2}) - (\sum X)(\sum XY)}{N(\sum X^{2}) - (\sum X)^{2}}$$
 The slope of the regression line b, on the other hand, can be
calculated using the expression:
$$b = \frac{N(\sum XY) - (\sum X)(\sum Y)}{N(\sum X^{2}) - (\sum X)^{2}}$$
 In the above formulas, all we need to know are ΣY, N, ΣX, ΣXY,
and ΣX².

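From the given problem:
$$a = \frac{(32615)(2444) - (218)(373084)}{22(2444) - (218)^{2}} = \frac{-1621252}{6244} = -259.6496$$
$$b = \frac{22(373084) - (218)(32615)}{22(2444) - (218)^{2}} = \frac{1097778}{6244} = 175.8133$$
or
$$b = \frac{CP_{xy}}{SS_x} = \frac{49899}{283.82} = 175.81$$
$$a = \bar{Y} - b\bar{X} = 1482.50 - 175.81(9.91) = -259.78$$
(The small difference between the two values of a is due to rounding.)
Therefore, the equation of the regression line is given by:
$$\hat{Y} = -259.78 + 175.81X$$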

Household Number, i   Highest Grade, X   Income, Y (Pesos)   Predicted Income, Ŷ   Residual, εi
1 12 1444 1850.1100 -406.1100
2 13 1650 2025.9233 -375.9233
3 13 1200 2025.9233 -825.9233
4 18 2880 2904.9898 -24.9898
5 8 360 1146.8568 -786.8568
6 10 1965 1498.4834 466.5166
7 6 744 795.2302 -51.2302
8 8 2784 1146.8568 1637.1432
9 10 1940 1498.4834 441.5166
10 6 2450 795.2302 1654.7698
11 6 1290 795.2302 494.7698
12 8 1440 1146.8568 293.1432
13 14 2140 2201.7366 -61.7366
14 15 3330 2377.5499 952.4501
15 8 750 1146.8568 -396.8568
16 10 108 1498.4834 -1390.4834
17 4 150 443.6036 -293.6036
18 10 240 1498.4834 -1258.4834
19 14 3000 2201.7366 798.2634
20 6 400 795.2302 -395.2302
21 13 2250 2025.9233 224.0767
22 6 100 795.2302 -695.2302
Ŷ = −259.6496 + 175.8133(X); εi = Yi − Ŷi
Σεi = −0.0082 ≈ 0.00
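The intercept, slope, and residuals of the fitted line can be recomputed from the raw data. The Python sketch below is an illustration (the function name `fit_line` is hypothetical); it uses the same sums-based formulas for a and b as above and then checks that the residuals add up to approximately zero.

```python
X = [12, 13, 13, 18, 8, 10, 6, 8, 10, 6, 6, 8, 14, 15, 8, 10, 4, 10, 14, 6, 13, 6]
Y = [1444, 1650, 1200, 2880, 360, 1965, 744, 2784, 1940, 2450, 1290,
     1440, 2140, 3330, 750, 108, 150, 240, 3000, 400, 2250, 100]


def fit_line(x, y):
    """Least-squares intercept a and slope b from the raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sxx = sum(p * p for p in x)
    denom = n * sxx - sx ** 2
    a = (sy * sxx - sx * sxy) / denom
    b = (n * sxy - sx * sy) / denom
    return a, b


a, b = fit_line(X, Y)
predicted = [a + b * x for x in X]                 # Y-hat for each household
residuals = [y - p for y, p in zip(Y, predicted)]  # epsilon_i = Y_i - Y-hat_i

print(f"a = {a:.4f}, b = {b:.4f}, sum of residuals = {sum(residuals):.6f}")
```

With unrounded coefficients the residual sum is zero up to floating-point error; the small non-zero total in the table comes from rounding a and b to four decimals.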

Sample Problem: Spearman ρ
An English Professor wishes to see if a literature course
changes regional attitudes. The literature class deals with
regional problems. A regional attitude test is given to 12
students at the beginning (score 1) and end (score 2) of the
semester. The scale is from 20 to 100 with high score
indicating a high degree of regional bias. The results are as
tabulated below:
Determine if there is significant association of the two
ranked scores.
Score 1 67 78 91 53 48 56 62 47 28 37 46 52
Score 2 58 69 80 54 32 49 64 40 27 34 39 47

STUDENT NO.   SCORE 1 (X1i)   RANK (Rx1i)   SCORE 2 (X2i)   RANK (Rx2i)   di   di²
1 67 10 58 9 1 1
2 78 11 69 11 0 0
3 91 12 80 12 0 0
4 53 07 54 08 -1 1
5 48 05 32 02 3 9
6 56 08 49 07 1 1
7 62 09 64 10 -1 1
8 47 04 40 05 -1 1
9 28 01 27 01 0 0
10 37 02 34 03 -1 1
11 46 03 39 04 -1 1
12 52 06 47 06 0 0
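di = Rx1i - Rx2i; Σdi² = 16
SPEARMAN RHO (rs)
Solution to Problem on Spearman Rho

1. Ho: ρ = 0
Ha: ρ ≠ 0
2. Use t-test (n > 10)
3. Use α = 0.05
4. Value of the test criterion
$$r_s = 1 - \frac{6\sum d_i^{2}}{n(n^{2} - 1)} = 1 - \frac{6(16)}{12(12^{2} - 1)} = 0.94$$
where rs = Spearman rho correlation
coefficient
$$t_c = r_s\sqrt{\frac{n - 2}{1 - r_s^{2}}} = 0.94\sqrt{\frac{12 - 2}{1 - 0.94^{2}}} = 8.71$$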

Spearman …
5. Critical region
ttab = t(5%/2, 10) = 2.228 (two tailed)
6. tc > ttab, reject Ho
∴ There is a significant association between the two ranked scores:
students who ranked high in regional bias at the beginning of the semester also tended to rank high at the end.
The association by itself does not show whether the course reduced regional bias.
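The ranking, the rank differences, and Spearman's rho can be traced in Python. The sketch below is illustrative: the helper names `ranks` and `spearman_rho` are assumptions, and the simple ranking helper assumes there are no tied scores, which holds for these data. It carries rs at full precision, so its t statistic is somewhat larger than the hand value obtained from rs rounded to 0.94.

```python
from math import sqrt

score1 = [67, 78, 91, 53, 48, 56, 62, 47, 28, 37, 46, 52]
score2 = [58, 69, 80, 54, 32, 49, 64, 40, 27, 34, 39, 47]


def ranks(values):
    """Rank 1 = smallest value. Assumes no ties among the values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) on the rank differences."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1)), d2


n = len(score1)
r_s, sum_d2 = spearman_rho(score1, score2)
t_c = r_s * sqrt((n - 2) / (1 - r_s ** 2))

print(f"sum d^2 = {sum_d2}, rs = {r_s:.4f}, t_c = {t_c:.2f}")  # compare with t(0.025, 10) = 2.228
```

With tied scores the ranks would need to be averaged, and the shortcut formula for rs would no longer be exact.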
