6. Example 1
Mendelian law of dominance
Cross: Aa x Aa
A -> Tall (dominant)
a -> Dwarf (recessive)
Aa is therefore Tall (the dominant allele masks the recessive one).
Offspring genotypes: AA, Aa, Aa, aa -> expected ratio 3 Tall : 1 Dwarf
Observed: 639 Tall and 281 Dwarf
Chi-square requires that you have numeric values (counts).
Chi-square should not be calculated if an expected value is less than 5.
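As a minimal sketch (my addition, not from the slides), this chi-square test can be run with scipy; the expected counts come from applying the 3:1 ratio to the 920 observed plants:

```python
from scipy.stats import chisquare

observed = [639, 281]                    # Tall, Dwarf counts from the cross
total = sum(observed)                    # 920 plants
expected = [total * 3/4, total * 1/4]    # 3:1 Mendelian ratio -> 690, 230

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat, p)  # statistic ~ 15.08; with 1 df this exceeds 3.84, so p < 0.05
```

Note that both expected counts are well above 5, so the rule above is satisfied.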
7. Choosing a Test
First, check whether there is a hypothesis to test.
If yes, decide which test to use.
If no, then there is no statistical test to run.
What is there?
Parametric tests assume the data come from a standard probability distribution.
Non-parametric tests can be used for both normally and non-normally distributed data.
Question: Then why not always use them?
Parametric tests make more assumptions: if those assumptions hold, the results are more accurate.
10. Example
  Number of sixes (k):   0    1    2    3
  Number of rolls:      48   35   15    3

Binomial model, P(k out of n) = n! / (k!(n-k)!) * p^k * (1-p)^(n-k), here with n = 3 dice and p = 1/6:
p1 = P(roll 0 sixes) = P(X=0) = 0.58
p2 = P(roll 1 six)   = P(X=1) = 0.345
p3 = P(roll 2 sixes) = P(X=2) = 0.07
p4 = P(roll 3 sixes) = P(X=3) = 0.005
http://www.mathsisfun.com/data/binomial-distribution.html
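A short sketch (my addition) reproducing these probabilities and turning them into expected counts for the 101 observed rolls:

```python
from scipy.stats import binom

n, p = 3, 1/6                    # three dice, probability 1/6 of a six
rolls = [48, 35, 15, 3]          # observed counts of 0..3 sixes (101 rolls)

probs = [binom.pmf(k, n, p) for k in range(4)]   # 0.579, 0.347, 0.069, 0.005
expected = [sum(rolls) * q for q in probs]       # ~58.4, 35.1, 7.0, 0.47
print(expected)  # the k = 3 cell is below 5, so it should be pooled before chi-square
```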
11. Two samples - compare the mean value for some variable of interest
Parametric:
• t-test for independent samples
Nonparametric:
• Wald-Wolfowitz runs test
• Mann-Whitney U test
• Kolmogorov-Smirnov two-sample test
12. Compare two variables measured in the same sample
Parametric:
• t-test for dependent samples
Nonparametric:
• Sign test
• Wilcoxon's matched pairs test
If more than two variables are measured in the same sample:
Parametric:
• Repeated measures ANOVA
Nonparametric:
• Friedman's two-way analysis of variance
• Cochran Q
13. Null Hypothesis
Coined by the English statistician and geneticist Ronald Fisher in 1935.
At a given probability level it is judged to be either true or false.
Comparing populations/datasets: Population A and Population B.
Null hypothesis is true -> no significant difference between the populations.
Null hypothesis is false -> significant difference between the populations.
There are formulas to calculate a test statistic for a population comparison, and there are look-up tables with critical values.
If the calculated value exceeds the look-up value, the null hypothesis is rejected (false); otherwise it is retained (true).
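As an illustrative sketch (my addition), the look-up step can be replaced by scipy; the statistic here is the chi-square value from the earlier Mendel example:

```python
from scipy.stats import chi2

calculated = 15.08                 # chi-square statistic from the Mendel example
critical = chi2.ppf(0.95, df=1)    # look-up value at alpha = 0.05, 1 df -> 3.84

# Reject the null hypothesis when the calculated value exceeds the critical value
print("reject H0" if calculated > critical else "retain H0")
```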
14. Null Hypothesis Testing
You suspect that whenever it rains, your experiment fails.
NULL hypothesis is true: no significant association between your experiment failing and rain -> rain has nothing to do with your experiment failing.
NULL hypothesis is false: there is significant damage to your experiments when it rains -> rain ruins your experiment!
Record when your experiment fails and check whether it rained at the time. It may be that the failures happen by chance, or there may indeed be a relationship.
15. Lab study vs statistics research
http://www.youtube.com/watch?feature=player_embedded&v=PbODigCZqL8
16. T-test
The t-statistic was introduced in 1908 by William Sealy Gosset (publishing under the pen name "Student").
Used for normally distributed populations.
http://www.socialresearchmethods.net/kb/stat_t.php
22. ANOVA: F statistics
Analysis of variance
One-way and two-way
F compares two variances:

  s^2 between groups / s^2 within groups

The within-group variation serves as the baseline against which the between-group variation is judged.
23. So how big is F?
Since F is Mean Square Between / Mean Square Within = MSG / MSE,
a large value of F indicates relatively more difference between groups than within groups (evidence against H0).
To get the p-value, we compare to the F(I-1, n-I) distribution:
• I - 1 degrees of freedom in the numerator (number of groups - 1)
• n - I degrees of freedom in the denominator (the rest of the df)
24. Connections between SST, MST, and standard deviation
If we ignore the groups for a moment and just compute the standard deviation of the entire data set, we see

  s^2 = Σ (x_ij - x̄)^2 / (n - 1) = SST / DFT = MST

So SST = (n - 1) s^2, and MST = s^2. That is, SST and MST measure the TOTAL variation in the data set.
SST: total sum of squares
MST: total mean square
DFT: total degrees of freedom
25. Connections between SSE, MSE, and standard deviation
Remember:

  s_i^2 = Σ (x_ij - x̄_i)^2 / (n_i - 1) = SS[Within Group i] / df_i

So SS[Within Group i] = s_i^2 * df_i.
This means that we can compute SSE from the standard deviations and sizes (df) of each group:

  SSE = SS[Within] = Σ_i s_i^2 (n_i - 1) = Σ_i s_i^2 df_i
27. Computing the ANOVA F statistic

  data   group   group    WITHIN difference:        BETWEEN difference:
                 mean     data - group mean         group mean - overall mean
                          plain     squared         plain     squared
  5.3    1       6.00     -0.70     0.490           -0.44     0.194
  6.0    1       6.00      0.00     0.000           -0.44     0.194
  6.7    1       6.00      0.70     0.490           -0.44     0.194
  5.5    2       5.95     -0.45     0.203           -0.49     0.240
  6.2    2       5.95      0.25     0.063           -0.49     0.240
  6.4    2       5.95      0.45     0.203           -0.49     0.240
  5.7    2       5.95     -0.25     0.063           -0.49     0.240
  7.5    3       7.53     -0.03     0.001            1.09     1.188
  7.2    3       7.53     -0.33     0.109            1.09     1.188
  7.9    3       7.53      0.37     0.137            1.09     1.188

  overall mean: 6.44
  WITHIN total:  1.757;  divided by df = 10 - 3 = 7 gives MSE = 0.251
  BETWEEN total: 5.106;  divided by df = 3 - 1 = 2 gives MSG = 2.553

  F = MSG / MSE = 2.553 / 0.251 ≈ 10.2
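A hedged cross-check (my addition): scipy reproduces this F statistic directly from the raw data in the table above:

```python
from scipy.stats import f_oneway

g1 = [5.3, 6.0, 6.7]
g2 = [5.5, 6.2, 6.4, 5.7]
g3 = [7.5, 7.2, 7.9]

F, p = f_oneway(g1, g2, g3)
print(F, p)   # F ~ 10.22, matching the hand computation; p ~ 0.008
```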
30. In Summary

  SST = Σ_obs (x_ij - x̄)^2  = s^2 * DFT
  SSE = Σ_obs (x_ij - x̄_i)^2 = Σ_groups s_i^2 * df_i
  SSG = Σ_obs (x̄_i - x̄)^2  = Σ_groups n_i (x̄_i - x̄)^2
  SSE + SSG = SST
  MS = SS / DF
  F = MSG / MSE
31. R^2 Statistic
R^2 gives the percent of variance due to between-group variation:

  R^2 = SS[Between] / SS[Total] = SSG / SST

We will see R^2 again when we study regression.
32. Where's the Difference?
Once ANOVA indicates that the groups do not all appear to have the same means, what do we do?

  Analysis of Variance for days
  Source      DF      SS      MS      F      P
  treatment    2   34.74   17.37   6.45  0.006
  Error       22   59.26    2.69
  Total       24   94.00

  Level   N    Mean   StDev
  A       8   7.250   1.669
  B       8   8.875   1.458
  P       9  10.111   1.764
  Pooled StDev = 1.641

  Individual 95% CIs for mean, based on pooled StDev:
  A  (-------*-------)
  B        (-------*-------)
  P               (------*-------)
     ----+---------+---------+----
        7.5       9.0      10.5

Clearest difference: P is worse than A (the CIs don't overlap).
33. Multiple Comparisons
Once ANOVA indicates that the groups do not all have the same means, we can compare them two by two using the 2-sample t-test.
• We need to adjust our p-value threshold because we are doing multiple tests with the same data.
• There are several methods for doing this.
• If we really just want to test the difference between one pair of treatments, we should set the study up that way.
34. Tukey's Pairwise Comparisons

  Tukey's pairwise comparisons
  Family error rate = 0.0500
  Individual error rate = 0.0199
  95% confidence overall; use alpha = 0.0199 for each test.
  Critical value = 3.55

  Intervals for (column level mean) - (row level mean):

           A                  B
  B   (-3.685,  0.435)
  P   (-4.863, -0.859)   (-3.238,  0.766)

These give 98.01% CIs for each pairwise difference.
The 98% CI for A - P is (-4.86, -0.86).
Only P vs A is significant (both interval endpoints have the same sign).
35. Tukey's Method in R

  Tukey multiple comparisons of means
  95% family-wise confidence level
         diff       lwr      upr
  B-A  1.6250  -0.43650  3.6865
  P-A  2.8611   0.85769  4.8645
  P-B  1.2361  -0.76731  3.2395
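A comparable computation is available in Python via statsmodels. This is a hedged sketch only: the slide reports just the fitted intervals, so the raw group values below are hypothetical stand-ins for the A/B/P measurements:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical raw values for three treatment groups (illustrative only)
values = np.array([7.1, 7.4, 6.9, 8.8, 9.0, 8.6, 10.0, 10.3, 9.9])
groups = np.array(["A", "A", "A", "B", "B", "B", "P", "P", "P"])

result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result)   # table of pairwise differences with family-wise 95% CIs
```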
36. Independent sample t-test
Number of words recalled:
df = (n1 - 1) + (n2 - 1) = 18

  t = (x̄1 - x̄2) / s_(x̄1 - x̄2) = (19 - 26) / 1 = -7

t(0.05, 18) = 2.101
Since |t| = 7 > t(0.05, 18), reject H0.
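As a hedged check (my addition), scipy reproduces the critical value used here:

```python
from scipy.stats import t as t_dist

t_stat = (19 - 26) / 1                       # difference in means / standard error
critical = t_dist.ppf(1 - 0.05/2, df=18)     # two-tailed critical value = 2.101

print(abs(t_stat) > critical)                # True -> reject H0
```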
37. T test
One-sample t-test
Paired t-test: the same set of subjects measured over a period of time
Unpaired t-test: independent sets of subjects
http://www.youtube.com/watch?v=JlfLnx8sh-o
One-tailed and two-tailed t-tests (see the sketch below):
One-tailed: the average height of class A is greater than that of class B
Two-tailed: the average height of class A is different from that of class B
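A minimal sketch of the one- vs two-tailed distinction (my addition; the heights are simulated, and the `alternative` keyword assumes scipy >= 1.6):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
class_a = rng.normal(172, 6, 30)   # hypothetical heights (cm) for class A
class_b = rng.normal(168, 6, 30)   # hypothetical heights (cm) for class B

# Two-tailed: is A different from B?
print(ttest_ind(class_a, class_b, alternative="two-sided"))
# One-tailed: is A greater than B?
print(ttest_ind(class_a, class_b, alternative="greater"))
```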
38. Z-test statistics
Use when the sample size is large and the population variance is known.
If the sample size is small and the population variance is unknown, go for the t-test.
39. Calculation of z value

  Z = (X̄ - µ) / sqrt(variance / n)

Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean; that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low?
We begin by calculating the standard error of the mean: SE = 12 / sqrt(55) ≈ 1.62, so z = (96 - 100) / 1.62 ≈ -2.47.
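The same calculation as a short sketch (my addition):

```python
import math
from scipy.stats import norm

mu, sigma, n, xbar = 100, 12, 55, 96
se = sigma / math.sqrt(n)        # standard error ~ 1.62
z = (xbar - mu) / se             # ~ -2.47

p = norm.cdf(z)                  # one-tailed: P(a mean this low by chance) ~ 0.007
print(z, p)
```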
40. F-tests / Analysis of Variance (ANOVA)

  t = (obtained difference between sample means) / (difference expected by chance (error))

  F = (variance (differences) between sample means) / (variance (differences) expected by chance (error))

The difference between sample means is easy for 2 samples (e.g. X1 = 20, X2 = 30, difference = 10), but if X3 = 35 the concept of differences between sample means gets tricky.
41. F-tests / Analysis of Variance (ANOVA)
Simple ANOVA example: total variability splits into two parts.
Between-treatments variance measures differences due to:
1. Treatment effects
2. Chance
Within-treatments variance measures differences due to:
1. Chance
42. F-tests / Analysis of Variance (ANOVA)

  F = MSbetween / MSwithin

When the treatment has no effect, differences between groups/treatments are entirely due to chance. The numerator and denominator will be similar, so the F-ratio should have a value around 1.00.
When the treatment does have an effect, the between-treatment differences (numerator) should be larger than chance (denominator), so the F-ratio should be noticeably larger than 1.00.
43. F-tests / Analysis of Variance (ANOVA)
Simple independent-samples ANOVA example: F(3, 8) = 9.00, p < 0.05

          Placebo   Drug A   Drug B   Drug C
  Mean      1.0       1.0      4.0      6.0
  SD        1.73      1.0      1.0      1.73
  n         3         3        3        3

There is a difference somewhere; we have to use post-hoc tests (essentially t-tests corrected for multiple comparisons) to examine further.
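As a minimal sketch (my addition), the F statistic can be recovered from these summary statistics alone:

```python
import numpy as np

means = np.array([1.0, 1.0, 4.0, 6.0])   # Placebo, Drug A, Drug B, Drug C
sds   = np.array([1.73, 1.0, 1.0, 1.73])
ns    = np.array([3, 3, 3, 3])

grand = np.average(means, weights=ns)                          # grand mean = 3.0
ms_between = np.sum(ns * (means - grand) ** 2) / (len(means) - 1)
ms_within  = np.sum((ns - 1) * sds ** 2) / np.sum(ns - 1)

print(ms_between / ms_within)   # ~ 9.0, matching F(3, 8) = 9.00
```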
44. F-test ANOVA
http://www.youtube.com/watch?v=-yQb_ZJnFXw
45. Non-parametric tests
Non-parametric tests are used to overcome the underlying assumption of normality in parametric tests; only quite general assumptions regarding the population are made.
Read more: Mann-Whitney U-test / Mann-Whitney-Wilcoxon
It does not assume the variances to be equal!
46. Mann-Whitney-Wilcoxon (MWW)
(or Wilcoxon Rank-Sum Test)
Proposed by the German Gustav Deuchler in 1914 (with a missing term in the variance) and later independently by Frank Wilcoxon in 1945.
This test is based on the idea that the particular pattern exhibited when m X random variables and n Y random variables are arranged together in increasing order of magnitude provides information about the relationship between their parent populations.
Assumptions:
• The two samples are random and independent of each other
• Observations are numeric and ordinal (can be arranged in ranks)
It is a test that compares medians.
48. When to use this?
Test of normality:
Simple histogram method
Normal probability plot
49. How to construct a normal probability plot

  Data   rank i   (i - 0.5)/N (X)   Z theoretical value (Y)   observed value
  20     1
  15     2
  26     3
  32     4
  18     5
  28     6
  35     7
  14     8
  26     9
  22     10
  17     11

  Mean = 23.0, SD ≈ 6.96 (computed from the data above)
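A minimal sketch (my addition) that builds this plot from the table's data; scipy's probplot sorts the values, computes the theoretical normal quantile for each rank, and plots observed against theoretical values:

```python
from scipy import stats
import matplotlib.pyplot as plt

data = [20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17]

# Observed values vs theoretical normal quantiles; points near a straight
# line indicate approximate normality
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```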
52. Step by step
Rank the values (both groups pooled together)
Add the ranks for each group
Select the larger of the two rank totals
Calculate N1, N2, Nx and Tx (Nx = number of subjects in the group with the larger rank total, Tx = the larger rank total)
Calculate U:

  U = N1 * N2 + Nx * (Nx + 1)/2 - Tx
55. Calculating the U value
From the rank totals R1 and R2 of the two datasets:
U1 = R1 - n1(n1 + 1)/2
U2 = R2 - n2(n2 + 1)/2
The test statistic U is the smaller of U1 and U2.
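A short sketch (my addition; the scores are hypothetical) running the whole Mann-Whitney procedure with scipy:

```python
from scipy.stats import mannwhitneyu

# Hypothetical scores for two independent groups (illustrative values only)
group1 = [12, 15, 9, 20, 17, 11]
group2 = [14, 22, 25, 16, 21, 19]

U, p = mannwhitneyu(group1, group2, alternative="two-sided")
print(U, p)   # U statistic and two-tailed p-value
```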
56. Kruskal-Wallis test (H test)
Non-parametric test, equivalent to ANOVA (the F test) in parametric statistics.
Does not require the distributions to be normal, but the samples must be independent.
Used more often when variances are unequal; the data are ordinal.
Assumes the distributions have the same shape:
1. If one distribution is skewed to the left and the other to the right (unequal variance), this test will give inaccurate results.
58. Kruskal-Wallis Test
Define the null and alternative hypotheses
State the probability (significance level)
Calculate the degrees of freedom
Find the critical value
Calculate the test statistic
State the result:
H0 - accept the NULL hypothesis: there is no difference between the samples
H1 - reject the NULL hypothesis: there is a difference between the samples
If the statistic > critical value, reject the null hypothesis
59. Kruskal-Wallis: worked example (three groups, n = 6 each, N = 18)

Pooled values with their ranks (1 = smallest of all 18 values):

  Group A: 27 (14),  2 (1),   4 (3),  18 (9),   7 (5),   9 (7)   -> R_A = 39
  Group B: 20 (10),  8 (6),  14 (8),  36 (18), 21 (11), 22 (12)  -> R_B = 65
  Group C: 34 (17), 31 (16),  3 (2),  23 (13), 30 (15),  6 (4)   -> R_C = 67

  H = 12 / (N(N+1)) * Σ (R_i^2 / n_i) - 3(N+1)
    = 12/(18*19) * (39^2/6 + 65^2/6 + 67^2/6) - 3(18+1)
    = 2.854

Critical value (chi-square, df = 2, alpha = 0.05) = 5.99.
Since H = 2.854 < 5.99, we fail to reject the NULL hypothesis.
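A hedged cross-check (my addition) with scipy, using the group data reconstructed above:

```python
from scipy.stats import kruskal

group_a = [27, 2, 4, 18, 7, 9]
group_b = [20, 8, 14, 36, 21, 22]
group_c = [34, 31, 3, 23, 30, 6]

H, p = kruskal(group_a, group_b, group_c)
print(H, p)   # H ~ 2.854; p ~ 0.24 > 0.05, so fail to reject H0
```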
60. Kolmogorov-Smirnov test (KS)
Non-parametric; the distribution may be unknown.
One-sample and two-sample versions:
One-sample - checks goodness of fit
Two-sample - compares two distributions
Goodness of fit for a hypothesis (e.g., Mendel's law of dominance):
H0: F(x) = F*(x) for all x
H1: F(x) ≠ F*(x) for at least one value of x
61. K-S test
The K-S statistic Dn is defined as:

  Dn = max | Fn(x) - F(x) |

where
Dn is known as the K-S distance
n = total number of data points
F(x) = distribution function of the fitted distribution
Fn(x) = i/n, where i = the cumulative rank of the data point
62. Kolmogorov-Smirnov test (KS)

                        Group 1   Group 2
  Not confident            20         4
  Slightly confident       30        27
  Somewhat confident       13        28
  Confident                20        18
  Very confident           41        47

1. Take the total
2. Find the frequencies (proportions)
3. Calculate the cumulative frequencies
4. Find the differences
5. Get the largest difference, D
6. Find the critical value: 1.36 * sqrt((n1 + n2) / (n1 * n2))
7. Test goodness of fit
e.g. if our D > critical D (distributions are unequal) -> reject the NULL hypothesis
63. Kolmogorov-Smirnov calculation

                        Group 1                   Group 2
                  Count  Freq   Cum.        Count  Freq   Cum.       D
  Not confident     20  0.161  0.161          4   0.032  0.032    0.129
  Slightly conf.    30  0.242  0.403         27   0.218  0.250    0.153
  Somewhat conf.    13  0.105  0.508         28   0.226  0.476    0.032
  Confident         20  0.161  0.669         18   0.145  0.621    0.048
  Very confident    41  0.331  1.000         47   0.379  1.000    0.000

  Largest difference: D = 0.153
  Critical D = 1.36 * sqrt((n1 + n2) / (n1 * n2)) = 1.36 * sqrt(248 / (124 * 124)) ≈ 0.173
  Test the NULL hypothesis: since D = 0.153 < 0.173, we fail to reject it.
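A short sketch (my addition) reproducing this manual D computation with numpy:

```python
import numpy as np

g1 = np.array([20, 30, 13, 20, 41])   # Group 1 counts per category
g2 = np.array([4, 27, 28, 18, 47])    # Group 2 counts per category

cdf1 = np.cumsum(g1) / g1.sum()       # cumulative frequencies
cdf2 = np.cumsum(g2) / g2.sum()
D = np.max(np.abs(cdf1 - cdf2))       # ~ 0.153

crit = 1.36 * np.sqrt((g1.sum() + g2.sum()) / (g1.sum() * g2.sum()))
print(D, crit, D > crit)              # 0.153, ~0.173, False -> fail to reject H0
```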
64. Kolmogorov-Smirnov test (KS)
Same data, now tested for goodness of fit against a hypothesized 1:2:3:2:1 distribution across the five categories:

                        Group 1   Group 2
  Not confident            20         4
  Slightly confident       30        27
  Somewhat confident       13        28
  Confident                20        18
  Very confident           41        47

1. Take the total
2. Find the frequencies
3. Calculate the cumulative frequencies
4. Find the differences against the hypothesized cumulative distribution
5. Get the largest difference, D
6. Find the critical value (1.36 / sqrt(sample size))
7. Test goodness of fit
e.g. if our D > critical D (distributions are unequal) -> reject the NULL hypothesis
65. Methods of Estimation
Method of moments
Maximum likelihood
Bayesian estimators
Markov chain Monte Carlo ...
Why?
The population is too large to measure directly, so we test hypotheses on a set of samples.
66. Probability density function (pdf) -> for continuous variables
Probability mass function (pmf) -> for discrete variables
Parameter space: the set of all possible parameter values
Family of pdfs/pmfs: the distributions indexed by those parameters
An estimator T is unbiased if its expected value equals the population parameter.
69. Method of maximum likelihood
The maximum likelihood estimates of a distribution type are the values of its parameters that produce the maximum joint probability density or mass for the observed data X, given the chosen probability model.
Maximum likelihood is more general: it can be applied to any probability distribution.
70. The MLE
The best parameters are obtained by maximizing the probability of the observed samples.
Has good convergence properties as sample sizes increase: the estimate approaches the true value for large N.
Applications are many, from speech recognition to natural language processing to computational biology.
71. Simple MLE: coin tossing
Toss a coin: Head or Tail.
Flip the coin n = 10 times: H, H, H, T, H, T, T, T, H, T => 1, 1, 1, 0, 1, 0, 0, 0, 1, 0
An appropriate model for getting a head in a single flip is the Bernoulli pmf:

  f(x_i; P) = P^(x_i) * (1 - P)^(1 - x_i),  with x_i = 0 or x_i = 1

e.g. if P = 0.6, then f(1) = 0.6 and f(0) = 0.4.
72. The maximum likelihood
Example: we want to estimate the probability, p, that individuals are infected with a certain kind of parasite.

  Ind.   Infected   Probability of observation
  1         1          p
  2         0          1 - p
  3         1          p
  4         1          p
  5         0          1 - p
  6         1          p
  7         1          p
  8         0          1 - p
  9         0          1 - p
  10        1          p

The maximum likelihood method (discrete distribution):
1. Write down the probability of each observation by using the model parameters
2. Write down the probability of all the data:

  Pr(Data | p) = p^6 (1 - p)^4

3. Find the parameter value(s) that maximize this probability
73. The maximum likelihood
Example (continued): we want to estimate the probability, p, that individuals are infected with a certain kind of parasite, using the same data as the previous slide.

Likelihood function: Pr(Data | p) = p^6 (1 - p)^4

Find the parameter value(s) that maximize this probability.

[Plot: the likelihood L(p) against p from 0.0 to 1.0; the curve rises to a maximum of about 0.0012 at p = 0.6 and falls back toward 0 at both ends.]
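A minimal sketch (my addition) reproducing the plotted maximization numerically:

```python
import numpy as np

p = np.linspace(0, 1, 1001)
L = p**6 * (1 - p)**4          # likelihood of 6 infected out of 10

best = p[np.argmax(L)]
print(best, L.max())           # 0.6 and ~0.00119, the peak of the plotted curve
```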
75. Likelihood Function

  L(P | X1, ..., Xn) = Π_{i=1..n} f(Xi | P)
                     = P^(x1) (1-P)^(1-x1) * P^(x2) (1-P)^(1-x2) * ... * P^(xn) (1-P)^(1-xn)
                     = P^(x1) P^(x2) ... P^(xn) * (1-P)^(1-x1) (1-P)^(1-x2) ... (1-P)^(1-xn)
                     = P^(x1+x2+...+xn) * (1-P)^(n - (x1+x2+...+xn))
                     = P^(Σ Xi) * (1-P)^(n - Σ Xi)
76. Analytically, the maximum likelihood can also be found...
...by taking the derivative of the (log-)likelihood with respect to P and finding where the slope is 0.
http://www.ics.uci.edu/~smyth/courses/cs274/papers/MLtutorial.pdf
77. Recap…
Get the population (distribution) type and set up the equation.
Write the log-likelihood function.
Differentiate.
Set the derivative to 0.
Solve the equation to estimate the parameter.
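These steps can be carried out symbolically; a short sketch (my addition) for the parasite example's likelihood p^6 (1-p)^4:

```python
from sympy import symbols, log, diff, solve

p = symbols("p", positive=True)
logL = 6*log(p) + 4*log(1 - p)      # log-likelihood of p^6 (1-p)^4

estimate = solve(diff(logL, p), p)  # set the derivative to 0 and solve
print(estimate)                     # [3/5] -> p-hat = 0.6
```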
78. Method of moments
The oldest method.
Distribution dependent: geometric, Poisson, Bernoulli, ...
Depends upon the pdf.
79. Method of Moments
Population moments can be estimated by sample moments, and this can be robust:
the sample mean estimates the population mean, and the sample variance estimates the population variance (see the sketch below).
Does not work well when the distribution is exponential.
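A minimal sketch (my addition; the gamma distribution and its parameters are illustrative assumptions, not from the slides) of matching sample moments to parameters:

```python
import numpy as np

# Hypothetical sample drawn from a gamma distribution (illustrative only)
rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.0, scale=3.0, size=10_000)

# Method of moments for gamma(shape k, scale theta):
# mean = k * theta, variance = k * theta^2 -> solve for k and theta
mean, var = sample.mean(), sample.var()
theta_hat = var / mean            # ~ 3.0
k_hat = mean / theta_hat          # ~ 2.0
print(k_hat, theta_hat)
```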