Inferential Statistics & Regression

Statistical Inference and Linear Regression
Reference: Paulsen, Kurt. (2005). Planning Methods. Temple University.

Overview
• INFERENTIAL statistics is the branch of statistics that allows us
to draw conclusions about the data or to test hypotheses.
“Statistical inference is the act of reaching conclusions
about the world based on a set of data, and then
evaluating the reliability of those conclusions.”
• Inferential statistics is a problem-solving approach that tries to infer
the properties of a population from a sample.
• Its aim is to predict or estimate the characteristics of a population
from the characteristics of sample data, and to evaluate the
reliability of the result.

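The Statistical Process
• Population: described by parameters (μ, σ², ρ, etc.)
• Sample: described by statistics (X̄, S², r, etc.)
• Descriptive statistics summarize the sample.
• Inferential statistics go from the sample statistics to the population
parameters through estimation and hypothesis testing.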

Inferential statistics used to draw conclusions from samples
• Z-test: tests the mean of a sample against the population mean when
the sample size is greater than 30.
• t-test: compares the means of 2 samples when the sample size is 30 or less.
• F-test: compares the means of 3 or more samples.
• χ² test: tests the independence of characteristics recorded as counts
or frequencies.
• r_xy: tests the relationship between characteristics measured on an
interval or ratio scale.

Confidence Intervals
• Central Limit Theorem:
• The sample mean is distributed approximately as a normal curve ("N")
whose mean equals the true mean (μ) and whose standard deviation equals
the “standard error” (σ divided by the square root of n, where n is
the sample size):
$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
The standard deviation of the sampling distribution is therefore a
function of the sample size.
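
The standard error claim can be checked with a small simulation. The sketch below is only an illustration: the Uniform(0, 1) population, the sample size of 25, and the 2000 repetitions are all assumptions chosen for the example, not values from the slides.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

n = 25                 # hypothetical sample size
num_samples = 2000     # hypothetical number of repeated samples
sigma = sqrt(1 / 12)   # standard deviation of a Uniform(0, 1) population

# Draw many samples and keep the mean of each one.
sample_means = [
    mean([random.random() for _ in range(n)])
    for _ in range(num_samples)
]

observed_se = stdev(sample_means)    # spread of the sample means
theoretical_se = sigma / sqrt(n)     # sigma divided by the square root of n

print(round(observed_se, 4), round(theoretical_se, 4))
```

The two printed numbers should be close to each other, and the observed spread should shrink if n is increased.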

“Standardizing” or z-scores
• We can use the sampling distribution to express the distribution of
any sample mean on a common scale, called the STANDARD NORMAL:
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
• This value is usually called the "z-test statistic" (or sometimes the
"z-score"). The z-test statistic follows the STANDARD NORMAL CURVE.

“Z-values”
• A “z-value”, or “standard score”, is a raw score that has been
transformed so that it is easier to interpret.
• The conversion from raw scores to standard scores is a statistical
transformation that leaves the shape of the original distribution unchanged.
• Computing the standard score Z uses the mean and the standard
deviation of each set of scores:
$$Z = \frac{X - \bar{X}}{SD}$$
• where Z is the standard score of each individual
• X is the raw score of each individual
• X̄ is the mean of the scores in each class
• SD is the standard deviation of that set of scores
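
The formula can be sketched in a few lines of Python. The raw scores below are hypothetical, and using the population standard deviation (`pstdev`) for SD is an assumption; swap in `stdev` if SD is meant as the sample standard deviation.

```python
from statistics import mean, pstdev

def z_scores(raw_scores):
    """Return the standard score (X - mean) / SD for each raw score."""
    avg = mean(raw_scores)
    sd = pstdev(raw_scores)  # assumption: SD is the population standard deviation
    return [(x - avg) / sd for x in raw_scores]

scores = [50, 60, 70, 80, 90]  # hypothetical raw scores for one class
z = z_scores(scores)
print([round(v, 2) for v in z])
```

After the transformation the scores have mean 0 and standard deviation 1, while their relative spacing is unchanged.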

Computing “Critical Values”
• By definition: α = 1 - p, or p = 1 - α
• Thus, if we want to find the central 95 percent of a standard normal curve,
we define p = 0.95 and α = 0.05.
• With p = 0.95 and hence α = 0.05, we want 95 percent of the
probability to lie within our area and 5 percent to lie outside it.
• Since a normal curve is symmetrical, having 5 percent of the probability in
the tails means having 2.5 percent in each tail. That is, we take the value
of α and divide it by 2 to locate each of the two cut-off points.

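• When we consider both tails of the normal curve (“two-sided”), we are
looking at the percentage of the distribution that lies between 2 values.
This defines a CONFIDENCE INTERVAL.
• So if we want a CONFIDENCE INTERVAL for the true mean (μ), we can
express it with the following equation:
$$\Pr\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
That is, the probability that the mean is between these 2 values is 1 - α.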

Sample Sizes and Confidence Intervals
• KEY POINT: As the sample size increases, the interval (in
which we are p percent “confident” that the true population
mean lies) gets narrower and narrower. CONFIDENCE INTERVALS
ARE SAMPLE SIZE DEPENDENT!

• ชวงความมั่นใจ (confidence interval) 95 percent หมายถึงอะไร?
หมายถึงถาเราคำนวณคาเฉลี่ยของกลุมตัวอยาง sample average
เปนการประมาณการคา true population mean จำนวน100 ครั้ง เรามี
ความมั่นใจวา 95 ครั้งจาก 100 ครั้ง เราจะสามารถไดคาเดียวกับ true
value of μ (“true” population mean)
• เราใชความรูทางสถิตินี้ในการแสดงความมั่นใจวาเราประมาณคาของ
ประชากรไดอยางมีความแมนยำ โดยไมตองทำการทดสอบกลุมตัวอยาง 100
ครั้ง!

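• To make the equation more concrete, we replace the symbol α with actual
values: at 95 percent confidence, p = 0.95 and α = 0.05.

• The z-values for 95 percent confidence are -1.96 and 1.96, so the
equation becomes:
$$\Pr\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$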
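
A minimal sketch of the interval calculation, assuming the population standard deviation σ is known. The sample mean of 69, σ of 3, and the sample sizes are hypothetical inputs chosen only to show how the interval narrows as n grows.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(sample_mean, sigma, n, p=0.95):
    """Return (low, high) for the mean, assuming a known population sigma."""
    z = NormalDist().inv_cdf(1 - (1 - p) / 2)  # about 1.96 when p = 0.95
    margin = z * sigma / sqrt(n)               # z times the standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: the same mean and sigma, with growing sample sizes.
for n in (25, 100, 400):
    low, high = confidence_interval(sample_mean=69.0, sigma=3.0, n=n)
    print(n, round(low, 2), round(high, 2), round(high - low, 2))
```

Each fourfold increase in n halves the width of the interval, because the standard error scales with one over the square root of n.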

Normal Distribution
• โคงระฆังคว่ำ หรือที่เรียกวา Normal Curve หรือ Gaussian
Curve (ตามชื่อของนักวิทยาศาสตรชาวเยอรมัน Karl
Friedrick Gauss, 1777-1855)

Standard Normal Distribution
• Positions on the standard normal distribution are called Z values.
• The standard normal distribution is centered on the mean μ = 0 and has
standard deviation σ = 1.
• Each Z value is the number of standard deviations away from the mean.
For example, Z = 1.5 is the point 1.5 standard deviations above the mean.

Z Value
• To find the probability associated with a given Z value, such as
Z = 1.5, we compute the area under the curve.
From the Z table, Pr(Z > 1.5) = 0.0668.

Examples
c. Pr(1.0 < Z < 1.5) = 0.1587 - 0.0668 = 0.0919 ≈ 9%

d. Pr(-1 < Z < 2) = 1 - 0.1587 - 0.0228 = 0.8185 ≈ 82%

e. Pr(-2 < Z < 2) = 1 - 0.0228 - 0.0228 = 0.9544 ≈ 95%
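
The table lookups in these examples can be reproduced, up to table rounding, with the standard library. This is only a sketch; the helper names `upper_tail` and `between` are invented for the example.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def upper_tail(z):
    """Pr(Z > z): area under the curve to the right of z."""
    return 1 - Z.cdf(z)

def between(a, b):
    """Pr(a < Z < b): area under the curve between a and b."""
    return Z.cdf(b) - Z.cdf(a)

print(round(upper_tail(1.5), 4))      # Pr(Z > 1.5)
print(round(between(1.0, 1.5), 4))    # example c
print(round(between(-1.0, 2.0), 4))   # example d
print(round(between(-2.0, 2.0), 4))   # example e
```

Small differences in the last decimal place against the Z table are expected, since the table values are themselves rounded before being subtracted.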

Z Value
• The critical value X = 74 is standardized using its mean μ = 69 and
standard deviation σ = 3:
$$Z = \frac{X - \mu}{\sigma} = \frac{74 - 69}{3} = \frac{5}{3} = 1.67$$
• Pr(Z > 1.67) = 0.0475 ≈ 5%

Example
• Suppose the yearling trout in a lake have
lengths that are approximately normally
distributed, with mean μ = 9.5” and
standard deviation σ = 1.4”. What
proportion of them:
a. Exceed 12” (the length for keeping a
catch)?
b. Exceed 10” (the newly proposed legal
length)?

a.
$$Z = \frac{X - \mu}{\sigma} = \frac{12.0 - 9.5}{1.4} = \frac{2.5}{1.4} = 1.79$$
Thus
Pr(X > 12) = Pr(Z > 1.79)
= 0.037 ≈ 4%

b.
$$Z = \frac{X - \mu}{\sigma} = \frac{10.0 - 9.5}{1.4} = \frac{0.5}{1.4} = 0.36$$
Thus
Pr(X > 10) = Pr(Z > 0.36)
= 0.359 ≈ 36%
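
The trout example can be sketched directly in Python. The helper name `proportion_exceeding` is invented for this illustration, and the results differ slightly from the table-based answers because Z is not rounded to two decimals first.

```python
from statistics import NormalDist

MU = 9.5      # mean trout length in inches (from the example)
SIGMA = 1.4   # standard deviation in inches (from the example)

def proportion_exceeding(length):
    """Return (Z, Pr(X > length)) for lengths distributed N(MU, SIGMA)."""
    z = (length - MU) / SIGMA            # standardize: Z = (X - mu) / sigma
    return z, 1 - NormalDist().cdf(z)    # upper-tail area of the standard normal

z12, p12 = proportion_exceeding(12.0)
z10, p10 = proportion_exceeding(10.0)
print(round(z12, 2), round(p12, 3))  # part a
print(round(z10, 2), round(p10, 3))  # part b
```

Roughly 4 percent of the trout exceed 12 inches and roughly 36 percent exceed 10 inches, in line with the worked answers.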

CORRELATION and REGRESSION
• Correlation: correlation measures the strength of
the relationship between variables, or the degree
to which two variables are correlated (co-related).
Another way to think of it is as a
measure of the extent to which two variables
"move together": as one changes, how does the
other one change? The correlation measure is a
"dimensionless" number, and can therefore be
used to compare "apples" and "oranges", that is,
variables measured in different units.

• Pearson's Correlation Coefficient measures the linear relationship
between 2 variables. It is not appropriate when the variables have a
curved (non-linear) relationship or when there is an unusually large
number of outliers.
• The function for computing Pearson's correlation in Microsoft Excel is
"=CORREL".

• ถา | r | มีคามาก หมายถึง x และ y มีความสัมพันธกันมาก
• r = 0 หมายถึง x และ y ไมมีความสัมพันธกัน
• r > 0 หมายถึง x มีคาเพิ่มขึ้น แลว y จะมีคาเพิ่มขึ้น หรือ ถา x
มีคาลดลงแลว y จะมีคาลดลง
• r < 0 หมายถึง x มีคาเพิ่มขึ้น แลว y จะมีคาลดลง หรือ ถา x มี
คาลดลงแลว y จะมีคาเพิ่มขึ้น
• คา b และ r จะมีเครื่องหมายเหมือนกัน

REGRESSION
• What is a regression? Informally, it is a line fitted between two
variables to estimate the (linear) relationship between them.
When we have more than one "predictor"
variable, it is a multi-dimensional plane describing the relationship
between the variables.
• One way to think about regression is that it is a way to test the
statistical effect of one variable on another variable, holding all
other variables constant.

Sample data
Month        1  2  3  4  5  6  7  8  9 10 11 12
Temperature 18 24 33 37 34 28 32 27 28 27 21 19
Protesters  43 38 32 37  5  0  0  0  0  8 23 49
• Used to describe the relationship between 2 data series where one
influences the other (regression) and between 2 data series that are
associated with each other (correlation).
• Uses the equation y = a + bx, where:
• y = the value on the regression line, computed for each given value of x
• a = the intercept on the y-axis (Intercept)
• b = the slope of the line, or regression coefficient (Regression Coefficient)
• x = independent variable (Independent Variable)
• y = dependent variable (Dependent Variable)
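
A minimal least-squares sketch for y = a + bx on the sample data follows. Treating temperature as x and the number of protesters as y is an assumption made for illustration, and `fit_line` is a made-up helper name.

```python
from statistics import mean

# Sample data from the table; temperature as x and protesters as y is an assumption.
temperature = [18, 24, 33, 37, 34, 28, 32, 27, 28, 27, 21, 19]
protesters = [43, 38, 32, 37, 5, 0, 0, 0, 0, 8, 23, 49]

def fit_line(x, y):
    """Return (a, b) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope (regression coefficient)
    a = y_bar - b * x_bar    # intercept on the y-axis
    return a, b

a, b = fit_line(temperature, protesters)
predicted = [a + b * xi for xi in temperature]  # points on the fitted line
print(round(a, 2), round(b, 2))
```

The slope comes out negative, so the fitted line falls as temperature rises; the `predicted` values correspond to the "Predicted Y" series in a line fit plot.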

[Figure: X Variable 1 Line Fit Plot, showing Y and Predicted Y against X Variable 1]
Correlation coefficient (r_xy)
• r_xy tests the relationship between characteristics measured on an
interval or ratio scale. The resulting value tells us whether the factors
being compared change together, and whether they move in the same
direction or in opposite directions.
Month        1  2  3  4  5  6  7  8  9 10 11 12
Protesters  43 38 32 37  5  0  0  0  0  8 23 49
Temperature 18 24 33 37 34 28 32 27 28 27 21 19
r = -0.40
This shows that the surveyed number of protesters is only weakly related
to temperature, and that the two move in opposite directions.
