Lesson 1
INTRODUCTION
SPSS (originally Statistical Package for the Social Sciences, later Statistical Product and Service
Solutions) is one of the most widely used software packages for statistical measurement and analysis.
It provides tools for calculating descriptive statistics, presenting data in various graphs,
performing statistical inference, and more.
Manual calculation has many limitations, especially when the sample is large. It can produce
inaccurate results, which in turn affect the accuracy of the interpretation and analysis.
Statistical software helps improve both accuracy and efficiency.
Besides SPSS, other statistical software includes MINITAB, SAS, Stata, LISREL, Excel, and PSPP.
Among them, PSPP is free software that you can easily download from the internet.
Understanding the software (SPSS, PSPP)
The software has three main windows: the Data Editor, the Output Viewer, and the Syntax Editor.
The two used in these lessons are:
1. Data Editor window
It opens automatically when you start the software. It has two views, Data View and Variable
View, and is used in the first step, entering the data.
Variable View: used to define the variables and their settings
Data View: used to enter the data
2. Output Viewer window
This window opens automatically after you run a data processing command in the software.
How to Define and Input Data
Use the Data Editor window to enter data into SPSS or PSPP. The following steps show how to
create a data set.
1. Define the variables in Variable View. Variable View provides several settings that need to
be filled in, such as:
a. Variable name: choose your own name. If you do not, the software automatically
generates names such as VAR00001, VAR00002, etc.
b. Data type: Numeric, String, Date, etc. The default data type is Numeric.
c. Variable label and value labels: value labels are used when the data are categorical.
2. Once the settings are done, go to Data View and enter the values of the variables you
created.
Lesson 2
Descriptive Statistics
How do we produce descriptive statistics using SPSS or PSPP?
Study Case 1:
A random sample of 12 joggers was asked to keep track of and report the number of miles they ran
last week. The responses are:
5.5 7.2 1.6 22.0 8.7 2.8 5.3 3.4 12.5 18.6 8.3 6.6
a. Compute all three statistics that measure central tendency.
Analyze > Descriptive Statistics > Descriptives / Frequencies
b. Briefly describe what each statistic tells you.
c. Compute all the measures of variability.
Analyze > Descriptive Statistics > Descriptives
d. What is the interpretation?
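As a cross-check on the menu output, the same measures can be computed outside SPSS/PSPP. The sketch below is only an illustration using Python's standard `statistics` module (not part of the original lesson); it assumes the sample formulas (dividing by n - 1) for variance and standard deviation.

```python
import statistics

# Miles run last week by the 12 joggers in Study Case 1.
miles = [5.5, 7.2, 1.6, 22.0, 8.7, 2.8, 5.3, 3.4, 12.5, 18.6, 8.3, 6.6]

# Central tendency: mean, median, and mode.
mean = statistics.mean(miles)
median = statistics.median(miles)
modes = statistics.multimode(miles)  # returns every most-frequent value

# Variability: range, sample variance, and sample standard deviation.
data_range = max(miles) - min(miles)
variance = statistics.variance(miles)
std_dev = statistics.stdev(miles)

print(f"mean = {mean:.2f}, median = {median:.2f}")
if len(modes) == len(miles):
    print("every value occurs once, so there is no single mode")
else:
    print(f"mode(s) = {modes}")
print(f"range = {data_range:.1f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
```

The mean is pulled above the median by the two largest values (18.6 and 22.0), which is worth mentioning when answering part b.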
Study Case 2:
Has the educational level of adults changed over 15 years? To help answer this question, the
Bureau of Labor Statistics compiled the following table, which lists the number (in thousands) of
adults 25 years of age and older who are employed. Use a graphical technique to present these figures.
                         1992     1995     2000     2004
Less than high school   13418    11972    12486    12513
High school             37910    36692    37699    37790
Some college            27048    30927    33257    34412
College graduate        28113    31149    36619    40418
Answer:
Steps:
1. Create the variables and input the data.
2. Create a chart to see the differences.
Study Case 3
The raw data are given below:
Id Num   Name         Gender   Marital St   Height   Weight   DoB
785      Aminah       Female   Married      147.6    55.5     15-Feb-1953
756      Imas         Female   Married      151      42       30-Jun-1986
757      Tn. Rafius   Male     Married      162.4    61.4     30-Jun-1960
788      Ismet        Male     Married      165      64.5     15-Jan-1967
793      Esih         Female   Widowed      158      60       7-May-1950
803      Sumiati      Female   Married      156.5    60.1     19-Aug-1950
811      Romlah       Female   Widowed      152.7    57.7     12-May-1987
856      Dudung       Male     Single       167      56       16-Sep-1988
876      Fernando     Male     Single       170      60       17-Oct-1992
888      Marimar      Female   Married      168      55       17-Dec-1979
a. Input the data into PSPP/SPSS.
b. Give some descriptive measures (central tendency and variability) of the Height variable.
c. Interpret the standard deviation of the Height variable.
d. Show the proportions of Marital Status using a pie chart.
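For parts b and d, the numbers behind the software output can be checked with a short script. This is a minimal sketch in Python (an assumption of this note, not the tool used in the lesson); the lists simply copy the Height and Marital St columns of the raw data in row order.

```python
import statistics
from collections import Counter

# Height and marital status of the 10 respondents, in the order of the raw data.
heights = [147.6, 151, 162.4, 165, 158, 156.5, 152.7, 167, 170, 168]
marital = ["Married", "Married", "Married", "Married", "Widowed",
           "Married", "Widowed", "Single", "Single", "Married"]

# Central tendency and variability of Height (sample standard deviation).
mean_height = statistics.mean(heights)
median_height = statistics.median(heights)
sd_height = statistics.stdev(heights)
range_height = max(heights) - min(heights)

# Proportion of each marital status: these are the slices of the pie chart.
counts = Counter(marital)
shares = {status: count / len(marital) for status, count in counts.items()}

print(f"Height: mean = {mean_height:.2f}, median = {median_height:.2f}, "
      f"std dev = {sd_height:.2f}, range = {range_height:.1f}")
for status, share in shares.items():
    print(f"{status}: {share:.0%}")
```

The shares computed here are what the pie chart in part d displays as slice sizes.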
Study Case 4
(Xr 04-36). Everyone is familiar with waiting lines or queues. For example, people wait in line at
a supermarket to go through the checkout counter. There are two factors that determine how long
the queue becomes. One is the speed of service. The other is the number of arrivals at the
checkout counter. The mean number of arrivals is an important number, but so is the standard
deviation. Suppose that a consultant for the supermarket counts the number of arrivals per hour
during a sample of 150 hours.
a. Compute the minimum, maximum, mean, and standard deviation of the arrivals variable.
b. Create a histogram and comment on the skewness of the distribution.
c. Assuming the distribution is bell shaped, interpret the standard deviation.
Lesson 3
Correlation and Regression
This lesson shows how to obtain several correlation measures (covariance, coefficient of
correlation, and coefficient of determination) and how to find the regression line for a scatter
of data.
Some equivalent terms (textbook term and its label in the software output):
Covariance                     Covariance
Coefficient of correlation     Pearson Correlation
Coefficient of determination   R Square
Study Case 1
A retailer wanted to estimate the monthly fixed and variable selling expenses. As a first step she
collected data from the past 8 months. The total selling expenses (in $ thousands) and the total
sales (in $ thousands) were recorded and are listed below.
Total Sales   Selling Expenses
20            14
40            16
60            18
50            17
50            18
55            18
60            18
70            20
a. Compute the covariance, the coefficient of correlation, and the coefficient of determination, and
describe what these statistics tell you.
Answer (using PSPP):
To obtain the covariance and the coefficient of correlation:
Analyze > Bivariate Correlation
The resulting table gives the Pearson Correlation (coefficient of correlation), which is 0.97.
Interpretation: the Pearson correlation is 0.97, which is very close to +1. It means that
selling expenses and total sales have a very strong positive linear relationship.
Note: in SPSS this correlation table can also report the covariance, but not in PSPP.
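Because PSPP does not print the covariance, it can be useful to compute it by hand. The following is a minimal sketch in Python (chosen here for illustration only), using the sample covariance (dividing by n - 1) and the usual Pearson formula on the eight observations of this case.

```python
import math

# Total sales (x) and selling expenses (y), both in $ thousands.
sales = [20, 40, 60, 50, 50, 55, 60, 70]
expenses = [14, 16, 18, 17, 18, 18, 18, 20]

n = len(sales)
mean_x = sum(sales) / n
mean_y = sum(expenses) / n

# Sums of cross-products and squared deviations around the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sales, expenses))
sxx = sum((x - mean_x) ** 2 for x in sales)
syy = sum((y - mean_y) ** 2 for y in expenses)

covariance = sxy / (n - 1)       # sample covariance
r = sxy / math.sqrt(sxx * syy)   # coefficient of correlation (Pearson)
r_squared = r ** 2               # coefficient of determination

print(f"covariance = {covariance:.2f}")
print(f"r = {r:.2f}, R square = {r_squared:.2f}")
```

The correlation and R square agree with the values reported in the software output (0.97 and 0.95).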
To obtain the coefficient of determination (R square):
Analyze > Linear Regression
The resulting table gives an R square of 0.95. It means that about 95% of the variation in
selling expenses can be explained by the variation in total sales. The remaining 5% is
unexplained.
b. Determine the least squares line and use it to produce the estimates the retailer wants.
Answer (using PSPP):
To obtain the least squares line:
Analyze > Linear Regression
The resulting table gives the coefficients of the least squares line. Based on the table, the
least squares line is y = 0.11x + 11.66, where y is selling expenses and x is total sales.
The retailer wants to estimate the fixed and variable selling expenses using the least squares
line:
The fixed selling expenses are the intercept, $11.66 thousand. This is the amount of selling
expenses that has to be covered even when there are no sales.
The variable selling expenses are given by the slope, 0.11. It means that each additional
$1 thousand of total sales increases selling expenses by about $0.11 thousand.
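The slope and intercept read from the regression table can be reproduced with the least squares formulas. This is a minimal sketch in Python, given for illustration; the helper `predict_expenses` and the sales figure of 65 are hypothetical and not part of the original exercise.

```python
# Total sales (x) and selling expenses (y), both in $ thousands.
sales = [20, 40, 60, 50, 50, 55, 60, 70]
expenses = [14, 16, 18, 17, 18, 18, 18, 20]

n = len(sales)
mean_x = sum(sales) / n
mean_y = sum(expenses) / n

# Least squares estimates: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sales, expenses))
sxx = sum((x - mean_x) ** 2 for x in sales)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict_expenses(total_sales):
    """Estimated selling expenses ($ thousands) for a given level of total sales."""
    return intercept + slope * total_sales


print(f"least squares line: y = {slope:.2f}x + {intercept:.2f}")
print(f"fixed expenses (sales = 0): {predict_expenses(0):.2f}")
print(f"estimate at hypothetical sales of 65: {predict_expenses(65):.2f}")
```

Evaluating the line at zero sales returns the intercept, which is why the intercept is read as the fixed selling expenses.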
Lesson 4. Statistical Inference for the Mean
A. One population
The basic idea of inference about the mean of one population is to describe the population mean
using information from a sample. The one-sample t-test is provided in SPSS and PSPP. The p-value is
the quantity to consider when deciding whether to reject the null hypothesis: if the p-value is less
than the significance level, the null hypothesis is rejected.
Study Case:
(Xr 12-23) [ Mean analysis for one population]
A diet doctor claims that the average North American is more than 20 pounds
overweight. To test his claim, a random sample of 20 North Americans was weighed, and
the difference between their actual weight and their ideal weight was calculated.
a. Do the data allow us to infer at the 5% significance level that the doctor’s claim is
true?
b. What is the 95% confidence interval estimate of the mean overweight?
Steps:
1. Input the data in one column (one variable, since there is only one sample).
2. Analyze > Compare Means > One Sample t-test
3. Put the Overweight variable into the test variable box and the hypothesized population mean into
the test value box. Click Options and set your confidence level.
4. Click OK; you will get the result below.
One-Sample Test (Test Value = 20)
                                                           95% Confidence Interval
                                                           of the Difference
             t      df   Sig. (2-tailed)  Mean Difference  Lower     Upper
Overweight   .562   19   .581             .850             -2.31     4.01
Interpretation:
a. The appropriate hypotheses for this case are:
H0: μ = 20
H1: μ > 20
α = 0.05
This is a one-tailed test, so the p-value is 0.581/2 = 0.2905.
Since the p-value of 0.2905 is greater than α, the null hypothesis is not rejected. There is
not sufficient evidence to support the doctor's claim.
b. The 95% confidence interval for the mean difference from the test value is [-2.31 : 4.01].
Adding the test value of 20 gives a 95% confidence interval for the mean overweight of
[17.69 : 24.01] pounds.
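The figures in the One-Sample Test output can be reproduced from the summary statistics reported for this case (N = 20, mean 20.85, standard deviation 6.761). The sketch below uses plain Python for illustration. Two things in it are assumptions of this note rather than part of the lesson: the critical value 2.093 is read from a standard t table for 19 degrees of freedom, and the p-value is obtained by numerically integrating the t density with Simpson's rule instead of calling a statistics library.

```python
import math

# Summary statistics reported for the Overweight variable.
n = 20
sample_mean = 20.85
sample_sd = 6.761
test_value = 20
df = n - 1

# Assumption: two-sided 95% critical value for df = 19, from a standard t table.
t_crit = 2.093

std_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - test_value) / std_error


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: 0.5 minus the Simpson's-rule integral from 0 to t."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


p_one_tailed = t_upper_tail(t_stat, df)   # matches H1: mean > 20
p_two_tailed = 2 * p_one_tailed           # what the software reports as Sig. (2-tailed)

# Interval for the difference from the test value, and for the mean itself.
margin = t_crit * std_error
ci_difference = (sample_mean - test_value - margin, sample_mean - test_value + margin)
ci_mean = (sample_mean - margin, sample_mean + margin)

print(f"t = {t_stat:.3f}, one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.3f}")
print(f"95% CI of the difference: [{ci_difference[0]:.2f} : {ci_difference[1]:.2f}]")
print(f"95% CI of the mean:       [{ci_mean[0]:.2f} : {ci_mean[1]:.2f}]")
```

Halving the two-tailed value is only appropriate here because the sample mean (20.85) lies on the side of the alternative hypothesis.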
B. Inference for two independent samples
The basic idea of inference for two independent populations is to describe the difference between
the means of the two populations using information from the samples. The independent samples t-test
is provided in SPSS and PSPP. The p-value is the quantity to consider when deciding whether to
reject the null hypothesis: if the p-value is less than the significance level, the null hypothesis
is rejected.
One-Sample Statistics (output of the one-sample study case in Section A)
             N    Mean    Std. Deviation   Std. Error Mean
Overweight   20   20.85   6.761            1.512
Hypothesis testing for two populations is used when a business analyst or researcher wants to
observe or compare the conditions of two populations. For example:
1. Compare expenditures on shoes in 2000 with those in 2010 to determine whether any change
occurred over time.
2. Estimate or test the difference in the market shares of two companies, or in one company's
market share in two different regions.
Study Case:
(Xr 13-08) [Mean analysis for two populations]
A men's softball league is experimenting with a yellow baseball that is easier to see during
night games. One way to judge its effectiveness is to count the number of errors. In a
preliminary experiment, the yellow baseball was used in 10 games and the traditional white
baseball was used in another 10 games. The number of errors in each game was recorded.
a. Can we infer that there are fewer errors on average when the yellow ball is used? (Use
α = 5%.)
b. What is the 95% confidence interval estimate of the mean difference?
Steps:
1. Input the data into the software.
2. Create two additional variables: one that combines the observations from both samples, and one
that labels each observation with the sample (group) it came from. This layout is needed because
the two independent samples do not have to be the same size.
3. Analyze > Compare Means > Independent Samples t-test
4. Put the combined data into the test variable box and the group variable into the grouping
variable box, then click Define Groups to set the group values.
5. Click OK; you will get the result below.
Group Statistics
              group    N    Mean   Std. Deviation   Std. Error Mean
observation   Yellow   10   5.10   2.424            .767
              White    10   7.30   2.406            .761

Independent Samples Test (observation)
Levene's Test for Equality of Variances
                              F      Sig.
Equal variances assumed       .001   .974

t-test for Equality of Means
                                                                                     95% Confidence Interval
                                               Sig.         Mean         Std. Error  of the Difference
                              t       df       (2-tailed)   Difference   Difference  Lower     Upper
Equal variances assumed       -2.037  18       .057         -2.200       1.080       -4.469    .069
Equal variances not assumed   -2.037  17.999   .057         -2.200       1.080       -4.469    .069
Interpretation
Hypothesis setup
The appropriate hypotheses for this case are:
H0: μ1 - μ2 = 0
H1: μ1 - μ2 < 0
α = 0.05
where μ1 is the mean number of errors with the yellow ball and μ2 is the mean with the white ball.
Levene's test: this test is used to decide whether equal variances can be assumed. If the
p-value (Sig.) of Levene's test is greater than the significance level α, equal variances are
assumed.
Here Levene's test gives Sig. = 0.974, which is greater than α = 0.05, so equal variances are
assumed and we read the results from the "Equal variances assumed" row.
The p-value is 0.057/2 = 0.0285 (one-tailed test).
Conclusion: the p-value of 0.0285 is less than α = 0.05, so the null hypothesis is rejected.
There is sufficient evidence that fewer errors are made on average when the yellow ball is used.
The 95% confidence interval for the difference in mean errors (yellow minus white) is
[-4.469 : 0.069].
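The "Equal variances assumed" row can be rebuilt from the Group Statistics table alone. The sketch below is an illustration in Python of the pooled-variance formulas. The critical value 2.101 is an assumption taken from a standard t table for 18 degrees of freedom, not a number shown in the output.

```python
import math

# Group statistics from the output: yellow ball (group 1), white ball (group 2).
n1, mean1, sd1 = 10, 5.10, 2.424
n2, mean2, sd2 = 10, 7.30, 2.406

# Assumption: two-sided 95% critical value for df = 18, from a standard t table.
t_crit = 2.101

df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df

# Standard error of the difference between the two sample means.
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

mean_diff = mean1 - mean2
t_stat = mean_diff / se_diff

# 95% confidence interval for the mean difference (yellow minus white).
ci_low = mean_diff - t_crit * se_diff
ci_high = mean_diff + t_crit * se_diff

print(f"df = {df}, mean difference = {mean_diff:.3f}, std. error = {se_diff:.3f}")
print(f"t = {t_stat:.3f}")
print(f"95% CI: [{ci_low:.3f} : {ci_high:.3f}]")
```

The two-sided interval includes zero while the one-tailed test rejects at 5%. This is not a contradiction, because the interval corresponds to the two-tailed p-value of 0.057.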
C. Paired Sample t-test
Besides the one-sample t-test and the independent samples t-test, SPSS and PSPP also provide the
paired t-test. The paired t-test is used when we have paired sample data. Paired sample data are
gathered from one population that received a treatment: to see the effect of the treatment, we
measure the condition before and after it. Paired samples are also known as two dependent
samples.
Study Case:
(Xr 13-44) [Mean analysis for two dependent sample (paired sample)]
The president of a large company is deciding whether to adopt a lunchtime exercise program. The
purpose of such a program is to improve the health of workers and, in so doing, reduce medical
expenses. To get more information, he instituted an exercise program for the employees in one
office. The president knows that during the winter months medical expenses are relatively high
because of the incidence of colds and flu. Consequently, he decided to use a matched pairs design,
recording medical expenses for the 12 months before the program and for the 12 months after the
program. The "before" and "after" expenses (in thousands of dollars) are compared on a
month-to-month basis and shown in the data.
a. Do the data indicate that exercise programs reduce medical expenses? (Use α = 5%.)
b. Estimate with 95% confidence the mean savings produced by exercise programs.
Steps:
1. Input the data into the software.
2. Analyze > Compare Means > Paired Samples t-test
3. Put the After variable under Var 1 and the Before variable under Var 2.
4. Click OK; you will get the result below.
Paired Samples Statistics
                  Mean    N    Std. Deviation   Std. Error Mean
Pair 1   After    43.50   12   18.618           5.375
         Before   46.58   12   16.670           4.812

Paired Samples Correlations
                          N    Correlation   Sig.
Pair 1   After & Before   12   .950          .000

Paired Samples Test
                          Paired Differences
                                                                95% Confidence Interval
                                   Std.        Std. Error       of the Difference                     Sig.
                          Mean     Deviation   Mean             Lower     Upper      t        df   (2-tailed)
Pair 1   After - Before   -3.083   5.885       1.699            -6.822    .656       -1.815   11   .097
Interpretation:
Hypothesis setup
The appropriate hypotheses for this case are:
H0: μ_DIFF = μ_after - μ_before = 0
H1: μ_DIFF = μ_after - μ_before < 0
α = 0.05
The p-value is 0.097/2 = 0.0485 (one-tailed test).
Conclusion: the p-value of 0.0485 is less than α = 0.05, so the null hypothesis is rejected.
There is sufficient evidence that medical expenses are lower when the lunchtime exercise program
is in place.
The 95% confidence interval for the mean difference in medical expenses (After minus Before) is
[-6.822 : 0.656], in thousands of dollars. Expressed as savings (Before minus After), the
interval is [-0.656 : 6.822].
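A paired t-test is a one-sample t-test applied to the within-pair differences, so the Paired Samples Test row can be checked from the mean and standard deviation of the differences. This is a minimal sketch in Python for illustration. The critical value 2.201 is an assumption taken from a standard t table for 11 degrees of freedom.

```python
import math

# Paired differences (After - Before) as reported in the output, $ thousands.
n_pairs = 12
mean_diff = -3.083
sd_diff = 5.885

# Assumption: two-sided 95% critical value for df = 11, from a standard t table.
t_crit = 2.201

df = n_pairs - 1
std_error = sd_diff / math.sqrt(n_pairs)
t_stat = mean_diff / std_error

# Confidence interval for the mean of (After - Before).
margin = t_crit * std_error
ci_after_minus_before = (mean_diff - margin, mean_diff + margin)

# The same interval expressed as savings (Before - After): flip signs and swap ends.
ci_savings = (-ci_after_minus_before[1], -ci_after_minus_before[0])

print(f"std. error = {std_error:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"95% CI (After - Before): [{ci_after_minus_before[0]:.3f} : {ci_after_minus_before[1]:.3f}]")
print(f"95% CI (savings):        [{ci_savings[0]:.3f} : {ci_savings[1]:.3f}]")
```

Working with the differences is what removes the month-to-month variation (for example the winter peak) from the comparison.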
Lesson 5. Chi-Square Goodness-of-Fit Test
The chi-square goodness-of-fit test is used to describe a population of nominal data. In a
binomial experiment, the nominal variable can take only one of two possible values, such as
success or failure; this is the basis for inference about a proportion. A binomial experiment
becomes a multinomial experiment when there are more than two possible outcomes. The chi-square
goodness-of-fit test is the statistical technique used to draw inferences about the proportions
in a multinomial experiment.
Study case:
A machine has a record of producing 80% excellent, 17% good, and 3% unacceptable parts.
After extensive repairs, a sample of 200 produced 157 excellent, 42 good, and 1 unacceptable
part. Have the repairs changed the nature of the output of the machine? Use PSPP with α = 0.05.
Steps:
1. Enter the categories into one variable and the observed frequencies into another
variable.
Category variable: Quality (1 = Excellent, 2 = Good, 3 = Unacceptable)
Figure 5.1
2. Weight the data by the observed frequencies: Data > Weight Cases > Weight cases
by Observed_freq
3. Run the chi-square test:
Analyze > Nonparametric Test > Chi Square
4. The output is:
5. Output analysis
Step 1: Hypotheses
H0: The repairs did not change the nature of the output of the machine.
[i.e., the proportions remained the same (π1 = 0.80, π2 = 0.17, π3 = 0.03)]
Ha: The repairs did change the nature of the output of the machine.
[i.e., the proportions changed after the repairs (at least one πi ≠ πi,0)]
Step 2: Significance Level
α = 0.05
Step 3: Rejection Region
Reject the null hypothesis if p-value ≤ 0.05 = α.
Step 4.1: Calculate Expected Frequencies
Step 4.2: Check Assumptions
According to footnote a (below), all expected frequencies are ≥ 5 (smallest value is 6).
Step 4.3: Test Statistic and P-value
Step 5: Decision
Since p-value = 0.0472 ≤ 0.05, we shall reject the null hypothesis.
Step 6: State conclusion in words
At the α = 0.05 level of significance, there is enough evidence to conclude that the
repairs changed the nature of the output of the machine (the proportions are not what they
used to be)
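The expected frequencies, test statistic, and p-value of Steps 4.1 to 4.3 can be verified by hand. The sketch below is an illustration in Python. It relies on the fact that, with 2 degrees of freedom, the upper-tail probability of the chi-square distribution has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

# Observed counts after the repairs: excellent, good, unacceptable.
observed = [157, 42, 1]

# Proportions under H0 (the machine's historical record).
hypothesized = [0.80, 0.17, 0.03]

n = sum(observed)

# Step 4.1: expected frequencies under H0.
expected = [n * p for p in hypothesized]

# Step 4.2: the rule of thumb requires every expected frequency to be at least 5.
assumption_ok = min(expected) >= 5

# Step 4.3: chi-square statistic and p-value.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = math.exp(-chi_square / 2)  # exact upper-tail probability only when df == 2

print("expected frequencies:", [round(e, 1) for e in expected])
print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= 0.05 else "do not reject H0")
```

Most of the statistic comes from the "unacceptable" category (1 observed against 6 expected), which is what pushes the p-value just under 0.05.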
Lesson 6. Chi-Square Test of a Contingency Table
The chi-square test of a contingency table is used to determine whether there is enough
evidence to infer that two nominal variables are related, and to infer that differences exist
between two or more populations of nominal variables.
Example:
Suppose we conducted a prospective cohort study to investigate the effect of aspirin on heart
disease. A group of patients who are at risk for a heart attack are randomly assigned to either a
placebo or aspirin. At the end of one year, the number of patients suffering a heart attack is
recorded.
H0: the two variables are independent (the treatment taken has no effect on having heart disease)
Ha: the two variables are dependent (the treatment taken has an effect on having heart disease)
            Heart Disease
Group       Yes (+)   No (-)   Total
Placebo     20        80       100
Aspirin     15        135      150
Total       35        215      250
Steps
1. Input the data. Create the variables: Heart_Disease, freq, Factor.
2. Weight the data by its frequency (Data > Weight Cases, using freq).
3. Analyze > Descriptive Statistics > Crosstabs
Put Factor in the row box and Heart_Disease in the column box, following the contingency table.
4. Results:
5. Analysis
Chi-square = 4.98
p-value = 0.03
The p-value of 0.03 is less than α = 0.05, so the null hypothesis is rejected. There is
sufficient evidence that the treatment taken has an effect on having heart disease.
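The chi-square value reported by Crosstabs can be recomputed from the 2 x 2 table. The sketch below is an illustration in Python of the Pearson chi-square without continuity correction (an assumption about which row of the output is being read). With 1 degree of freedom the upper-tail probability equals erfc(sqrt(x/2)), which the standard `math` module provides.

```python
import math

# Observed counts: rows = Placebo, Aspirin; columns = heart disease Yes, No.
observed = [[20, 80],
            [15, 135]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic (no continuity correction).
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = math.erfc(math.sqrt(chi_square / 2))  # exact upper tail only when df == 1

print("expected counts:", expected)
print(f"chi-square = {chi_square:.2f}, df = {df}, p-value = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

The p-value shown in the analysis is this figure rounded to two decimal places.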
Anyone who has never made a mistake has never tried anything new.
Albert Einstein
Do not worry about your difficulties in Mathematics. I can assure you mine are still greater
Albert Einstein
Three sentences for getting success: know more than others, work more than others,
and expect less than others.
William Shakespeare
REFERENCES
Keller. Managerial Statistics Abbreviated. South-Western Cengage Learning, 2009.
FMIPA Gadjah Mada University. Modul Praktikum Metode Statistika. 2003.