Using SPSS for Statistical Analysis
A course for Beginners
by Leo Fernandez
Session II: Describing Data
1 Types of Variables
In statistics, variables describe attributes of the objects being studied. The value of the variable can
'vary' from one entity or sample element to another.
For example, a person's nationality could be a variable if we are studying people. One person
could be "Mexican" and another "Sudanese". Further, if we consider the two entities described
above (a Mexican and a Sudanese), we might also observe some other attributes of these entities.
For example, the Mexican's height could be 5ft 2in and that of the Sudanese, 5ft 10in.
Variables can be grouped under two broad categories: Qualitative vs. Quantitative Variables.
Qualitative: Qualitative variables are also known as "categorical" variables. They describe
attributes of objects by names or labels. A person's religion (e.g., Hindu, Muslim, Christian)
or the colour of the person's eyes (e.g., black, brown, blue) are examples of qualitative or
categorical variables.
Quantitative: Quantitative variables are also known as "numeric" variables. They record a
measurable quantity. For example, when we speak of the population of a city, we are
talking about the number of people in the city - a measurable attribute of the city. Therefore,
population would be a quantitative variable.
In statistical data analysis, variables are of the following types:
Table - 01

| Type | Category | Description | Example |
|---|---|---|---|
| Nominal | Categorical | Indicates membership of a collection or category. There is no implied ordering. | Nationality: 1 = Australian, 2 = British, 3 = Canadian, 4 = Dane, 5 = Other |
| Ordinal | Categorical | Indicates a difference and the direction of the difference. The items in the category can be arranged from low to high. Differences between items are not in equal intervals. | Education: 1 = No education, 2 = Primary School, 3 = High School, 4 = Graduate, 5 = Postgraduate |
| Interval | Numeric | Indicates a difference with direction. Differences are in equal intervals. | Age, recorded in whole years |
| Ratio | Numeric | Indicates a difference with direction. Differences are in equal intervals. A zero point is defined. | Income |
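The four levels in Table - 01 form a ladder: each level permits everything the level below it permits, plus something more. The short Python sketch below illustrates that idea. It is only an illustration and not part of SPSS; the names `Level`, `allowed_statistics` and `EXAMPLES` are made up for this handout, and the list of statistics is a common textbook rule of thumb.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels of measurement from Table - 01, ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def allowed_statistics(level):
    """Return the summary measures that are usually considered meaningful at a level."""
    stats = ["mode"]                      # counting categories works at every level
    if level >= Level.ORDINAL:
        stats.append("median")            # needs an ordering from low to high
    if level >= Level.INTERVAL:
        stats.append("mean")              # needs equal intervals
    if level >= Level.RATIO:
        stats.append("ratio comparison")  # "twice as much" needs a defined zero point
    return stats


# The example variables of Table - 01, classified as in the table.
EXAMPLES = {
    "Nationality": Level.NOMINAL,
    "Education": Level.ORDINAL,
    "Age": Level.INTERVAL,
    "Income": Level.RATIO,
}

for variable, level in EXAMPLES.items():
    print(variable, level.name, allowed_statistics(level))
```

Reading the printed lines from top to bottom shows the list of meaningful statistics growing by one entry per level.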
2 SPSS: Reading data into SPSS
The SPSS program has an interface for data entry. We were introduced to that interface in Session
I. When a researcher decides to use SPSS for data analysis, it is more than likely that the data has
already been collected and stored using an office productivity tool like a spreadsheet program.
Data from external sources can be read into SPSS through the following steps:
1. In the SPSS program, navigate to File → Open → Data
A dialogue box will open.
2. In the dialogue box, click on the down arrow against the field named Files of Type.
Choose “Excel (*.xls, *.xlsx, *.xlsm)”
3. Navigate to the folder containing the Excel file that holds your data and select that file. [Use
the file titanic_ex_II.xlsx that was sent to you.]
4. Click “Open”
A dialogue box appears.
5. Make sure the check-box is ticked against the label “Read Variable names from the first
row of data”.
6. Click “OK”
The Excel file is loaded into SPSS.
Click on the Data View tab at the bottom of the screen.
Voilà! You see the data just as you did in your spreadsheet program.
Click on the Variable View tab at the bottom of the screen.
This screen displays the names of the columns from the imported Excel file and the properties
associated with each column.
You have successfully read an external data source into SPSS.
SPSS can recognize and read data directly from a select list of formats (as can be seen in the
drop-down for the Files of Type field of the File → Open → Data dialogue box).
Now that we have imported the data into SPSS, we can view the imported data in the Data
View and Variable View screens.
The Data View screen displays the data in rows and columns (like a spreadsheet). You can scroll
down the screen to verify that all the data has been correctly imported into the appropriate
columns.
The Variable View screen displays the column names and properties of the data contained in each
column.
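For readers who like to see the idea outside the SPSS menus, the sketch below shows what "read variable names from the first row of data" amounts to, using Python's standard csv module. The three data rows are made up for illustration and are not taken from the course file; the column names follow the Titanic variables mentioned in this session, and `read_cases` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample in CSV form; the first row holds the variable names.
SAMPLE = """pclass,survived,name,age,sibsp,fare
1,1,Passenger A,29,0,211.34
3,0,Passenger B,,1,7.25
2,1,Passenger C,41,0,13.00
"""


def read_cases(text):
    """Read CSV text, taking the variable names from the first row."""
    reader = csv.DictReader(io.StringIO(text))
    cases = list(reader)          # one dictionary per case (row)
    return reader.fieldnames, cases


variables, cases = read_cases(SAMPLE)
print(variables)                  # the column names, like the Variable View
print(len(cases), "cases read")   # the rows, like the Data View
print(repr(cases[1]["age"]))      # an empty cell arrives as an empty string
```

Note that every value arrives as text, including the empty Age cell of the second case. Deciding which columns are numeric and how missing values are marked is exactly the job of defining the variables.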
3 SPSS: Defining Variables
When you examine the imported data closely, you may notice that the column names are cryptic
(or if you had spaces in the column names of the spreadsheet, the spaces are removed and the
column name is a string of concatenated words). SPSS column names cannot contain spaces and
a few other special characters.
In SPSS, columns are called 'variables'.
It is considered good practice to assign descriptive labels to these variables and define their
properties before proceeding with the analysis of the data.
Defining a variable involves giving it a name, specifying its type, the values the variable can take
(e.g., 1, 2, 3), the scale of measurement and so on.
Variables can be defined in SPSS on either of the following two screens:
1. The Variable View screen
2. Data → Define Variable Properties screen
1. The Variable View screen
The Variable View screen lists the variables (columns) in the data file and the properties
associated with each of those variables:
Table - 02

| Property | Description |
|---|---|
| Name | The name of the variable. Variable names cannot contain spaces. To change a variable's name, double-click on the variable that you wish to rename and type the new variable name. |
| Type | The type of variable. This column refers to how the data is stored, the number of characters it can contain and other formatting information. This is not to be confused with the Types of Variables discussed at the beginning of Session II. SPSS recognizes the following types: Numeric, Comma, Dot, Scientific notation, Date, Dollar, Custom currency, String and Restricted Numeric (integer with leading zeros). To change a variable's type, click inside the cell in the “Type” column for that variable. A square "..." button will appear; click on it to open the Variable Type window. Click the option that best matches the type of variable. Click OK. |
| Width | The number of digits displayed for numerical values or the number of characters for a string variable. |
| Decimals | The number of digits after the decimal point for each value of the variable (applicable to numeric variables). |
| Label | A descriptive definition or display name for the variable. The variable label appears in the output in place of the variable's name (which is often cryptic). Example: The variable sibsp might be described by the label “Number of Siblings or Spouse on board”. |
| Value | For coded categorical variables, the value label(s) that should be associated with each category code. Value labels are useful primarily for categorical (i.e., nominal or ordinal) variables, especially if they have been recorded as codes (e.g., 1, 2, 3). It is good practice to give each value a label so that you (and anyone looking at your data or results) understand what each value represents. Example: In the sample dataset, the variable pclass represents the Passenger Class. The values 1, 2, 3 represent the categories “1st Class”, “2nd Class” and “3rd Class”, respectively. |
| Missing | The user-defined values that indicate data are missing for a variable (e.g., -99). Note that this does not affect or eliminate SPSS's default missing value code ("."). This column merely allows the user to specify alternative codes for missing values. |
| Columns | The width of each column in the Data View spreadsheet. |
| Align | The alignment of content in the cells of the Data View spreadsheet. |
| Measure | The level of measurement for the variable (e.g., nominal, ordinal, or scale). |
| Role | The role that a variable will play in your analyses (i.e., independent variable, dependent variable, both independent and dependent). Some options in SPSS allow you to pre-select variables for particular analyses based on their defined roles. Any variable that meets the role requirements will be available for use in such analyses. You can choose from the following roles for each variable. Input: The variable will be used as a predictor (independent variable); this is the default assignment for variables. Target: The variable will be used as an outcome (dependent variable). Both: The variable will be used as both a predictor and an outcome (independent and dependent variable). |
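The Value and Missing properties are easier to grasp with a small example. The Python sketch below mimics what they do: a stored code is shown with its label, and a user-defined missing code is treated as "no data". This is an illustration only and not how SPSS is implemented. The pclass labels and the code -99 come from Table - 02; using -99 for the variable age, and the names `VALUE_LABELS`, `USER_MISSING` and `display_value`, are assumptions made for the example.

```python
# Value labels: category code -> label (the pclass example from Table - 02).
VALUE_LABELS = {
    "pclass": {1: "1st Class", 2: "2nd Class", 3: "3rd Class"},
}

# User-defined missing codes per variable (assumed here: -99 for age).
USER_MISSING = {
    "age": {-99},
}


def display_value(variable, value):
    """Return the text a reader would see for a stored value."""
    # None stands in for the system-missing value (shown as "." in SPSS).
    if value is None or value in USER_MISSING.get(variable, set()):
        return "(missing)"
    # Fall back to the raw value when no label has been defined.
    return VALUE_LABELS.get(variable, {}).get(value, str(value))


print(display_value("pclass", 2))   # 2nd Class
print(display_value("age", -99))    # (missing)
print(display_value("age", 30))     # 30
```

The stored data never changes; only the way each value is interpreted and displayed does. That is why it is safe to add labels and missing-value codes after the data has been imported.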
2. Data → Define Variable Properties screen
The Define Variable Properties window is an efficient way of defining many variables at once, or
defining many variables that share the same formatting. Click Data → Define Variable Properties.
Figure - 01
The Define Variable Properties window will open.
Figure - 02
Select the variables you wish to define in the box on the left and click on the blue arrow button.
The selected variables will be moved to the box on the right under the heading “Variables to
Scan”. The Continue button is now enabled.
Click on Continue.
SPSS will scan the selected variables, identify the existing properties associated with them, and
display those properties on a screen where you can view and change them for each variable, as
shown in the following screen.
Figure - 03
On the screen in Figure - 03 you select each variable in turn from the scanned variables list and
enter the properties as described in Table - 02.
When you are done describing all the variables, click OK.
ADVANCED:
When you have completed defining the properties of all the variables, instead of clicking on the
OK button, you can click on the Paste button. This will open the SPSS Syntax Editor screen
into which all the SPSS commands used to define the variable properties will be pasted.
You can save this syntax to a file for future use. The next time you have to import your file
into SPSS, you will not need to repeat all the steps shown above to define the
variable properties. You can open the saved syntax file and execute all the commands in it;
the variable properties will then be defined.
4 Inspecting the data: Frequency Distributions
Before we get on with the analysis of the data, we need to inspect the data in order to:
- spot abnormalities and data entry errors
- observe extreme values (for example, Age could have been entered as 250 in a particular case)
- check if the data for each variable is within the defined range
- check for missing values
- identify variables that can be recoded into groups (e.g., Fare could be recoded into Low, Medium and High)
- get a general feel for the integrity and suitability of the data for further analysis
A useful first step is to run the SPSS Frequencies command, which is available from the menu.
1. Click on Analyze → Descriptive Statistics → Frequencies
2. Select all the variables in the list (except ones that represent serial number of cases or in
the example data set the “Name of Passenger” variable – because one would expect a
name to be unique to a passenger).
3. Click on the Statistics button
4. In the Frequency statistics window, place a check mark against: Mean, Median, Mode and
any other optional statistic that you may be interested in examining.
5. Click on Continue
6. Click on OK
Concept Check:
1) Give 3 examples of Nominal variables in the Titanic dataset.
ANSWER:
3) What is the difference between Nominal and Ordinal variables?
ANSWER:
4) List the variables in the Titanic dataset that:
a) Can be placed on a scale of measurement.
ANSWER:
b) Can be considered Ordinal Variables.
ANSWER:
c) Are strings.
ANSWER:
5) Can .docx files be read into SPSS?
ANSWER:
SPSS opens an Output Window and displays pages of summary statistics and frequency tables
for all the selected variables.
The summary statistics table gives the mean, median and mode for each variable. The mean is
meaningful only for numeric scale variables like Age and Fare. The table also shows the number
of missing cases for each variable.
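To make clear what goes into such a summary, the Python sketch below computes the same kind of figures by hand for a single variable: the count of each distinct value, the number of valid and missing cases, and the mean, median and mode. The ages are made up for illustration, `None` stands in for a missing value, and `frequencies` is a hypothetical helper and not an SPSS function.

```python
from collections import Counter
from statistics import mean, median, mode

# Made-up Age values; None marks a case with no Age recorded.
ages = [22, 38, None, 35, 22, None, 54, 22]


def frequencies(values):
    """Summarise one variable: frequency table, case counts and central tendency."""
    valid = [v for v in values if v is not None]
    return {
        "n_valid": len(valid),
        "n_missing": len(values) - len(valid),
        "table": sorted(Counter(valid).items()),  # (value, count) pairs
        "mean": mean(valid),
        "median": median(valid),
        "mode": mode(valid),
    }


summary = frequencies(ages)
for value, count in summary["table"]:
    print(value, count)
print("valid:", summary["n_valid"], "missing:", summary["n_missing"])
print("median:", summary["median"], "mode:", summary["mode"])
```

Notice that the missing cases are counted but left out of the mean, median and mode. The statistics describe the valid cases only, so always read them together with the number of missing cases.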
Inspect the frequency distribution table of each variable.
The frequency tables make it easy to:
- spot abnormal and extreme values (for example, Age could have been entered as 250 in a particular case)
- find data that is outside the defined range for a variable
- count the cases with missing values (i.e., cases which have no data recorded for the variable)
- identify variables that can be recoded into groups (e.g., Fare could be recoded into Low, Medium and High)
- get a general feel for the integrity and suitability of the data for further analysis
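Recoding a scale variable into groups simply means assigning each value to a band. A minimal Python sketch of the Fare example follows. The cut-off points of 10 and 50 are invented for illustration and would in practice be chosen after looking at the distribution of the data; `recode_fare` is a hypothetical helper name.

```python
def recode_fare(fare, low_max=10.0, medium_max=50.0):
    """Recode a numeric fare into Low / Medium / High (cut-offs are assumptions)."""
    if fare is None:
        return None            # a missing fare stays missing after recoding
    if fare <= low_max:
        return "Low"
    if fare <= medium_max:
        return "Medium"
    return "High"


fares = [7.25, 13.0, 71.28, None, 50.0]   # made-up values
print([recode_fare(f) for f in fares])
```

The recoded variable is ordinal: the groups have an order (Low, Medium, High), but the differences between them are no longer in equal intervals.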
As you will have observed, for variables measured on a scale (like Age and Fare), the
frequency table can be very long because many cases have a distinct value.
For scale variables, it is more informative to generate descriptive statistics.
1. Go to Analyze → Descriptive Statistics → Descriptives
2. Select the variables Age and Fare
3. Set the Options for the statistics you wish to see
4. Click OK.
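As a rough picture of what descriptive statistics for a scale variable contain, the Python sketch below computes the number of cases, the minimum, the maximum, the mean and the sample standard deviation for a handful of made-up fares. The helper name `descriptives` is hypothetical, and the selection of statistics is an assumption about what you might tick under Options.

```python
from statistics import mean, stdev

# Made-up Fare values for five cases.
fares = [7.25, 71.28, 8.05, 0.0, 13.0]


def descriptives(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std_dev": stdev(values),   # sample standard deviation (divides by n - 1)
    }


for name, value in descriptives(fares).items():
    print(name, round(value, 2))
```

Five numbers are enough here to see what a frequency table would hide: the minimum of 0 raises the question of who paid no fare, and the gap between the mean and the maximum hints at a few very high fares.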
We have used the frequency distribution here to detect wrongly coded values and to spot
abnormalities and extreme values in the data.
However, the frequency distribution plays a greater role in statistics. It provides a useful summary
of the data being studied. It is part of a collection of statistics known as Descriptive Statistics,
which are used to describe the data. In particular, the output of the Frequencies command gives
measures of central tendency (the mean, median and mode) and of dispersion (the spread of the
data) for each variable.
Test - 1
Look at the outputs of the Descriptives and Frequencies commands and answer the
following:
1) What is the mean Fare paid by passengers on the Titanic?
ANSWER:
2) What is the mode of the Fares paid by passengers on the Titanic?
ANSWER:
3) How many cases in the Titanic dataset do not have Age entered?
ANSWER:
4) What is the mean Age of passengers on the Titanic?
ANSWER:
5) What is the median Age of passengers on the Titanic?
ANSWER:
6) What is the proportion of passengers on the Titanic who survived?
ANSWER:
7) How many passengers on the Titanic did not pay any fare?
ANSWER:
5 SPSS: Histograms
While the Frequency distribution displays a table of numbers that summarizes the distribution of
values of each variable, showing how the values are spread from minimum to maximum, the
Histogram provides a graphical representation of the distribution.
In SPSS, histograms are produced from the same menu option that produced frequency tables.
1. Click on Analyze → Descriptive Statistics → Frequencies
2. Select the variables for which you want to produce histograms (select Age and Pclass as
an example)
3. At the bottom of the variable select screen, uncheck the check-box against the label
“Display Frequency Tables”
4. Click on the Charts button
5. Select the radio button Histograms
6. Click on Continue
7. Click on OK
The histogram will be displayed in the currently open SPSS output window.
Figure - 04
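A histogram is nothing more than a frequency table of intervals ("bins") drawn as bars. The Python sketch below builds one in text form from made-up ages, using bins that are 10 years wide. The bin width and the helper name `histogram` are assumptions for illustration; SPSS chooses its own bins when it draws the chart.

```python
def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width   # lower edge of the bin holding v
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Made-up ages in whole years.
ages = [2, 19, 22, 24, 27, 29, 31, 35, 38, 41, 54, 62]

for start, count in histogram(ages, 10).items():
    label = f"{start}-{start + 9}"
    print(f"{label:<6} {'#' * count}")
```

The longest row of # marks shows where most of the values lie. Bins that contain no cases are simply left out in this sketch, whereas a drawn histogram would show them as gaps.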
6 Correcting and Cleaning Data
The process of inspecting the data through frequency distributions and histograms often reveals
input errors and other problems with the data. The errors identified in the previous sections need to
be corrected before proceeding with the analysis.
What are these errors that we are talking about and how do we correct them if we find such errors?
Typical examples of data errors could be:
- incorrect coding of values
- typing mistakes
- shifting of data from one column into the neighboring column
- outliers or extreme values
Data cleaning typically takes up a large share of the time spent on data analysis. It is
nevertheless a very important step, because erroneous data can lead to erroneous conclusions.
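Finding the cases that contain such errors comes down to checking every value against a rule for its variable. The Python sketch below shows the idea with two rules: Age must lie in a plausible range, and pclass must be one of the defined codes 1, 2 or 3. The range of 0 to 120 for Age, the three cases and the names `RULES` and `find_errors` are assumptions made for illustration.

```python
# One validity rule per variable (the Age range is an assumed plausible range).
RULES = {
    "age": lambda v: v is None or 0 <= v <= 120,   # missing is allowed, 250 is not
    "pclass": lambda v: v in {1, 2, 3},            # only the defined category codes
}

# Made-up cases; the "case" key plays the part of the case (row) number.
cases = [
    {"case": 1, "age": 29, "pclass": 1},
    {"case": 2, "age": 250, "pclass": 3},
    {"case": 3, "age": None, "pclass": 4},
]


def find_errors(cases, rules):
    """Return (case number, variable, value) for every value that breaks a rule."""
    problems = []
    for case in cases:
        for variable, is_valid in rules.items():
            if not is_valid(case[variable]):
                problems.append((case["case"], variable, case[variable]))
    return problems


for problem in find_errors(cases, RULES):
    print(problem)
```

The result is a list of the cases to look up and correct in the data editor. A rule can only flag a suspicious value; deciding what the correct value should be still requires going back to the source of the data.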
This session will be conducted as a hands-on exercise under supervision, according to the
following instructions.
Lab Exercise: Correcting and Cleaning Data
1. Read the supplied data file: titanic_ex_II.csv
2. Re-run the commands used in Section 4 - Inspecting the data
3. Inspect the outputs produced.
4. Make a list of the errors identified in the outputs.
5. Identify the cases which have these errors.
6. Correct the errors using the data editor.
7. Re-run the commands used in Section 4 to confirm that the errors have been rectified.
8. Save the data file.
Session II: Homework Exercise:
1. Read the data from the file “body.csv” into SPSS. Study the accompanying file “body.txt”
which provides information about the dataset.
2. The article associated with this data set appears in the Journal of Statistics Education,
Volume 11, Number 2 (July 2003). Read this article here:
http://www.amstat.org/publications/jse/v11n2/datasets.heinz.html
3. Once the data has been read into SPSS, assign meaningful variable labels and value
labels, using the information provided in the file “body.txt”.
4. Produce frequency tables, histograms and box plots from this dataset.
OR
1. Read the data from the file “cafedata.xls” into SPSS. Study the accompanying file
“cafedata_documentation.txt” which provides information about this dataset.
2. The article associated with this data set appears in the Journal of Statistics Education,
Volume 19, Number 1 (March 2011) issue. Read this article here:
http://www.amstat.org/publications/jse/v19n1/depaolo.pdf
3. Once the data has been read into SPSS, assign meaningful variable labels and value
labels.
4. Produce some frequency tables and histograms.
Online Resources:
1. https://statistics.laerd.com/statistical-guides/types-of-variable.php
2. https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php