Descriptive Statistics.pptx

DESCRIPTIVE
STATISTICS
BADR EDDINE
IBN YAHIA

OUTLINE
• Introduction
• Frequency Distribution
• Measures of Central Tendency
• Measures of Variability
• Describing Interval and Ratio Data (Numerical
Scores)
• Describing Non-numerical Data from Nominal
and Ordinal Scales of Measurement
• Using Graphs to Summarize Data
• Correlations
• Regression
• Multiple Regression

• The general goal of descriptive statistics is to organize or summarize a set of
scores. Two general techniques are used to accomplish this goal.
• 1. Organize the entire set of scores into a table or a graph that allows researchers
(and others) to see the whole set of scores.
• 2. Compute one or two summary values (such as the average) that describe the
entire group.

FREQUENCY DISTRIBUTION
• A frequency distribution is an overview of all distinct values in some variable
and the number of times they occur.
• It consists of a tabulation of the number of individuals in each category on the
scale of measurement:
• 1. The set of categories that make up the scale of measurement.
• 2. The number of individuals with scores in each of the categories.

FREQUENCY DISTRIBUTION
• The advantage of a frequency distribution is that it allows a researcher to view the
entire set of scores. It presents raw data in an organized, easy-to-read format.
• The disadvantage is that constructing a frequency distribution without the aid of
a computer can be somewhat tedious, especially with large sets of data. The
primary drawback of frequency distributions is the loss of detail.

TABLE 15.1 IS A FREQUENCY DISTRIBUTION TABLE SUMMARIZING
THE SCORES FROM A 5-POINT QUIZ GIVEN TO A CLASS OF N = 15
STUDENTS.
• In this example, one person had a
perfect score of X=5 on the quiz, and
three people had scores of X=4.

• As another example, 183 students
filled out a questionnaire. One of the
questions asked which study major
they were following.

• The resulting table shows how
frequencies are distributed over
values -study majors in this example-
and hence is a frequency distribution.

RELATIVE FREQUENCIES
• Optionally, a frequency distribution
may contain relative frequencies:
frequencies relative to (divided by)
the total number of values. Relative
frequencies are often shown as
percentages or proportions.
• Relative frequencies provide easy
insight into frequency distributions.
Besides, they facilitate comparisons.
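
• A minimal sketch of a frequency distribution with relative frequencies. The 15 quiz scores below are hypothetical (the full Table 15.1 data is not reproduced in these slides); they only match the two facts stated above: one score of X=5 and three scores of X=4.

```python
from collections import Counter

# Hypothetical scores for N = 15 students on a 5-point quiz (assumed data).
scores = [5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1]

frequencies = Counter(scores)  # category -> number of individuals
n = len(scores)

# Relative frequency = frequency divided by the total number of values.
table = {
    x: (f, f / n)
    for x, f in sorted(frequencies.items(), reverse=True)
}

for x, (f, proportion) in table.items():
    print(f"X={x}  f={f}  proportion={proportion:.2f}  percent={proportion:.0%}")
```

• The proportions sum to 1, which is what makes relative frequencies convenient for comparing groups of different sizes.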

FREQUENCY DISTRIBUTION GRAPHS
• The graph shows the scale of measurement (set of categories) along the
horizontal axis and the frequencies on the vertical axis.
• When the measurement scale (scores) consists of numerical values (interval or
ratio scale of measurement), there are two options for graphing the frequency
distribution.

• A histogram is a graph that illustrates the relative frequency of a single variable.
• A polygon is a graph constructed by using lines to join points placed above the
midpoint of each interval, or bin, at a height equal to its frequency.

• Figure 15.1a is a traditional histogram
with a bar above each category.
Traditional histogram (a).

• In Figure 15.1b, the histogram is
modified slightly by changing each
bar into a stack of blocks.
• The modification helps emphasize
the concept of a frequency
distribution.
A modified histogram (b)

• Figure 15.1c presents the same data
in a polygon.
A polygon (c)

• It shows how frequencies are
distributed over values.

• When the categories on the scale of
measurement are nominal or ordinal
scales, the frequency distribution is
presented as a bar graph.
• Also, it is easy to see the extreme
scores that are very different from
the rest of the group.
Bar Graph Showing the Frequency Distribution
of Academic Majors in an Introductory Psychology Class.

• Frequency distributions, especially graphs, can be a very effective method for
presenting information about a set of scores.
• The distribution shows whether the scores are clustered together or spread out
across the scale.
• However, a frequency distribution is generally considered to be a preliminary
method of statistical analysis.

MEASURES OF CENTRAL TENDENCY
• A measure of central tendency is a single value that attempts to describe a set of
data by identifying the central position within that set of data.
• As such, measures of central tendency are sometimes called measures of central
location. They are also classed as summary statistics.
• The goal is to find the average, or the most typical, score for the entire set.

THE MEAN, MEDIAN AND MODE
• The mean, median and mode are all valid measures of central tendency, but
under different conditions, some measures of central tendency become more
appropriate to use than others.

MEAN (ARITHMETIC)
• The mean (or average) can be used with both discrete and continuous data,
although its use is most often with continuous data.
• The mean is equal to the sum of all the values in the data set divided by the
number of values in the data set. (The mean is computed by adding the scores
and dividing the sum by the number of individuals).

• So, if we have n values in a data set and they have values x1, x2, …, xn, the sample
mean, usually denoted by x̄ (pronounced "x bar") or with the letter M, is:
• x̄ = (x1 + x2 + … + xn)/n or M = ΣX/n
• To compute the mean, you first find the sum of the scores (represented by ΣX)
and then divide by the number of scores (represented by n).
• Scores: 4, 2, 1, 5, 2, 2, 3, 4, 3, 2, 3, 1
• ΣX=32 and n=12.
• The mean is M=32/12=2.67.
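
• The same calculation as a short sketch, using the twelve scores listed above. The variable names are illustrative choices.

```python
scores = [4, 2, 1, 5, 2, 2, 3, 4, 3, 2, 3, 1]

sum_x = sum(scores)   # ΣX
n = len(scores)       # number of scores
mean = sum_x / n      # M = ΣX / n

print(f"ΣX = {sum_x}, n = {n}, M = {mean:.2f}")
```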

• In statistics, samples and populations have very different meanings and these
differences are very important, even if, in the case of the mean, they are
calculated in the same way.
• To acknowledge that we are calculating the population mean and not the sample
mean, we use the Greek lower case letter "mu", denoted as μ:
• μ = ΣX/N, where N is the number of scores in the population

WHEN NOT TO USE THE MEAN
• The mean has one main disadvantage: it is particularly susceptible to the
influence of outliers. These are values that are unusual compared to the rest of
the data set by being especially small or large in numerical value.
• For example, consider the wages of staff at a factory below:
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k

• The mean salary for these ten staff is $30.7k.
• The mean is being skewed by the two large salaries. Therefore, in this situation,
we would like to have a better measure of central tendency.

• The mean cannot be meaningfully calculated for nominal or non-numerical ordinal
data (when we are dealing with qualitative characteristics).
• For example, a researcher may use the value 0 for a male and the value 1 for a
female (nominal measurements are coded with numerical values).
• In this situation, it is possible to compute a mean; however, the result is a
meaningless number.

MEDIAN
• The median is the middle score for a set of data that has been arranged in order
of magnitude.
• The median is the score that divides a distribution in half.

• In order to calculate the median:
• 65 55 89 56 35 14 56 55 87 45 92
• We first need to rearrange that data into order of magnitude (smallest first):
• 14 35 45 55 55 56 56 65 87 89 92
• Our median mark is the middle mark - in this case, 56. This works fine when you
have an odd number of scores.

• When you have an even number of scores, take the middle two scores and average
them. So, if we look at the example below:
• 65 55 89 56 35 14 56 55 87 45
• We again rearrange that data into order of magnitude (smallest first):
• 14 35 45 55 55 56 56 65 87 89
• Only now we have to take the 5th and 6th score in our data set and average them
to get a median of 55.5.
• The median is (55+56)/2=55.5
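
• A minimal sketch of the median procedure for both cases (odd and even number of scores), using the two sets of marks above. The function name `median` is an illustrative choice.

```python
def median(scores):
    """Return the middle score, or the average of the two middle scores."""
    ordered = sorted(scores)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two

odd_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median(odd_marks))   # 56
print(median(even_marks))  # 55.5
```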

• In a distribution with a few extreme scores, for example, the extreme values can
displace the mean so that it is not a central value.
• In this situation, the median often provides a better measure of central tendency.
Thus, you can think of the median as a backup measure of central tendency that
is used in situations in which the mean does not work well.
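
• To illustrate with the factory salaries from the earlier slide, the sketch below compares the mean and the median using Python's standard `statistics` module. Salaries are in thousands.

```python
import statistics

# Salaries (in thousands) of the ten staff from the factory example.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries_k)      # pulled upward by 90k and 95k
median_salary = statistics.median(salaries_k)  # average of the 5th and 6th values

print(f"mean = {mean_salary}k, median = {median_salary}k")
```

• Eight of the ten salaries lie between 12k and 18k, so the median of 15.5k describes the typical salary better than the mean of 30.7k.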

THE MODE
• The mode is the score or category with the greatest frequency.
• The mode is simply the most frequently occurring score.
• Scores: 4, 2, 1, 5, 2, 2, 3, 4, 3, 2, 3, 1
• There are more scores of X=2 than any other value. The mode is 2.
• On a bar chart or histogram, the mode corresponds to the highest bar. You can,
therefore, sometimes consider the mode as being the most popular option.

• The mode identifies the location of
the peak (highest point) in the
distribution.

• Normally, the mode is used for
categorical data where we wish to
know which is the most common
category.

TYPES OF MODE
• Example:
• For a data set (3, 7, 3, 9, 9, 3, 5, 1, 8,
5), the unique mode is 3.
• A distribution with a single mode is
said to be unimodal.

• Example:
• Similarly, for a data set (2, 4, 9, 6, 4, 6, 6, 2,
8, 2), there are two modes: 2 and 6.
• A distribution with more than one mode is
said to be bimodal, trimodal, etc., or in
general, multimodal.
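
• Both examples can be checked with a short sketch. It assumes Python 3.8 or later, where `statistics.multimode` returns every value that shares the highest frequency.

```python
from statistics import multimode

unimodal_data = [3, 7, 3, 9, 9, 3, 5, 1, 8, 5]
bimodal_data = [2, 4, 9, 6, 4, 6, 6, 2, 8, 2]

# multimode returns a list of all most-frequent values.
modes_first = multimode(unimodal_data)
modes_second = multimode(bimodal_data)

print(modes_first)   # one mode: unimodal
print(modes_second)  # two modes: bimodal
```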

• However, one of the problems with
the mode is that it will not provide us
with a very good measure of central
tendency when the most common
mark is far away from the rest of the
data in the data set.

• The mean is a measure of central tendency obtained by adding the individual
scores, then dividing the sum by the number of scores. The mean is the arithmetic
average.
• The median measures central tendency by identifying the score that divides the
distribution in half. If the scores are listed in order, 50% of the individuals have
scores at or below the median.
• The mode measures central tendency by identifying the most frequently
occurring score in the distribution.

SUMMARY OF WHEN TO USE THE MEAN, MEDIAN AND MODE
Type of Variable             Best measure of central tendency
Nominal                      Mode
Ordinal                      Median
Interval/Ratio (not skewed)  Mean
Interval/Ratio (skewed)      Median
• Goals scored over the last 7 games.
• 1 3 4 6 6 7 8
• Mean (average) 5
• Mode (most common) 6
• Median (middle) 6

MEASURES OF VARIABILITY
• Variability describes the spread of the scores in a distribution.
• When variability is small, it means that the scores are all clustered close together.
• Large variability means that there are big differences between individuals and the
scores are spread across a wide range of values.

• To introduce the idea of variability, consider this example. Two vending machines A and B drop
candies when a quarter is inserted. The number of pieces of candy one gets is random. The
following data are recorded for six trials at each vending machine:
• Pieces of candy from vending machine A:
• 1, 2, 3, 3, 5, 4
• mean = 3, median = 3, mode = 3
• Pieces of candy from vending machine B:
• 2, 3, 3, 3, 3, 4
• mean = 3, median = 3, mode = 3

• The dot plot for the pieces of candy
from vending machine A and vending
machine B is displayed:

• There are many ways to describe variability or spread including:
• Range
• Interquartile range (IQR)
• Variance and Standard Deviation

RANGE
• The range is the difference between the maximum and minimum values of a data set.
The maximum is the largest value in the dataset and the minimum is the smallest
value. The range is easy to calculate but it is very much affected by extreme
values.
• Range=Maximum-Minimum
• Goals scored over the last 7 games.
• 1 3 4 6 6 7 8
• Range (largest-smallest) 7

INTERQUARTILE RANGE (IQR)
• Like the range, the IQR is a measure of
variability, but you must find the
quartiles in order to compute its value.
• The interquartile range is the difference
between upper and lower quartiles and
denoted as IQR.
• IQR = Q3 – Q1
• = 75th percentile – 25th
percentile
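
• A minimal sketch of the range and the IQR for the goals data used above. Note the assumption: several conventions exist for locating quartiles, and this sketch uses the default ("exclusive") method of Python's `statistics.quantiles`; other software may report slightly different quartiles for small data sets.

```python
import statistics

goals = [1, 3, 4, 6, 6, 7, 8]  # goals scored over the last 7 games

data_range = max(goals) - min(goals)  # Range = Maximum - Minimum

# n=4 splits the data into quartiles and returns the cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(goals, n=4)
iqr = q3 - q1                         # IQR = Q3 - Q1

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

• Unlike the range, the IQR ignores the lowest and highest quarters of the data, so a single extreme value has little effect on it.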

VARIANCE AND STANDARD DEVIATION
• Standard deviation uses the mean of the
distribution as a reference point and
measures variability by measuring the
distance between each score and the mean.
Conceptually, standard deviation measures
the average distance from the mean.
• When the scores are clustered close to the
mean, the standard deviation is small; when
the scores are scattered widely around the
mean, the standard deviation is large.

• The calculation of standard deviation begins by computing the average squared
distance from the mean. This average squared value is called variance.
• Variance is the average squared distance from the mean and is usually identified
with the symbol s². The calculation of variance involves two steps:

STEP 1:
• Compute the distance from the mean, or the deviation, for each score, then
square each distance, then add the squared distances. The result is called SS, or
the sum of the squared deviations.
• SS = ΣX² − (ΣX)²/n
• X: 5, 6, 1, 5, 3, so ΣX = 20
• X²: 25, 36, 1, 25, 9, so ΣX² = 96
• SS (the sum of the squared deviations) = 96 − (20)²/5 = 96 − 80 = 16

STEP 2:
• Variance is obtained by dividing SS (the sum of squared deviations) by n-1.
• SS=16 and n=5
• Variance = s² = SS/(n − 1) = 16/4 = 4
• When we calculate the sample SD, we estimate the population mean with the
sample mean. Dividing by (n − 1) rather than n compensates for this and gives the
sample variance a special property: it is an "unbiased estimator".
• Therefore s² is an unbiased estimator for the population variance.

STANDARD DEVIATION (SD)
• The standard deviation is approximately the average distance of the values of a
data set from the mean. It is computed as the square root of the variance.
• SD = √s², SD = √4 = 2
• Standard deviation = √ Variance
• Variance = (Standard deviation)²
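
• The two steps and the final square root can be sketched as follows, using the five scores from Step 1. The sketch computes SS both with the computational formula and directly from the deviations, to show that they agree. Variable names are illustrative.

```python
import math

scores = [5, 6, 1, 5, 3]
n = len(scores)

# Step 1: sum of squared deviations (SS), computational formula.
sum_x = sum(scores)                    # ΣX
sum_x2 = sum(x ** 2 for x in scores)   # ΣX²
ss = sum_x2 - sum_x ** 2 / n           # SS = ΣX² − (ΣX)²/n

# Same quantity from the definition: square each distance from the mean.
mean = sum_x / n
ss_from_deviations = sum((x - mean) ** 2 for x in scores)

# Step 2: sample variance divides SS by n − 1.
variance = ss / (n - 1)

# Standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print(f"SS = {ss}, variance = {variance}, SD = {sd}")
```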

• Standard deviation provides a measure of the standard distance from the mean. A
small value for standard deviation indicates that the individual scores are
clustered close to the mean and a large value indicates that the scores are spread
out relatively far from the mean.
• Variance also provides a measure of distance. A small variance indicates that the
scores are clustered close together; a large variance means that the scores are
widely scattered.
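
• As an illustration, the sketch below returns to the two vending machines from the start of this section. Both have a mean of 3, but their sample standard deviations differ. It uses `statistics.stdev`, which divides by n − 1.

```python
import statistics

machine_a = [1, 2, 3, 3, 5, 4]  # pieces of candy, vending machine A
machine_b = [2, 3, 3, 3, 3, 4]  # pieces of candy, vending machine B

mean_a = statistics.mean(machine_a)
mean_b = statistics.mean(machine_b)

# Sample standard deviation (divides SS by n − 1).
sd_a = statistics.stdev(machine_a)
sd_b = statistics.stdev(machine_b)

print(f"A: mean = {mean_a}, SD = {sd_a:.2f}")
print(f"B: mean = {mean_b}, SD = {sd_b:.2f}")
```

• Machine B has the smaller standard deviation, which matches the observation that its counts are clustered more tightly around 3.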

DESCRIBING INTERVAL AND RATIO DATA
(NUMERICAL SCORES)
• The figure shows a frequency distribution
graph with the mean and standard
deviation displayed as described.
• As a general rule, for distributions that are
roughly symmetrical and bell-shaped, roughly
68% of the scores are within one
standard deviation of the mean and
roughly 95% of the scores are within
two standard deviations.

• The mean (M) and standard deviation
are two values that are probably the
most commonly reported descriptive
statistics, and they should provide
enough information to construct a
good picture of the entire set of
scores.
M=45
SD=6
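
• A hedged check of the 68% and 95% figures for the values shown (M=45, SD=6). The sketch assumes the scores follow an exactly normal distribution, which real data only approximate, and uses `statistics.NormalDist` (Python 3.8 or later).

```python
from statistics import NormalDist

# Assumption: scores are normally distributed with M = 45 and SD = 6.
dist = NormalDist(mu=45, sigma=6)

# Proportion of scores between 39 and 51 (within one SD of the mean).
within_1_sd = dist.cdf(45 + 6) - dist.cdf(45 - 6)

# Proportion of scores between 33 and 57 (within two SDs of the mean).
within_2_sd = dist.cdf(45 + 12) - dist.cdf(45 - 12)

print(f"within 1 SD: {within_1_sd:.1%}")
print(f"within 2 SD: {within_2_sd:.1%}")
```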

DESCRIBING NON-NUMERICAL DATA FROM NOMINAL
AND ORDINAL SCALES OF MEASUREMENT
• A researcher may simply classify participants by placing them in separate nominal
or ordinal categories.
• Classification of people by gender (male or female).
• Classification of attitude (agree or disagree).
• Classification of self-esteem (high, medium, or low).

• Report the proportion or percentage in each category.
• These values can be used to describe a single sample or to compare separate
samples.
• For example, a report might describe a sample of voters by stating that 43%
prefer candidate Green, 28% prefer candidate Brown, and 29% are undecided.
• A research report might compare two groups by stating that 80% of the 6-year-
old children were able to successfully complete the task, but only 34% of the 4-
year-olds were successful.

• In addition to percentages and proportions, you also can use the mode as a
measure of central tendency for data from a nominal scale.
• For example, if the modal response to a survey question is “no opinion,” you can
probably conclude that the people surveyed do not care much about the issue.

USING GRAPHS TO SUMMARIZE DATA
• For example, a researcher may want
to examine the effects of heat and
humidity on performance.
• For this study, both the temperature
(variable 1) and the humidity
(variable 2) would be manipulated,
and performance would be evaluated
under a variety of different
temperature and humidity
conditions.

• As a general rule, graphs for two-
factor studies are constructed by
listing the values of one of the
independent variables on the
horizontal axis and listing the values
for the dependent variable on the
vertical axis.

• Notice that the top line presents the
means in the top row of the data
matrix and the bottom line shows the
means from the bottom row.
• The result is a graph that displays all
six means from the experiment, and
allows comparison of means and
mean differences.

CORRELATIONS
• A correlation is a statistical value that measures and describes the direction and
degree of relationship between two variables.
• The sample correlation coefficient is typically denoted as r. It is also known as
Pearson’s r.
• r = SP / √((SS for X)(SS for Y))
• Note that the two variables are labeled X and Y.
• SP is the sum of the products of the deviations.

• For this example, the researcher
computes a correlation that measures
and describes the relationship
between self-esteem and
performance.
Participant  Self-Esteem Scores  Performance Scores
A 62 13
B 84 20
C 89 22
D 73 16
E 66 11
F 75 18
G 71 14
H 80 21
Two Separate Scores for Each Participant

A Scatter Plot Showing the Data

CALCULATION
• x̄ = (62 + 84 + 89 + 73 + 66 + 75 + 71 + 80)/8 = 75
• ȳ = (13 + 20 + 22 + 16 + 11 + 18 + 14 + 21)/8 = 16.875
• SS for X = Σ(x − x̄)² = (62−75)² + (84−75)² + (89−75)² + (73−75)² + (66−75)² +
(75−75)² + (71−75)² + (80−75)² = 572
• SS for Y = Σ(y − ȳ)² = (13−16.875)² + (20−16.875)² + (22−16.875)² + (16−16.875)² +
(11−16.875)² + (18−16.875)² + (14−16.875)² + (21−16.875)² = 112.875
• SP = Σ(x − x̄)(y − ȳ) = (62−75)(13−16.875) + (84−75)(20−16.875) + (89−75)(22−16.875) +
(73−75)(16−16.875) + (66−75)(11−16.875) + (75−75)(18−16.875) + (71−75)(14−16.875) +
(80−75)(21−16.875) = 237
• The sample covariance is Sxy = Σ(x − x̄)(y − ȳ)/(n − 1)
• r = 237/√(572 × 112.875) = 0.9327
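
• The calculation can be reproduced with a short sketch that computes SP and both SS values from the deviations and then applies the formula for r. The helper name `deviation_sums` is an illustrative choice.

```python
import math

self_esteem = [62, 84, 89, 73, 66, 75, 71, 80]   # X scores, participants A-H
performance = [13, 20, 22, 16, 11, 18, 14, 21]   # Y scores, participants A-H

def deviation_sums(xs, ys):
    """Return (SP, SS for X, SS for Y) computed from deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp, ss_x, ss_y

sp, ss_x, ss_y = deviation_sums(self_esteem, performance)
r = sp / math.sqrt(ss_x * ss_y)   # r = SP / √((SS for X)(SS for Y))

print(f"SP = {sp}, SSx = {ss_x}, SSy = {ss_y}, r = {r:.4f}")
```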

• Results of the Pearson correlation indicated that there is a significant, large
positive relationship between self-esteem (X) and performance (Y), r = .933,
p < .001.
• The p-value is the probability that you would have found a result at least as
extreme as the current one if the correlation coefficient were in fact zero (null
hypothesis). If this probability is lower than the conventional 5% (p < 0.05), the
correlation coefficient is called statistically significant.

PROPERTIES OF THE CORRELATION COEFFICIENT, R
• +1 ≥ r ≥ -1, i.e. r takes values between -1 and +1, inclusive.
• The sign of the correlation provides the direction of the linear relationship. The sign
indicates whether the two variables are positively or negatively related.
• A correlation of +1.00 or −1.00 indicates a perfectly consistent relationship and a
correlation of 0.00 indicates no consistent relationship whatsoever.
• There are no units attached to r.
• As the magnitude of r approaches 1, the linear relationship becomes stronger.
• As the magnitude of r approaches 0, the linear relationship becomes weaker.
• The correlation value is the same regardless of which variable we define as X
and which as Y.

• The following four graphs illustrate
four possible situations for the values
of r.

• Graph (d) shows a strong
relationship between y and x even
though r = 0. Note that the absence of
a linear relationship does not imply
that no relationship exists!

SPEARMAN CORRELATION
• The Spearman correlation is also referred to as the Spearman rank correlation or
Spearman’s “rho”.
• It is typically denoted either with the Greek letter rho (ρ) or rs, and it is simply
the Pearson correlation applied to ordinal data (ranks). If the original scores are
numerical values from an interval or ratio scale, it is possible to rank the scores
and then compute a Spearman correlation.
• rs = SP / √((SS for X)(SS for Y)), computed on the ranks
• In this case, the Spearman correlation measures the degree to which the
relationship is consistently one-directional, or monotonic.
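
• A minimal sketch of "rank the scores, then apply Pearson", using the self-esteem and performance scores from the correlation example. The ranking helper assumes there are no tied scores (true for this data); tied scores would need averaged ranks, which this sketch does not handle.

```python
import math

self_esteem = [62, 84, 89, 73, 66, 75, 71, 80]
performance = [13, 20, 22, 16, 11, 18, 14, 21]

def ranks(values):
    """Rank scores from 1 (smallest) upward. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def pearson_r(xs, ys):
    """r = SP / sqrt((SS for X)(SS for Y))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

rank_x = ranks(self_esteem)
rank_y = ranks(performance)
r_s = pearson_r(rank_x, rank_y)   # Spearman = Pearson on the ranks

print(rank_x, rank_y, round(r_s, 4))
```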

REGRESSION
• Linear regression attempts to model the relationship between two variables by
fitting a linear equation to observed data. One variable is considered to be an
explanatory variable, and the other is considered to be a dependent variable.
• A linear regression line has an equation of the form Y = a + bX, where X is the
explanatory variable and Y is the dependent variable. The slope of the line is b,
and a is the intercept (the value of Y when X = 0).
• b = r(Sy/Sx) or b = SP/SSx, and a = My − bMx
• r is the Pearson correlation, Sx is the standard deviation for the X scores, and Sy
is the standard deviation for the Y scores.
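
• As an illustration, the sketch below fits the line for the self-esteem (X) and performance (Y) scores from the correlation example, using b = SP/SSx and a = My − bMx. Applying regression to that data set is an illustrative choice made here; the function `predict` is a hypothetical helper.

```python
self_esteem = [62, 84, 89, 73, 66, 75, 71, 80]   # X
performance = [13, 20, 22, 16, 11, 18, 14, 21]   # Y

n = len(self_esteem)
mean_x = sum(self_esteem) / n
mean_y = sum(performance) / n

sp = sum((x - mean_x) * (y - mean_y)
         for x, y in zip(self_esteem, performance))
ss_x = sum((x - mean_x) ** 2 for x in self_esteem)

b = sp / ss_x              # slope: b = SP / SSx
a = mean_y - b * mean_x    # intercept: a = My − b·Mx

def predict(x):
    """Predicted Y for a given X on the fitted line Y = a + bX."""
    return a + b * x

print(f"Y = {a:.3f} + {b:.3f}X")
print(f"predicted performance at X = 75: {predict(75):.3f}")
```

• The fitted line always passes through the point (Mx, My), so the prediction at X = 75 equals the mean performance score.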

• The figure shows a scatter plot of X
and Y values with a straight line
drawn through the center of the data
points.
• The straight line is valuable because
it makes the relationship easier to see
and it can be used for prediction.

• Before fitting a linear model, first determine whether or not there is a relationship between the variables of
interest. This does not necessarily imply that one variable causes the other (for
example, higher SAT scores do not cause higher college grades), but that there is
some significant association between the two variables.
• A scatterplot can be a helpful tool in determining the strength of the relationship
between two variables. If there appears to be no association between the
proposed explanatory and dependent variables (i.e., the scatterplot does not
indicate any increasing or decreasing trends), then fitting a linear regression
model to the data probably will not provide a useful model.

MULTIPLE REGRESSION
• Multiple linear regression (MLR), also known simply as multiple regression, is a
statistical technique that uses several explanatory variables to predict the
outcome of a response variable.
• The goal of multiple linear regression is to model the linear relationship between
the explanatory (independent) variables and response (dependent) variables.

• The formula for a multiple linear
regression is:
• y = b1x1 + b2x2 + … + bnxn + a
• The independent variables are x1, x2,
and so on.
• The number of independent variables
can grow up to n.
• With two independent variables:
y = b1x1 + b2x2 + a
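
• A minimal sketch of the two-predictor case, y = b1x1 + b2x2 + a. The slides do not say how the coefficients are estimated; this sketch assumes ordinary least squares, solved from deviation sums. The data are hypothetical and were generated exactly from y = 2·x1 + 3·x2 + 1, so the fit should recover those values.

```python
def mean(values):
    return sum(values) / len(values)

def sp(us, vs):
    """Sum of products of deviations; sp(u, u) is the SS for u."""
    mu, mv = mean(us), mean(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))

def fit_two_predictors(x1, x2, y):
    """Least-squares b1, b2, a for y = b1*x1 + b2*x2 + a (assumed method)."""
    ss1, ss2, sp12 = sp(x1, x1), sp(x2, x2), sp(x1, x2)
    sp1y, sp2y = sp(x1, y), sp(x2, y)
    det = ss1 * ss2 - sp12 ** 2   # zero if x1 and x2 are perfectly correlated
    b1 = (sp1y * ss2 - sp12 * sp2y) / det
    b2 = (sp2y * ss1 - sp12 * sp1y) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b1, b2, a

# Hypothetical data: y was built as 2*x1 + 3*x2 + 1 with no noise.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

b1, b2, a = fit_two_predictors(x1, x2, y)
print(f"y = {b1:.2f}*x1 + {b2:.2f}*x2 + {a:.2f}")
```

• With more than two predictors, the same idea is usually handled with matrix methods in a statistics package rather than by hand.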

• Example: A researcher decides to
study students’ performance at a
school over a period of time. He
observed that as lectures moved
online, the performance of students
started to decline.
• The dependent variable, “decrease in
performance”, depends on various
independent variables such as “lack
of attention, more internet addiction,
neglecting studies” and many more.
• The multiple regression equation
would be:
• Y = b1 × attention + b2 × internet
addiction + b3 × technology support
+ … + bnxn + a

• Multiple regression helps us to better study the various predictor variables at
hand.
• It increases reliability by avoiding dependence on just one variable and using more
than one independent variable to explain the outcome.
• Multiple regression analysis permits you to study more complex hypotheses than
would be possible with a single predictor.

REFERENCE
• Gravetter, F. J., & Forzano, L. B. (2011). Descriptive statistics. In Research
Methods for the Behavioral Sciences (4th ed., pp. 434–451). Wadsworth
Publishing.
