2. Chapter 3:
Describing, Exploring, and Comparing Data
3.1 Measures of Center
3.2 Measures of Variation
3.3 Measures of Relative Standing and Boxplots
2
Objectives:
1. Summarize data, using measures of central tendency, such as the mean, median, mode,
and midrange.
2. Describe data, using measures of variation, such as the range, variance, and standard
deviation.
3. Identify the position of a data value in a data set, using various measures of position,
such as percentiles, deciles, and quartiles.
4. Use the techniques of exploratory data analysis, including boxplots and five-number
summaries, to discover various aspects of data
3. Recall: 3.1 Measures of Center
Measure of Center (Central Tendency)
A measure of center is a value at the center or
middle of a data set.
1. Mean: 𝑥 =
𝑥
𝑛
, 𝜇 =
𝑥
𝑁
, 𝑥 =
𝑓∙𝑥 𝑚
𝑛
2. Median: The middle value of ranked data
3. Mode: The value(s) that occur(s) with the
greatest frequency.
4. Midrange: 𝑀𝑟 =
𝑀𝑖𝑛+𝑀𝑎𝑥
2
5. Weighted Mean: 𝑥 =
𝑤∙𝑥
𝑤
3
4. Key Concept: Variation is the single most important topic in statistics.
This section presents three important measures of variation: range, standard
deviation, and variance.
3.2 Measures of Variation
4
1. Range = Max - Min
2. Variance
3. Standard Deviation
4. Coefficient of Variation
5. Chebyshev’s Theorem
6. Empirical Rule (Normal)
7. Range Rule of Thumb for
Understanding Standard Deviation
𝑠 ≈
𝑅𝑎𝑛𝑔𝑒
4
& µ ± 2σ
1 – 1/k2
Use CVAR to compare
variabiity when the units are
different.
100%
s
CVAR
X
5. 5
Example 1: Two brands of outdoor paint are tested to see how long each
will last before fading. The results (in months) for a sample of 6 cans are
shown. Find the mean and range of each group.
a. Find the mean and range of each group.
b. Which brand would you buy?
Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25
210
Brand A: 35, 60 10 50
6
x
x R
n
210
35
Brand B: 6
45 25 20
x
x
n
R
The average for both brands is the same, but the range
for Brand A is much greater than the range for Brand B.
Which brand would you buy?
𝑅 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛, 𝑥 =
𝑥
𝑛
, 𝜇 =
𝑥
𝑁
, 𝑠 =
(𝑥− 𝑥)2
𝑛−1
Range = Maximum data value − Minimum data value
3.2 Measures of Variation
The range uses only the maximum and the minimum data values, so it is very sensitive
to extreme values. Therefore, the range is not resistant, it does not take every value
into account, and does not truly reflect the variation among all of the data values.
6. Variance & Standard Deviation
6
3.2 Measures of Variation
The variance is the average of the squares of the distance each value is from the mean.
The standard deviation is the square root of the variance.
The standard deviation is a measure of how spread out your data are and how much data values deviate
away from the mean.
Notation
s = sample standard deviation
σ = population standard deviation
Usage & properties:
1. To determine the spread of the data.
2. To determine the consistency of a variable.
3. To determine the number of data values that fall within a specified interval in a distribution (Chebyshev’s
Theorem).
4. Used in inferential statistics.
5. The value of the standard deviation s is never negative. It is zero only when all of the data values are exactly the
same.
6. Larger values of s indicate greater amounts of variation.
7. Variance & Standard Deviation
7
3.2 Measures of Variation
2
2
Population Variance:
X
N
2
Population Standard Deviation:
X
N
2
2
2
2
Sample Variance:
1
1
X X
X X
s
n
n
n n
2
2
2
Sample Standard Deviation
1
1
:
X X
X X
s
n
n
n n
1. The standard
deviation is effected
by outliers.
2. The units of the
standard are the same
as the units of the
original data values.
3. The sample standard
deviation s is a
biased estimator of
the population
standard deviation σ,
which means that
values of the sample
standard deviation s
do not center around
the value of σ.
TI Calculator:
How to enter data:
1. Stat
2. Edi
3. Highlight & Clear
4. Type in your data in
L1, ..
TI Calculator:
Mean, SD, 5-number
summary
1. Stat
2. Calc
3. Select 1 for 1 variable
4. Type: L1 (second 1)
5. Scroll down for 5-
number summary
8. Example 2
8
Given the data speeds (Mbps): 38.5, 55.6, 22.4, 14.1, 23.1.
a. Find the range of these data speeds (Mbps):
b. Find the standard deviation
Range =Max − Min = 55.6 − 14.1 = 41.50 Mbps
38.5 55.6 22.4 14.1 23.1
b.
5
X
153.7
30.74
5
Mbps
2 2 2 2 2
38.5 30.74 55.6 30.74 22.4 30.74 14.1 30.74 23.1 30.74
5 1
s
OR:
1083.0520
4
16.45Mbps
2 2
2
1
5(5807.79) 153.7
5 5 1
5415.26
16.45
20
X Xn
s
n n
Mbps
𝑅 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛
𝑥 =
𝑥
𝑛
, 𝑠 =
(𝑥 − 𝑥)2
𝑛 − 1
9. Example 3
9
Find the variance and standard deviation for the population
data set for Brand A paint. 10, 60, 50, 30, 40, 20.
Months, X µ X – µ (X – µ)2
10
60
50
30
40
20
35
35
35
35
35
35
–25
25
15
–5
5
–15
625
625
225
25
25
225
1750
1750
17.1
6
Months
2
2 X
N
1750
291.7
6
𝑅 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛, 𝑥 =
𝑥
𝑛
, 𝜇 =
𝑥
𝑁
, 𝑠 =
(𝑥− 𝑥)2
𝑛−1
, 𝜎 =
(𝑥−𝜇)2
𝑁
10 60 50 30 40 20
35
5
10. Example 4
10
Find the variance and standard deviation for the amount
of European auto sales for a sample of 6 years. The data
are in millions of dollars.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
X X 2
11.2
11.9
12.0
12.8
13.4
14.3
125.44
141.61
144.00
163.84
179.56
204.49
958.9475.6
2 2
2
1
X Xn
s
n n
2
2 75.66 958.94
6 5
s
2
1.28
1.13
s
s
2 2
6 958.94 75.6 / 6 5 s
𝑅 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛
𝑥 =
𝑥
𝑛
, 𝑠 =
(𝑥 − 𝑥)2
𝑛 − 1
𝜎 =
(𝑥 − 𝜇)2
𝑁
11. Range Rule of Thumb for Understanding Standard Deviation
The range rule of thumb is a crude but simple tool for understanding and
interpreting standard deviation. The vast majority (such as 95%) of sample
values lie within 2 standard deviations of the mean.
11
Variance & Standard Deviation3.2 Measures of Variation
Unusual:
Significantly low values are µ − 2σ or lower.
Significantly high values are µ + 2σ or higher.
Usual:
Values not significant are between (µ − 2σ ) and (µ + 2σ).
Range Rule of Thumb for Estimating a Value of the Standard Deviation
To roughly estimate the standard deviation from a collection of known sample data
(when the distribution is unimodal and approximately symmetric), use: 𝑠 ≈
𝑅𝑎𝑛𝑔𝑒
4
12. The Empirical Rule
The empirical rule states that for
data sets having a distribution that
is approximately bell-shaped, the
following properties apply.
• About 68% of all values fall within 1
standard deviation of the mean.
• About 95% of all values fall within 2
standard deviations of the mean.
• About 99.7% of all values fall within 3
standard deviations of the mean.
12
3.2 Measures of Variation
13. Example 5
13
IQ scores have a bell-shaped distribution with a mean of 100 and a
standard deviation of 15. What percentage of IQ scores are between 70 and
130?
130 − 100 = 30 & 100 − 70 = 30
The empirical rule: About 95% of all IQ scores are between 70 and 130.
30
𝜎
=
30
15
= 2
Example 6
Use Range Rule of Thumb to approximate the lowest value and the highest value in a
data set where 𝑥 = 10 & 𝑅 = 12.
µ ± 2σ
4
R
s
12
3
4
𝑥 ± 2𝑠 = 10 ± 2(3)
𝐿𝑜𝑤 = 4 & ℎ𝑖 = 16
𝑅 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛, 𝑥 =
𝑥
𝑛
, 𝜇 =
𝑥
𝑁
, 𝑠 ≈
𝑅
4
, 𝑠 =
(𝑥− 𝑥)2
𝑛−1
, 𝜎 =
(𝑥−𝜇)2
𝑁
14. 14
Chebyshev’s Theorem3.2 Measures of Variation
The proportion of values from any data set that fall within k standard deviations of the
mean will be at least 1 – 1/k2, where k is a number greater than 1 (k is not necessarily
an integer).
# of standard
deviations, k
Minimum Proportion
within k standard
deviations
Minimum Percentage within k
standard deviations
2 1 – 1/4 = 3/4 75%
3 1 – 1/9 = 8/9 88.89%
4 1 – 1/16 = 15/16 93.75%
15. Example 7
15
The mean price of houses in a certain neighborhood is $50,000, and the
standard deviation is $10,000. Find the price range for which at least 75%
of the houses will sell. Chebyshev’s Theorem states that:
At least 75% of a data set will fall within 2 standard deviations of the mean.
Why Chebyshev’s Theorem?
1 – 1/k2
50,000 – 2(10,000) = 30,000
50,000 + 2(10,000) = 70,000
Example 8: A survey of local companies found that the mean amount of
travel allowance for executives was $0.25 per mile. The standard deviation
was 0.02. Using Chebyshev’s theorem, find the minimum percentage of
the data values that will fall between $0.20 and $0.30.
.30 .25 /.02 2.5
.25 .20 /.02 2.5
2 2
1 1/ 1 1/ 2.5k 2.5k
0.84 84%
µ=0.25, σ = 0.02
16. Comparing Variation in Different Samples or Populations
Coefficient of Variation
The coefficient of variation (or CV) for a set of nonnegative sample or population
data, expressed as a percent, describes the standard deviation relative to the mean,
and is given by the following:
The coefficient of variation is the standard deviation divided by the mean,
expressed as a percentage.
Use CVAR to compare standard deviations when the units are different.
16
Properties of Variance3.2 Measures of Variation
100%
s
CV
x
100%CV
17. Example 9
17
3.2 Measures of Variation
The mean of the number of sales of cars over a 3-month period is 87, and
the standard deviation is 5. The mean of the commissions is $5225, and
the standard deviation is $773. Compare the variations of the two.
Commissions are more variable than sales.
5
100% 5.7% Sales
87
CVar
773
100% 14.8% Commissions
5225
CVar
100%CV
18. Properties of Variance
The units of the variance are the squares of the units of the original data values.
The value of the variance can increase dramatically with the inclusion of
outliers. (The variance is not resistant.)
The value of the variance is never negative. It is zero only when all of the data
values are the same number.
The sample variance s² is an unbiased estimator of the population variance σ².
18
3.2 Measures of Variation
Why Divide by (n – 1)?
There are only n − 1 values that can be assigned without constraint. With a
given mean, we can use any numbers for the first n − 1 values, but the last value
will then be automatically determined.
With division by n − 1, sample variances s² tend to center around the value of
the population variance σ²; with division by n, sample variances s² tend to
underestimate the value of the population variance σ².
19. Biased and Unbiased Estimators
The sample standard deviation s is a biased estimator of the population standard
deviation s, which means that values of the sample standard deviation s do not
tend to center around the value of the population standard deviation σ.
The sample variance s² is an unbiased estimator of the population variance σ²,
which means that values of s² tend to center around the value of σ² instead of
systematically tending to overestimate or underestimate σ².
19
2
2
2
2
2
)(
,
N
fm
N
N
mf
fm
N
mf
Recall for a Grouped Data: m is the Midpoint of a class