This 10-hour class is intended to give students the basis to solve statistical problems empirically. Talk 1 serves as an introduction to the statistical software R, and presents how to calculate basic measures such as the mean, variance, correlation and Gini index. Talk 2 shows how the central limit theorem and the law of large numbers work empirically. Talk 3 presents the point estimate, the confidence interval and the hypothesis test for the most important parameters. Talk 4 introduces the linear regression model and Talk 5 the bootstrap world. Talk 5 also presents an easy example of a Markov chain.
All the talks are supported by scripts written in the R language.
Talk 2
1. Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Lesson 2 - Application to the Central Limit Theorem - 16.01.2015
2. Introduction
Modern statistics was built and developed around the normal
distribution.
Statisticians often say that, if the empirical distribution is
normal (or approximately normal), everything works well. This
depends mainly on the sample size.
That said, it is important to understand in which circumstances we
can state that the distribution is normal.
Two founding statistical theorems are helpful: The Central Limit
Theorem and The Law of Large Numbers.
3. The Law of Large Numbers (LLN)
Suppose we have a random variable X with expected value
E(X) = µ.
We draw n observations from X (say x = {x1, x2, ..., xn}).
If we define X̄n = (1/n) Σᵢ xᵢ = (x1 + x2 + ... + xn)/n, the LLN
states that, for n → ∞,
X̄n → µ
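As a quick sketch of the LLN in action (the exponential distribution and the sample sizes here are illustrative choices, not from the slides):

```r
# LLN sketch: the sample mean approaches E(X) as n grows.
# Here X ~ Exp(rate = 2), so E(X) = 1/2.
set.seed(123)  # for reproducibility
for (n in c(10, 1000, 100000)) {
  xbar_n = mean(rexp(n, rate = 2))
  cat("n =", n, " sample mean =", xbar_n, "\n")
}
```

As n grows, the printed sample means settle ever closer to the expected value 0.5.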
4. The Central Limit Theorem (CLT)
Suppose we have a random variable X with expected value
E(X) = µ and variance V(X) = σ².
We draw n observations from X (say x = {x1, x2, ..., xn}).
Let's define X̄n = (1/n) Σᵢ xᵢ = (x1 + x2 + ... + xn)/n.
X̄n is distributed with expected value µ and variance σ²/n.
For n → ∞ (in practice, n > 30),
X̄n ∼ N(µ, σ²/n), whatever the distribution of X may be.
N.B. If X is normally distributed, X̄n ∼ N(µ, σ²/n) even if
n < 30.
5. CLT: Empiricals
To better understand the CLT, it is recommended to examine the
theorem empirically and step by step, introducing new commands in
the R language along the way.
In the first part, we will show how to draw and visualize a sample
of random numbers from a distribution.
Then, we will examine the mean and standard deviation of the
sample, and finally the distribution of the sample means.
6. Drawing random numbers - 1
We already introduced the use of the letters d, p and q in relation
to the various distributions (e.g. normal, uniform, exponential). A
reminder of their use follows:
d is for density: it is used to find values of the probability
density function.
p is for probability: it is used to find the probability that the
random variable lies to the left of a given number.
q is for quantile: it is used to find the quantiles of a given
distribution.
There is a fourth letter, namely r, used to draw random numbers
from a distribution. For example runif and rexp would be used to
draw random numbers from the uniform and exponential
distributions, respectively.
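A quick illustration of the four letters for the normal distribution (the numeric values in the comments are rounded; any of d, p, q, r can be combined with norm, unif, exp, and so on):

```r
# d: density of the standard normal at 0
dnorm(0)        # about 0.3989, i.e. 1/sqrt(2*pi)
# p: P(Z <= 1.96) for a standard normal Z
pnorm(1.96)     # about 0.975
# q: the quantile function, the inverse of pnorm
qnorm(0.975)    # about 1.96
# r: five random draws from the uniform and exponential distributions
runif(5)
rexp(5, rate = 1)
```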
7. Drawing random numbers - 2
Let's use the rnorm command to draw 500 numbers at random from
a normal distribution having mean 100 and standard deviation (sd)
10.
> x = rnorm(500, mean = 100, sd = 10)
The result, obtained by typing x in the R console, is a list of 500
numbers extracted at random from a normal distribution with mean
100 and sd 10.
When you examine the numbers stored in the vector x, there is a
sense that you are pulling random numbers clumped about
a mean of 100. However, a histogram of this selection provides a
different picture of the data stored.
> hist(x,prob=TRUE)
8. Drawing random numbers - Comments
Several comments are in order regarding the histogram in the
figure.
1. The histogram is approximately normal in shape.
2. The balance point of the histogram appears to be located
near 100, suggesting that the random numbers were drawn
from a distribution having mean 100.
3. Almost all of the values are within 3 increments of 10 from
the mean, suggesting that random numbers were drawn from
a normal distribution having standard deviation 10.
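These impressions can also be checked numerically (a sketch; the exact values vary from draw to draw):

```r
x = rnorm(500, mean = 100, sd = 10)
mean(x)                        # should be close to 100
sd(x)                          # should be close to 10
# fraction of values within 3 sd's (3 increments of 10) of the mean:
# should be close to 1
mean(abs(x - 100) < 3 * 10)
```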
9. Drawing random numbers - a new drawing
Let's try the experiment again, drawing a new set of 500 random
numbers from the normal distribution having mean 100 and
standard deviation 10:
> x = rnorm(500, mean = 100, sd = 10)
> hist(x, prob = TRUE, ylim = c(0, 0.04))
Take a look at the histogram ... It is different from the first one;
however, it shares some common traits: (1) it appears normal in
shape; (2) it appears to be balanced around 100; (3) almost all
values appear to occur within 3 increments of 10 of the mean.
This is strong evidence that the random numbers have been
drawn from a normal distribution having mean 100 and sd 10. We
can support this claim by superimposing a normal density
curve:
> curve(dnorm(x, mean = 100, sd = 10), 70, 130, add = TRUE,
lwd = 2, col = "red")
10. The curve command
The curve command is new. Some comments on its use
follow:
1. In its simplest form, the syntax curve(f(x), from =, to =)
draws the function defined by f(x) on the interval (from, to).
Our function is dnorm(x, mean = 100, sd = 10). The curve
command sketches this function of x on the interval
(from, to).
2. The notation from = and to = may be omitted if the
arguments are passed to the curve command in the proper order:
function first, value of from second, value of to third. That is
what we have done.
3. If the argument add is set to TRUE, then the curve is added
to the existing figure. If the argument is omitted (or FALSE),
then a new plot is drawn, erasing the previous graph.
11. The distribution of X̄n (the sample mean)
In our previous example we drew 500 random numbers from a
normal distribution with mean 100 and standard deviation 10. This
leads to ONE sample of n = 500. Now the question is: what is
the mean of our sample?
> mean(x)
[1] 100.14132
If we take another sample of 500 random numbers from the SAME
distribution, we get a new sample with a different mean.
> x = rnorm(500, mean = 100, sd = 10)
> mean(x)
[1] 100.07884
What happens if we draw a sample several times?
12. Producing a vector of sample means
We will repeatedly sample from the normal distribution, 500 times.
Each of the 500 samples will select 5 random numbers (instead of
500) from the normal distribution having mean 100 and sd 10. We
will then compute the mean of each of those samples.
We begin by declaring the mean and the standard deviation. Then,
we declare the sample size.
> mu = 100; sigma = 10
> n = 5
We then need some place to store the means of the samples. We
initialize a vector xbar to initially contain 500 zeros.
> xbar = rep(0, 500)
13. Producing a vector of sample means - the for loop
It is easy to draw a sample of size n = 5 from the normal
distribution having mean mu = 100 and standard deviation
sigma = 10. We simply issue the command
> rnorm(n, mean = mu, sd = sigma)
To find the mean of this result, we simply wrap it in
mean:
> mean(rnorm(n, mean = mu, sd = sigma))
The final step is to store this result in the vector xbar. Then we
must repeat this same process an additional 499 times. This
requires the use of a for loop.
> for (i in 1:500) { xbar[i] = mean(rnorm(n, mean = mu, sd = sigma)) }
14. The for loop
The i in for (i in 1:500) is called the index of the for loop.
The index i is first set equal to 1, then the body of the for
loop is executed. On the next iteration, i is set equal to 2 and
the body of the loop is executed again. The loop continues in
this manner, incrementing by 1, finally setting the index i to
500. After executing the last iteration, the for loop terminates.
In the body of the for loop, we have
xbar[i] = mean(rnorm(n, mean = mu, sd = sigma)). This draws a
sample of size 5 from the normal distribution, calculates the
mean of the sample, and stores the result in xbar[i].
When the for loop completes 500 iterations, the vector xbar
contains the means of 500 samples of size 5 drawn from the
normal distribution having mu = 100 and sigma = 10.
> hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130),
ylim = c(0, 0.1))
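As an aside, the same vector of sample means can be produced without an explicit for loop, using R's replicate function (an equivalent, more compact idiom, not shown in the slides):

```r
mu = 100; sigma = 10; n = 5
# replicate() evaluates the expression 500 times and collects the results
xbar = replicate(500, mean(rnorm(n, mean = mu, sd = sigma)))
hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130), ylim = c(0, 0.1))
```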
15. Distribution of X̄n - observations
1. The previous histograms described the shape of 500 randomly
selected numbers; here, the histogram describes the
distribution of 500 different sample means, each of which is
found by selecting n = 5 random numbers from the
normal distribution.
2. The distribution of xbar appears normal in shape. This is so
even though the sample size is relatively small (n = 5).
3. It appears that the balance point occurs near 100. This can
be checked with the following command:
> mean(xbar)
That is the mean of the sample means, which is almost equal to
the mean of the drawn random numbers.
4. The distribution of the sample means appears to be narrower
than the random number distribution.
16. Increasing the sample size
Let's repeat the last experiment, but this time let's draw samples
of size n = 10 from the same distribution (mu = 100, sigma = 10).
> mu = 100; sigma = 10
> n = 10
> xbar = rep(0, 500)
> for (i in 1:500) { xbar[i] = mean(rnorm(n, mean = mu, sd = sigma)) }
> hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130),
ylim = c(0, 0.1))
The histogram produced is even narrower than the one obtained
with n = 5.
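The narrowing can be quantified: the CLT says the standard deviation of X̄n is σ/√n. A sketch comparing the empirical and theoretical values (the sample sizes here are illustrative):

```r
mu = 100; sigma = 10
for (n in c(5, 10, 100)) {
  xbar = replicate(500, mean(rnorm(n, mean = mu, sd = sigma)))
  cat("n =", n,
      " sd(xbar) =", round(sd(xbar), 2),
      " theory sigma/sqrt(n) =", round(sigma / sqrt(n), 2), "\n")
}
```

The empirical sd of the sample means tracks σ/√n closely, shrinking as n grows.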
17. Key Ideas
1. When we select samples from a normal distribution, the
distribution of sample means is also normal in shape.
2. The mean of the distribution of sample means appears to be
the same as the mean of the random numbers
(parent population) (compare the balance points).
3. By increasing the size of our samples, the histogram
becomes narrower. In fact, we would expect a more accurate
estimate of the mean of the parent population if we take the
mean of a larger sample.
4. Imagine drawing sample means from samples of size n = ∞. The
histogram would be exactly concentrated (P = 1) at X̄ = µ,
since the variance is σ²/∞ = 0.
18. Summary
We finish by restating the CLT:
1. If you draw samples from a normal distribution, then the
distribution of the sample means is also normal.
2. The mean of the sample means is roughly identical to the
mean of the parent population.
3. The larger the sample size that is drawn, the narrower
the spread of the distribution of the sample means.
19. Homework
Experiment 1: Draw the xbar histogram for n = 1000. What is
the shape of the histogram?
Experiment 2: Repeat the full experiment, drawing random
numbers and sample means from (1) a uniform and (2) a
Poisson distribution. Is the histogram of xbar normal in shape for
n = 5 and for n = 30?
Experiment 3: Repeat the full experiment using real data instead
of random numbers. (HINT: select samples of size n = 5
from the real data, not with rnorm.)
Recommended: Try to evaluate the agreement of the sample mean
histogram with the normal distribution by means of the Q-Q plot
and the Shapiro-Wilk test.
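For the recommended check, a minimal sketch applied to a vector of sample means (qqnorm, qqline and shapiro.test are base R functions):

```r
xbar = replicate(500, mean(rnorm(5, mean = 100, sd = 10)))
# Q-Q plot: points lying near the reference line suggest normality
qqnorm(xbar)
qqline(xbar)
# Shapiro-Wilk test: a large p-value does not reject normality
shapiro.test(xbar)
```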
20. Application to the Law of Large Numbers
Experiment: toss a coin 100 times.
This experiment amounts to repeating 100 times a random draw
from a Bernoulli distribution with parameter p = 0.5.
We expect to obtain heads (value = 1) 50 times and tails
(value = 0) 50 times, if the coin is fair.
But, in practice, this does not happen: repeating the experiment we
obtain a distribution centered at 50, but spread out.
Let's define X̄n as the mean of the number of heads
across n experiments. For n → ∞, X̄n → 50.
21. Application to the Law of Large Numbers - 2
# draw 100 random numbers from a Bernoulli(0.5) distribution
x = rbinom(100, 1, 0.5)
x
x2 = rbinom(100, 2, 0.5)
# histogram of the random numbers
hist(x)
# empirical frequency of heads
sum(x)
# vector to store the empirical frequencies for the sample mean
xfreq = rep(0, 1000)
xfreq
# for loop defining the experiment number i
N = rep(0, 1000)
for (i in 1:1000) N[i] = i
N
# cumulated (total) frequency of heads
xfreq[1] = sum(rbinom(100, 1, 0.5))
xfreq[1]
for (i in 2:1000) xfreq[i] = sum(rbinom(100, 1, 0.5)) + xfreq[i-1]
xfreq
# sample mean (cumulated frequency divided by number of experiments)
xfreq2 = rep(0, 1000)
for (i in 1:1000) xfreq2[i] = xfreq[i] / i
xfreq2
plot(xfreq2, ylim = c(48, 52))
# reference line at the expected value 50
mu = rep(50, 1000)
mu
points(mu, col = "red")