2. Content
1. Data visualisation
2. Covariance and correlation
3. Normal distribution
4. Central limit theorem (CLT)
5. Law of large numbers (LLN)
6. Hypothesis testing
3. Assessment Profile
Mid-term Coursework Assignment: 30% of the overall mark
▪ List of five exercises to be performed remotely within a 24-hour
period.
▪ Deadline: 18/03/2022 at 10:00am
Final Coursework Assignment: 70% of the overall mark
▪ Report showing a competent application of quantitative methods
and data analysis concepts learned in our module, exploring a topic
of your own interest.
▪ Word limit: 2000 words.
▪ Deadline: 22/04/2022 at 10:00am
4. Final Coursework Assignment 70%
1. Instructions and Guidance
• In this Report (2000 words), please proceed as follows:
• Select a topic which you are really interested in exploring.
If you would like me to select a topic for you, that is completely fine; just inform
me and I will provide you with a topic to be explored.
• Decide which research question you are going to address.
• Collect data related to your topic and research question.
• Decide which quantitative research method(s) you are going to adopt.
• Perform data analyses applying quantitative research method(s) learnt in this
module on your data using R/ R Studio.
• Detail the method(s) adopted and discuss your findings in your individual Report.
5. Final Coursework Assignment 70% (Cont.)
The structure of this Report should consist of the following brief sections:
• Section 1. Introduction: Briefly mention your topic, question, input data, and
analyses performed;
• Section 2. Data: Detail your dataset, including data source, temporal coverage,
sample size;
• Section 3. Results: Describe the quantitative research methods adopted and data
analyses performed, reporting your results using a complementary chart and
table, discussing your findings;
• Section 4. Conclusion: Summarise your Report, briefly describing the main
quantitative research method adopted as well as your most relevant/ interesting
finding.
• Appendix. Attach an image/ figure (e.g. a screenshot of your code) evidencing
that you performed your data analyses using R/ R Studio.
6. Final Coursework Assignment 70% (Cont.)
2. Assessment Rubric with Weighted Criteria
• Following the structure of the Report, five rubrics are assessed, each
item contributing its respective weight to the overall coursework
assignment mark (totalling 100 points), as follows:
• Section 1. Introduction – weight: 15% of the coursework assignment overall
mark;
• Section 2. Data – weight: 20% of the coursework assignment overall mark;
• Section 3. Results – weight: 40% of the coursework assignment overall mark;
• Section 4. Conclusion – weight: 15% of the coursework assignment overall mark;
• Appendix – weight: 10% of the coursework assignment overall mark.
7. Final Coursework Assignment 70% (Cont.)
3. Assessment Criteria
This Report adopts the following undergraduate (UG) performance
thresholds:
• “Exceeds expectations” at equivalent of 60 or more points;
• “Meets expectations” at equivalent between 40 and 59 points;
• “Does not meet expectations” at equivalent of 39 or less points.
8. Defining Statistics
• Statistics has the power to turn raw data into information which may
effectively support the decision-making process.
In fact, this is the lifeblood of any modern business.
• It is a crucial link in the chain connecting data to information,
information to knowledge, and knowledge to action/ decision.
They are all part of an informed decision process.
• “Statistical thinking will one day be as necessary for efficient
citizenship as the ability to read and write” (H. G. Wells)
Statistics is the art and science of collecting, analysing,
interpreting and presenting data, aiming at transforming that
data into useful information
Stats Recap
9. Stats Recap
• Can you remember the following basic statistics concepts?
▪ p-value?
▪ Hypothesis testing?
▪ Type I and II errors?
▪ Normal distribution?
▪ Central limit theorem?
▪ Covariance and correlation?
• Understanding elementary statistics is crucial to navigate through most
of quantitative research methods and data analytics.
• That is the reason we are reviewing some key concepts in statistics.
10. Data Visualisation
11. Data Visualisation
• Never trust summary statistics alone.
• Always visually explore your data.
• Relying only on data summaries (e.g. mean, standard deviation,
correlations) may be misleading because wildly different datasets may
give similar – if not identical – results.
• This is a principle that has been demonstrated for decades, for instance
through Anscombe’s quartet (1973).
Source 1: https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html
Source 2: http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
12. • Anscombe’s quartet comprises four datasets that have nearly identical
simple descriptive statistics, yet have very different distributions and
appear very different when graphed.
• Each dataset consists of eleven (𝑥, 𝑦) points.
• They were constructed to demonstrate the importance of graphing data
before analysing it.
• The quartet was created to counter the impression that “numerical
calculations are exact, but graphs are rough”.
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Data Visualisation (Cont.)
13. Dataset I Dataset II Dataset III Dataset IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Data Visualisation (Cont.)
14. • All four Anscombe’s quartet datasets yield the following statistical
measures:

Statistic/ Property                  Value               Accuracy
Mean of x                            9                   Exact
Variance of x (s_x^2)                11                  Exact
Mean of y                            7.50                Two decimal places
Variance of y (s_y^2)                4.125               ±0.003
Correlation between x and y          0.816               Three decimal places
Linear regression line               y = 3.00 + 0.500x   Two and three decimal places, respectively
Coefficient of determination (R^2)   0.67                Two decimal places

Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Data Visualisation (Cont.)
15. • Identical summary statistics but radically different charts:
[Figure: scatter plots of the four Anscombe datasets, each with the same fitted regression line]
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Data Visualisation (Cont.)
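Anscombe’s near-identical summary statistics are easy to verify directly. The module’s analyses are done in R/ R Studio, but as an illustrative sketch the same check can be written in a few lines of Python, using dataset I from the table above:

```python
# Anscombe's quartet, dataset I (values from the table in this section)
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
n = len(x)

mean_x = sum(x) / n                                    # 9.0 (exact)
var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)    # 11.0 (sample variance)
mean_y = sum(y) / n                                    # ~7.50
var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)    # ~4.127
```

Repeating the same computation for datasets II–IV returns the same values to the accuracy shown in the table, even though the scatter plots look completely different.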
16. • In addition, it is also possible to generate bivariate data with a given
mean, median, and correlation in virtually any shape, from a circle to a
star to a dinosaur, as follows:
[Figure: the Datasaurus Dozen — differently shaped datasets sharing near-identical summary statistics]
Source 1: https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html
Source 2: http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
Data Visualisation (Cont.)
17. Covariance and Correlation
18. • Measures of association and related data visualisation techniques
include the following:
• Covariance
• Correlation
• Scatter diagram and trendline
• These show the degree of association or relationship between two
variables, but do not imply causation.
• The behaviour of one does not necessarily cause the behaviour of the
other.
Measures of Association
19. Basics of Data
• Statistics is the science of data.
• Data consist of the facts or figures that are the subject of
summarisation, analysis, modelling, and presentation.
• A dataset is a collection of data with some common connection.
For instance, the GDP of European countries from 2010 to 2020.
• A variable is a particular characteristic of interest within a group of
observations.
For instance, the GDP of Germany.
• An observation (observational unit or case) is a particular value
comprising a variable.
An example can be the GDP of Germany in 2020.
Covariance
20. Covariance
• Positive values indicate a positive linear relationship
• Negative values indicate a negative linear relationship
• If the dataset refers to a sample, the covariance is denoted by Cov(x, y)
• The covariance may be calculated as follows:

Cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)

• If the dataset refers to a population, the covariance is then calculated as
follows:

Cov(x, y) = Σ(x_i − μ_x)(y_i − μ_y) / N
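As a minimal sketch of the sample-covariance formula above (Python here for illustration, though the module uses R; the paired data are hypothetical):

```python
# hypothetical paired data: hours studied (x) and exam score (y)
x = [2, 4, 6, 8, 10]
y = [50, 60, 65, 80, 95]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
# sample covariance: sum of cross-deviations divided by n - 1
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
```

Here cov_xy works out positive (55.0), consistent with an upward-sloping relationship between the two variables.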
21. Covariance (Cont.)
• In (a), an upward sloping line best describes the points, indicating a
positive covariance.
• In (b), the downward sloping line implies a negative covariance.
• In (c), the line has 0 slope, which means a covariance of 0.
[Figure: three scatter plots of y against x — (a) upward-sloping line, positive covariance; (b) downward-sloping line, negative covariance; (c) flat line, zero covariance]
22. Correlation
23. Correlation
• The coefficient is a standardised measure (no units) and takes on
values between −1 and +1.
• Values near −1 suggest a strong negative linear relationship.
• Values near +1 suggest a strong positive linear relationship.
• If the datasets are samples, the coefficient is denoted by r_xy, as follows:

r_xy = Cov(x, y) / (s_x s_y)

• If the datasets are populations, the coefficient is denoted by ρ_xy, as
follows:

ρ_xy = Cov(x, y) / (σ_x σ_y)
24. Correlation (Cont.)
• The formula r_xy = Cov(x, y) / (s_x s_y) may alternatively be understood
as follows:

r_xy = (amount that x and y vary together) / (total variability in x and y)

• Correlation measures the strength of the relationship between two
variables.
• It aims to answer the following question:
▪ When x gets larger, does y consistently get larger (or smaller)?
• It is often measured with Pearson’s correlation coefficient:
▪ Commonly called the “correlation coefficient” or even just “correlation”
▪ Almost always represented with the letter r
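Continuing the hypothetical hours-studied example, the sample correlation simply standardises the covariance by the two sample standard deviations (a Python sketch for illustration; the module itself uses R):

```python
from math import sqrt

# hypothetical paired data: hours studied (x) and exam score (y)
x = [2, 4, 6, 8, 10]
y = [50, 60, 65, 80, 95]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))   # sample sd of x
s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))   # sample sd of y

r_xy = cov_xy / (s_x * s_y)   # unit-free, always between -1 and +1
```

Unlike the covariance, r_xy does not change if x or y is rescaled (e.g. hours converted to minutes).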
29. Scatter Diagram
• Also known as a scatter plot or x-y graph.
• A scatter diagram graphs paired numerical data, with one variable on each
axis, to verify a relationship between them.
• If the variables are correlated, the points will fall along a line or curve.
• The better the correlation, the tighter the points will hug the line.
Source: http://www.tylervigen.com/spurious-correlations
30. Scatter Diagram: When to Use
• When we have paired numerical data
• When the dependent variable may have multiple values for each value
of the independent variable.
• When trying to determine whether the two variables are related, such
as:
▪ When trying to identify potential root causes of problems.
▪ After considering causes and effects to determine objectively whether a
particular cause and effect are related.
▪ When determining whether two effects that appear to be related both
occur with the same cause.
▪ When checking for autocorrelation.
Source: https://asq.org/quality-resources/scatter-diagram
31. Scatter Diagram Considerations
• Even if the scatter diagram shows a relationship, do not assume that one
variable caused the other. Both may be influenced by a third variable.
• When the data are plotted, the more the diagram resembles a straight line,
the stronger the relationship.
• If a line is not clear, statistical measures determine whether there is
reasonable certainty that a relationship exists.
• If the statistics say that no relationship exists, the pattern could have
occurred by random chance.
• If the diagram shows no relationship, consider whether the independent
(𝑥-axis) variable has been varied widely.
• Sometimes a relationship is not apparent because the data do not cover a
wide enough range.
Source: https://asq.org/quality-resources/scatter-diagram
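A trendline is usually fitted by least squares. As a sketch (Python here for illustration), the slope and intercept for Anscombe’s dataset I recover the regression line y = 3.00 + 0.500x quoted earlier:

```python
# Anscombe's quartet, dataset I
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
# least-squares slope: S_xy / S_xx
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
```

The same two lines of algebra are what a charting tool computes when it draws a linear trendline through a scatter diagram.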
33. Normal Distribution
34. Distributions
• A distribution is simply a collection of data or scores (e.g. z-scores,
t-scores) of a variable.
• The values of a distribution are commonly ordered (e.g. from smallest
to largest).
• Distributions are commonly depicted using data visualisation tools
(e.g. charts).
• A probability distribution is a mathematical function that calculates the
probability of possible outcomes.
• Real-world data may or may not follow a particular established
theoretical distribution (i.e. theoretical distribution vs data
distribution).
35. A Simplified Map of Popular Distributions
Source: https://medium.com/mytake/understanding-different-types-of-distributions-you-will-encounter-as-a-data-scientist-27ea4c375eec
36. Normal Distribution
• The Normal (also known as the Gaussian) probability distribution is the most
important distribution for describing a continuous random variable in
statistics.
• It plays a crucial role in the theory of sampling and is widely used in statistical
inference.
• Many natural phenomena have patterns that resemble the normal
distribution (e.g. body weight, shoe size, IQ, etc).
• Many statistics are based on the assumption of normality.
37. • In terms of parameters, the Normal distribution has a mean μ and a
variance σ² (and, consequently, a standard deviation σ), determining
the centre and width of the distribution.
• The highest point on the Normal curve is at the mean, which is also the
median and the mode.
• The standard deviation determines the width of the curve.
Thus, larger values result in wider, flatter curves.
• The Normal curve is symmetric. Therefore, there is 0.5 probability to the left
of the mean and 0.5 probability to the right.
Normal Distribution (Cont.)
38. Graph of the Normal Distribution
• The shape of the Normal distribution resembles a bell (i.e. a bell-
shaped curve).
[Figure: bell-shaped density f(x) against x, centred at the mean μ]
39. Standard Normal Distribution
• A random variable that has a normal distribution with a mean of zero
and a standard deviation of one is said to have a standard normal
probability distribution.
• The letter z is commonly used to designate the standard normal
random variable. More specifically, a z-score.
• We calculate z-scores for a Normal distribution as follows:

z = (x − μ) / σ
• Intuitively, we may think of z as a measure of the number of standard
deviations that 𝑥 is distant from 𝜇.
40. Normal Table Applications
• We may use the standard Normal distribution in two ways, namely
forward and in reverse.
• Forward:
▪ For a given data value 𝑥, calculate 𝑧 and find the probability, or
area, associated with 𝑧.
• In reverse:
▪ For a given probability or area, find 𝑧 and then calculate the data
value 𝑥 associated with that area using the following formula:
𝑥 = 𝜇 + 𝑧𝜎
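Both directions can be sketched with Python’s `statistics.NormalDist` (the distribution parameters below are hypothetical; the module itself uses R):

```python
from statistics import NormalDist

mu, sigma = 100, 15          # hypothetical Normal population

# forward: from a data value x to z and its left-tail probability
z = (118 - mu) / sigma       # z = (x - mu) / sigma = 1.2
p = NormalDist().cdf(z)      # area to the left of z, ~0.8849

# in reverse: from a given area to z, and then back to a data value
z90 = NormalDist().inv_cdf(0.90)   # z with 0.90 to its left, ~1.28
x90 = mu + z90 * sigma             # x = mu + z*sigma
```

The forward direction answers “how unusual is this value?”; the reverse direction answers “which value cuts off a given proportion of the distribution?”.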
43. Standard Normal Distribution (z of +1.2)
[Figure: standard Normal curve with area 0.8849 shaded to the left of z = 1.2]
• It also means that 1 − 0.8849 (or about 11.5%) of the data falls between z =
1.2 and z = +∞ (i.e. the orange area in the figure above).
45. Rule 68-95-99.7 (or Empirical Rule)
• The 68-95-99.7 Rule (or Empirical Rule) is used to remember the
percentage of values that lie within an interval around the mean of the
Normal distribution.
• This rule works only with the Normal distribution.
• Approximately 68.3% of the data values will be within one standard
deviation of the mean.
• Approximately 95.5% of the data values will be within two standard
deviations of the mean.
• Approximately 99.7% (i.e. almost all) of the data values will be within
three standard deviations of the mean.
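The three percentages can be recovered directly from the standard Normal CDF; a quick Python check (illustrative only):

```python
from statistics import NormalDist

nd = NormalDist()                    # standard Normal: mean 0, sd 1
within_1 = nd.cdf(1) - nd.cdf(-1)    # ~0.683: within one sd of the mean
within_2 = nd.cdf(2) - nd.cdf(-2)    # ~0.954: within two sds
within_3 = nd.cdf(3) - nd.cdf(-3)    # ~0.997: within three sds
```

More precisely, the two-standard-deviation figure is about 95.45%, which is where the rounded 95.5% above comes from.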
47. Central Limit Theorem (CLT)
48. • This is one of the most important theorems in statistics.
• A sample size of 𝑛 ≥ 30 is considered large.
• Whenever the population has a normal distribution, the sampling
distribution of the sample mean has a normal distribution for any
sample size.
As sample size increases, the sampling distribution of the
sample mean rapidly approaches the bell shape of a normal
distribution, regardless of the shape of the parent population.
In small sample cases (𝑛 < 30), the sampling distribution of
sample mean will be normal so long as the parent
population is normal.
Central Limit Theorem (CLT)
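The CLT is easy to see by simulation. A sketch (Python, with a hypothetical right-skewed population): repeatedly draw samples of n = 30 from an exponential population with mean 1 and look at the distribution of the sample means.

```python
import random

random.seed(42)

def sample_mean(n):
    # one sample of size n from an exponential population with mean 1
    return sum(random.expovariate(1.0) for _ in range(n)) / n

# sampling distribution of the sample mean for n = 30
means = [sample_mean(30) for _ in range(5000)]
grand_mean = sum(means) / len(means)   # close to the population mean, 1
```

A histogram of `means` is roughly bell-shaped even though the parent population is heavily skewed.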
49. [Figure: The Sampling Distribution of the Sample Mean — for several population shapes, the distribution of x̄ at n = 2, n = 5, and n = 30 approaches the bell shape as n increases]
Central Limit Theorem (CLT) (Cont.)
50. [Figure: population distribution and the sampling distribution of the sample mean for samples of size n = 2, n = 8, and n = 20 selected from the same population]
Sampling Distribution of the Sample Mean
51. [Figure: approximately Normal sampling distribution of x̄]
Shape of the Sampling Distribution When Sample Size is Large (n > 30)
52. Law of Large Numbers (LLN)
53. Law of Large Numbers (LLN)
• The law of large numbers (LLN) consists of a theorem that describes
the result of performing the same experiment a large number of times.
• According to the LLN, the mean of the results obtained from a large
number of trials should be close to the expected value and tends
towards the expected value as more trials are performed.
• The LLN is relevant due to the fact that it guarantees stable long-term
results for the averages of some random events.
• For example, while a casino may lose money in a single spin of the
roulette wheel, its earnings will tend towards a predictable percentage
over a large number of spins.
Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
54. Law of Large Numbers (LLN) (Cont.)
• Any winning streak by a player will eventually be overcome by the
parameters of the game.
• Importantly, the LLN only applies when a large number of observations
is considered.
• There is no principle that a small number of observations will coincide
with the expected value or that a streak of one value will immediately
be “balanced” by the others (e.g. gambler’s fallacy).
• The LLN only applies to the mean value, as follows:

lim (n→∞) [ (1/n) Σᵢ₌₁ⁿ Xᵢ − X̄ ] = 0

where X̄ denotes the expected value.
Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
55. Law of Large Numbers (LLN) (Cont.)
• An illustration of the law of large
numbers using a particular run
of rolls of a single die.
• As the number of rolls in this run
increases, the average of the
values of all the results
approaches 3.5.
• Although each run would show a
distinctive shape over a small
number of throws (at the left),
over a large number of rolls (to
the right) the shapes would be
extremely similar.
Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
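The die-roll illustration can be reproduced in a few lines (a Python sketch; the seed and the number of rolls are arbitrary choices):

```python
import random

random.seed(1)

# roll a fair six-sided die many times
rolls = [random.randint(1, 6) for _ in range(100_000)]
running_average = sum(rolls) / len(rolls)   # approaches E[X] = 3.5
```

With only a handful of rolls the average can sit far from 3.5; it is the large number of trials that pulls it towards the expected value.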
56. Statistical Hypothesis Testing
57. • In hypothesis testing, a statement – call it a hypothesis – is made
about some characteristic of a particular population.
• A sample is then taken in an effort to establish whether or not
the statement is true.
• If the sample produces results that would be highly unlikely
under an assumption that the statement is true, then we’ll
conclude that the statement is false.
▪ The null hypothesis H0 is the statement to be tested.
▪ The alternative hypothesis Ha is the opposite of what is
stated in the null hypothesis.
The Nature of Hypothesis Testing
58. • The status quo or “if-it’s-not-broken-don’t-fix-it”
approach:
• Here the status quo (no change) position serves as the
null hypothesis.
• Compelling sample evidence to the contrary would
have to be produced before we’d conclude that a
change in prevailing conditions has occurred.
• This approach usually involves a decision that needs to
be made if the null hypothesis is rejected.
• Example:
• H0: The machine continues to function properly.
• Ha: The machine is not functioning properly.
Establishing the Hypotheses
59. • The skeptic’s approach:
• Here, in order to test claims of “new and improved” or
“better than” or “different from” what is currently the
case, the null hypothesis would reflect the skeptic’s
view which essentially says that “new is no better
than old.”
• This testing is essentially proof by contradiction.
• Example:
• H0: A proposed new headache remedy is no faster
than other commonly used treatments.
• Ha: A proposed new headache remedy is faster than
other commonly used treatments.
Establishing the Hypotheses (Cont.)
60. Standard Forms for the Null and Alternative Hypotheses
A hypothesis test for a population mean μ will take one of the following
three forms (where A represents the boundary value for the null
position):

One-tailed (lower tail): H0: μ ≥ A  vs  Ha: μ < A
One-tailed (upper tail): H0: μ ≤ A  vs  Ha: μ > A
Two-tailed:              H0: μ = A  vs  Ha: μ ≠ A

Establishing the Hypotheses (Cont.)
61. Developing a One-tailed Test
62. • If we’re going to set a boundary to separate “likely” from
“unlikely” sample results in the null sampling
distribution, we’ll need to define just what is meant by
“unlikely”.
• The value we choose, most commonly 0.05 or 5%, we’ll
label α and refer to it as the significance level of the test.
A significance level α is the probability value that
defines just what we mean by unlikely sample
results under an assumption that the null
hypothesis is true (as an equality)
Choosing a Significance Level
63. • Use the standard normal table to find the z value with an
area of α in the lower (or upper) tail of the distribution.
• The value of z that establishes the boundary of the reject-
H0 region is called the critical value for the test, z_c.
• To conduct the test, calculate the test statistic, z_stat.
• Decision rule (one-tailed):
• Lower tail: Reject H0 if z_stat < −z_c
• Upper tail: Reject H0 if z_stat > z_c
Establishing a Decision Rule
64. One-tailed (Lower) Test about a Population Mean
[Figure: for α = 0.05 — reject H0 in the lower tail (area α = 0.05) beyond the critical value z_c = −1.65; do not reject H0 elsewhere]
65. One-tailed (Upper) Test about a Population Mean
[Figure: for α = 0.05 — reject H0 in the upper tail (area α = 0.05) beyond the critical value z_c = 1.65; do not reject H0 elsewhere]
66. • Failing to reject a null hypothesis shouldn’t be taken to
mean that we necessarily agree that the claim is true.
• We’re simply concluding that there’s not enough sample
evidence to convince us that it’s false.
• It’s for this reason we’ve chosen to use the phrase “fail to
reject” rather than “accept” the claim.
• The court system gives us a good example of this
distinction.
Failing to convict a defendant doesn’t necessarily
mean that the jury believes the defendant is innocent.
It simply means that, in the jury’s judgment, there’s
not strong enough evidence to convince them to reject
that possibility.
Accepting vs Failing to Reject the Null Hypothesis
67. p-values
68. The p-value can be used to make the decision in a hypothesis test.
The p-value measures the probability that, if the
null hypothesis is true (as an equality), we would
randomly produce a sample result at least as
unlikely as the sample result that we actually
produce
p-value Decision Rule
If the p-value is less than α, reject the null hypothesis
P-values
69. Step 1: State the null and alternative hypotheses.
Step 2: Choose a test statistic and a significance level for the test.
Step 3: Compute the value of the test statistic from your sample data.
Step 4: Apply the appropriate decision rule and make your decision.
Critical value version: Use the significance level to establish the
critical value for the test statistic. If the test statistic is outside the
critical value, reject the null hypothesis.
P-value version: Use the test statistic to determine the p-value for
the sample result. If the p-value is less than α, the significance level
of the test, reject the null hypothesis.
Generalising the Test Procedure
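The four steps can be sketched end-to-end for an upper-tailed z test (Python here for illustration; all numbers are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

# Step 1: H0: mu = 50 vs Ha: mu > 50 (upper tail)
mu0, sigma, n, xbar = 50, 10, 36, 53.2   # hypothetical population sd and sample
alpha = 0.05                             # Step 2: significance level

# Step 3: compute the test statistic
z_stat = (xbar - mu0) / (sigma / sqrt(n))   # = 1.92

# Step 4 (p-value version): P(Z >= z_stat) under H0
p_value = 1 - NormalDist().cdf(z_stat)
reject_h0 = p_value < alpha
```

The critical-value version reaches the same decision here: z_stat = 1.92 exceeds z_c ≈ 1.65 for α = 0.05.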
70. Type I and II Errors
71. Whenever we make a judgment about a population parameter
based on sample information, there’s a chance we could be
wrong. In hypothesis testing, in fact, we can identify two
types of potential errors.
α, the significance level, measures the maximum
probability of making a Type I error.
Type I Error: Rejecting a true null hypothesis
Type II Error: Accepting a false null hypothesis
The Possibility of Error
72. • Type I Error: In hypothesis testing, we control for the risk of making a
Type I error when we set the value of α.
• Type II Error: Measuring and controlling for the risk of making a Type II
error, denoted by β, is more difficult.
• Statisticians avoid the risk of making a Type II error by using “do not
reject H0” instead of “accept H0”.
• Choosing a Significance Level: If the cost of a Type I error is high, we’ll
want to use a relatively small α in order to keep the risk low.
The Possibility of Error (Cont.)
74. Two-tailed Tests
75. Two-tailed Tests
Hypotheses: H0: μ = A  vs  Ha: μ ≠ A
• Level of significance: need to split into two areas and put
α/2 in each tail
• Critical value: have both an upper and a lower value, z_cu and z_cl
• Test statistic: same as before
• p-value: once found in the normal table, multiply by 2 to get
the correct value
• Rejection rule: Reject H0 if |z_stat| > z_{α/2} or p-value < α
76. [Figure: two-tailed test centred at μ = A — rejection regions of area α/2 in each tail, bounded by the critical values z_cL and z_cu]
Two-tailed Tests (Cont.)
77. • Form of hypotheses:
H0: μ = A
Ha: μ ≠ A
• We can conduct a two-tailed test of a population mean
simply by constructing a confidence interval around the
mean of a sample.
• If the confidence interval contains the hypothesised
value for μ, do not reject H0. Otherwise, reject H0.
Two-tailed Tests and Interval Estimation
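A minimal sketch of the interval approach (Python for illustration; the numbers are hypothetical, with a 95% interval so α = 0.05):

```python
from math import sqrt
from statistics import NormalDist

mu0 = 100                                 # hypothesised value under H0
sigma, n, xbar = 12, 49, 103.0            # hypothetical population sd and sample

z = NormalDist().inv_cdf(0.975)           # ~1.96 leaves alpha/2 in each tail
margin = z * sigma / sqrt(n)              # margin of error
ci = (xbar - margin, xbar + margin)       # 95% confidence interval for mu

reject_h0 = not (ci[0] <= mu0 <= ci[1])   # reject only if mu0 falls outside
```

Here the interval (about 99.6 to 106.4) contains 100, so H0 is not rejected.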
78. • Test statistic when s replaces σ:

t_stat = (x̄ − μ) / (s / √n)

This test statistic has a t distribution with n − 1 degrees of
freedom (used for small samples).
• Rejection rule:
One-tailed:
(1) H0: μ ≤ A — reject H0 if t_stat > t_c
(2) H0: μ ≥ A — reject H0 if t_stat < −t_c
Two-tailed:
(3) H0: μ = A — reject H0 if |t_stat| > t_{α/2}
Using the t Distribution
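A sketch of computing t_stat from a small hypothetical sample (Python for illustration; the critical value t_c would then come from a t table with n − 1 = 7 degrees of freedom):

```python
from math import sqrt

mu0 = 20                                                    # hypothesised mean under H0
sample = [21.3, 19.8, 22.1, 20.4, 21.7, 20.9, 19.5, 21.1]   # hypothetical data
n = len(sample)

xbar = sum(sample) / n
s = sqrt(sum((v - xbar) ** 2 for v in sample) / (n - 1))    # sample sd replaces sigma

t_stat = (xbar - mu0) / (s / sqrt(n))   # compare with t_c at df = n - 1
```

Because s is estimated from the same small sample, t_stat is compared against the wider-tailed t distribution rather than the standard Normal.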
79. • The t distribution table in most statistics books does not
have sufficient detail to determine the exact p-value for a
hypothesis test.
• We could use the t distribution table to identify an
approximate p-value.
• Computer software packages can provide the exact p-
value for the t distribution.
P-values and the t Distribution
80. • Relying only on data summaries may be misleading. Always visually explore your data.
• Widely used measures of association include covariance and correlation.
• Correlation does not imply causation!
• The Normal distribution plays a crucial role in the theory of sampling and is widely used in statistical
inference.
• The 68-95-99.7 Rule is used to remember the percentage of values that lie within an interval
around the mean of the Normal distribution.
• The Central Limit Theorem (CLT) establishes that, when the sample size is large enough, the
sampling distribution of the sample mean will be approximately Normal.
• According to the Law of Large Numbers (LLN), as the sample size increases, the sampling error tends
to decrease.
• The p-value measures the probability that, if the null hypothesis H0 is true, we would randomly
produce a sample result at least as unlikely as the sample result that we actually produce.
• If the p-value is less than 𝛼, we then reject the null hypothesis.
• Type I Error is when a true null hypothesis is rejected, and Type II Error is when a false null
hypothesis is accepted.
Takeaways
81. References
• Brooks, C. (2019). Introductory Econometrics for Finance. Cambridge
University Press.
• Evans, J. R., & Olson, D. L. (2007). Statistics, Data Analysis,
and Decision Modeling. New Jersey: Pearson/Prentice Hall.
• Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business
Statistics. Wiley Global Education.
• Render, B., Stair Jr, R. M., Hanna, M. E., & Hale, T. S. (2018). Quantitative
Analysis for Management (13th ed.). Prentice Hall.
82. Any Questions?
83. Thank You!