Fundamentals of
Quantitative Research
Methods and Data Analytics
BUS159
Week 2
1. Data visualisation
2. Covariance and correlation
3. Normal distribution
4. Central limit theorem (CLT)
5. Law of large numbers (LLN)
6. Hypothesis testing
Content
Assessment Profile
Mid-term Coursework Assignment: 30% of the overall mark
▪ List of five exercises to be performed remotely within a 24-hour period.
▪ Deadline: 18/03/2022 at 10:00am
Final Coursework Assignment: 70% of the overall mark
▪ Report showing a competent application of quantitative methods
and data analysis concepts learned in our module, exploring a topic
of your own interest.
▪ Word limit: 2000 words.
▪ Deadline: 22/04/2022 at 10:00am
Final Coursework Assignment 70%
1. Instructions and Guidance
• In this Report (2000 words), please proceed as follows:
• Select a topic you are genuinely interested in exploring.
If you would like me to select your topic, that is completely fine; just let me know and I will provide you with a topic to explore.
• Decide which research question you are going to address.
• Collect data related to your topic and research question.
• Decide which quantitative research method(s) you are going to adopt.
• Perform data analyses applying quantitative research method(s) learnt in this
module on your data using R/ R Studio.
• Detail the method(s) adopted and discuss your findings in your individual Report.
Final Coursework Assignment 70% (Cont.)
The structure of this Report should consist of the following brief sections:
• Section 1. Introduction: Briefly mention your topic, question, input data, and
analyses performed;
• Section 2. Data: Detail your dataset, including data source, temporal coverage,
sample size;
• Section 3. Results: Describe the quantitative research methods adopted and data
analyses performed, reporting your results using a complementary chart and
table, discussing your findings;
• Section 4. Conclusion: Summarise your Report, briefly describing the main
quantitative research method adopted as well as your most relevant/ interesting
finding.
• Appendix. Attach an image/ figure (e.g. code print screen) evidencing that you
performed your data analyses using R/ R Studio.
Final Coursework Assignment 70% (Cont.)
2. Assessment Rubric with Weighted Criteria
• Following the structure of the Report, five rubrics are assessed, each
item contributing its respective weight to the coursework assignment's
overall mark (totalling 100 points), as follows:
• Section 1. Introduction – weight: 15% of the coursework assignment overall
mark;
• Section 2. Data – weight: 20% of the coursework assignment overall mark;
• Section 3. Results – weight: 40% of the coursework assignment overall mark;
• Section 4. Conclusion – weight: 15% of the coursework assignment overall mark;
• Appendix – weight: 10% of the coursework assignment overall mark.
Final Coursework Assignment 70% (Cont.)
3. Assessment Criterion
This Report adopts the following undergraduate (UG) performance
thresholds:
• “Exceeds expectations” at equivalent of 60 or more points;
• “Meets expectations” at equivalent between 40 and 59 points;
• “Does not meet expectations” at equivalent of 39 or less points.
Defining Statistics
• Statistics has the power to turn raw data into information which may
effectively support the decision-making process.
In fact, this is the lifeblood of any modern business.
• It is a crucial link in the chain connecting data to information,
information to knowledge, and knowledge to action/ decision.
They are all part of an informed decision process.
• “Statistical thinking will one day be as necessary for efficient
citizenship as the ability to read and write” (H. G. Wells)
Statistics is the art and science of collecting, analysing,
interpreting and presenting data aiming at transforming that
data into useful information
Stats Recap
• Can you remember the following basic statistics concepts?
▪ p-value?
▪ Hypothesis testing?
▪ Type I and II errors?
▪ Normal distribution?
▪ Central limit theorem?
▪ Covariance and correlation?
• Understanding elementary statistics is crucial to navigate through most
of quantitative research methods and data analytics.
• That is the reason we are reviewing some key concepts in statistics.
Data Visualisation
Data Visualisation
• Never trust summary statistics alone.
• Always visually explore your data.
• Relying only on data summaries (e.g. mean, standard deviation,
correlations) may be misleading because wildly different datasets may
give similar – if not identical – results.
• This is a principle that has been demonstrated for decades, for instance
through the Anscombe’s Quartet (1973).
Source 1: https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html
Source 2: http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
• Anscombe’s quartet comprises four datasets that have nearly identical
simple descriptive statistics, yet have very different distributions and
appear very different when graphed.
• Each dataset consists of eleven (𝑥, 𝑦) points.
• They were constructed to demonstrate both the importance of
graphing data before analysing it and the effect of outliers on
statistical properties.
• They were created to counter the impression that “numerical
calculations are exact, but graphs are rough”.
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Data Visualisation (Cont.)
Dataset I Dataset II Dataset III Dataset IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Data Visualisation (Cont.)
Statistic/ Property              | Value                  | Accuracy
Mean of 𝑥                        | 9                      | Exact
Variance of 𝑥 (𝑠𝑥²)              | 11                     | Exact
Mean of 𝑦                        | 7.50                   | Two decimal places
Variance of 𝑦 (𝑠𝑦²)              | 4.125                  | ±0.003
Correlation between 𝑥 and 𝑦      | 0.816                  | Three decimal places
Linear regression line           | 𝑦 = 3.00 + 0.500𝑥      | Two and three decimal places, respectively
Coefficient of determination (𝑅²)| 0.67                   | Two decimal places
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Data Visualisation (Cont.)
• All four of Anscombe's quartet datasets yield the statistical measures
in the table above: identical summary statistics, but radically different charts.
[Figure: scatter plots of the four Anscombe datasets, each fitted by the same regression line]
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
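The identical-statistics claim can be checked directly. A minimal sketch (in Python rather than the module's R, purely for illustration) computes the shared summary statistics for datasets I and II from the data table above:

```python
from statistics import mean, variance

# Anscombe's quartet: x values shared by datasets I-III, plus y for I and II
x1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def corr(x, y):
    """Sample Pearson correlation, computed from first principles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (variance(x) ** 0.5 * variance(y) ** 0.5)

# Both datasets share (to rounding) the same summary statistics...
print(mean(x1), round(mean(y1), 2), round(corr(x1, y1), 3))  # 9.0 7.5 0.816
print(mean(x1), round(mean(y2), 2), round(corr(x1, y2), 3))  # 9.0 7.5 0.816
# ...yet a scatter plot shows dataset I as a linear cloud and dataset II as a curve.
```

Running the same check on datasets III and IV gives the same rounded values, which is exactly why the charts, not the summaries, reveal the differences.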
Data Visualisation (Cont.)
• In addition, it is also possible to generate bivariate data with a given
mean, median, and correlation in virtually any shape, from a circle to a
star to a dinosaur, as follows:
Source 1: https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html
Source 2: http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
Data Visualisation (Cont.)
Covariance and
Correlation
• Measures of association and related data visualisation techniques
include the following:
• Covariance
• Correlation
• Scatter diagram and trendline
• These show the degree of association or relationship between two
variables, but do not imply causation.
• The behaviour of one does not necessarily cause the behaviour of the
other.
Measures of Association
Basics of Data
• Statistics is the science of data.
• Data consist of the facts or figures that are the subject of
summarisation, analysis, modelling, and presentation.
• A dataset is a collection of data with some common connection.
For instance, the GDP of European countries from 2010 to 2020.
• A variable is a particular characteristic of interest within a group of
observations.
For instance, the GDP of Germany.
• An observation (observational unit or case) is a particular value
within a variable.
An example is the GDP of Germany in 2020.
Covariance
Covariance
• Positive values indicate a positive linear relationship
• Negative values indicate a negative linear relationship
• If the dataset refers to a sample, the covariance is denoted by
Cov(𝑥, 𝑦) and may be calculated as follows:
Cov(𝑥, 𝑦) = Σ(𝑥𝑖 − 𝑥̄)(𝑦𝑖 − 𝑦̄) / (𝑛 − 1)
• If the dataset refers to a population, the covariance is then calculated as
follows:
Cov(𝑥, 𝑦) = Σ(𝑥𝑖 − 𝜇𝑥)(𝑦𝑖 − 𝜇𝑦) / 𝑁
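The sample formula can be illustrated in a few lines. A minimal Python sketch (the module itself uses R; the data values here are hypothetical):

```python
def sample_cov(x, y):
    """Cov(x, y) = sum((x_i - x_bar)(y_i - y_bar)) / (n - 1) for a sample."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data: hours studied vs exam score (positive relationship)
hours = [1, 2, 3, 4, 5]
score = [52, 58, 63, 70, 77]
print(sample_cov(hours, score))  # 15.5 -- positive, so an upward-sloping relationship
```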
Covariance (Cont.)
• In (a), an upward sloping line best describes the points, indicating a
positive covariance.
• In (b), the downward sloping line implies a negative covariance.
• In (c), the line has 0 slope, which means a covariance of 0.
[Figure: three scatter plots of 𝑦 against 𝑥 showing (a) positive, (b) negative, and (c) zero covariance]
Correlation
Correlation
• The coefficient is a standardised measure (no units) and takes on
values between −1 and +1.
• Values near −1 suggest a strong negative linear relationship.
• Values near +1 suggest a strong positive linear relationship.
• If the datasets are samples, the coefficient is denoted by 𝑟𝑥𝑦, as follows:
𝑟𝑥𝑦 = Cov(𝑥, 𝑦) / (𝑠𝑥𝑠𝑦)
• If the datasets are populations, the coefficient is denoted by 𝜌𝑥𝑦, as
follows:
𝜌𝑥𝑦 = Cov(𝑥, 𝑦) / (𝜎𝑥𝜎𝑦)
Correlation (Cont.)
• The formula 𝑟𝑥𝑦 = Cov(𝑥, 𝑦) / (𝑠𝑥𝑠𝑦) may alternatively be understood as follows:
𝑟𝑥𝑦 = (amount that 𝑥 and 𝑦 vary together) / (total variability in 𝑥 and 𝑦)
• Correlation measures the strength of the relationship between two
variables.
• It aims to answer the following question:
▪ When 𝑥 gets larger, does 𝑦 consistently get larger (or smaller)?
• Often measured with Pearson’s correlation coefficient:
▪ Commonly called “correlation coefficient” or even only “correlation”
▪ Almost always represented with the letter 𝑟
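The "standardised measure" point can be made concrete: rescaling a variable changes the covariance but leaves 𝑟 untouched. A minimal Python sketch with made-up data:

```python
from math import sqrt

def pearson_r(x, y):
    """r = Cov(x, y) / (s_x * s_y): covariance rescaled to lie in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

# Multiplying x by 100 leaves r unchanged -- unlike covariance, r has no units
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(round(pearson_r(x, y), 3))                      # 0.853
print(round(pearson_r([100 * v for v in x], y), 3))   # 0.853, same value
```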
Correlation: Well-known Examples
[Figures: four slides of well-known correlation examples (images not reproduced)]
Scatter Diagram
• Also known as scatter plot or 𝑥-𝑦 graph.
• A scatter diagram graphs pairs of numerical data, with one variable on
each axis, to look for a relationship between them.
• If the variables are correlated, the points will fall along a line or curve.
• The better the correlation, the tighter the points will hug the line.
Source: http://www.tylervigen.com/spurious-correlations
Scatter Diagram: When to Use
• When we have paired numerical data
• When the dependent variable may have multiple values for each value
of the independent variable.
• When trying to determine whether the two variables are related, such
as:
▪ When trying to identify potential root causes of problems.
▪ After considering causes and effects to determine objectively whether a
particular cause and effect are related.
▪ When determining whether two effects that appear to be related both
occur with the same cause.
▪ When checking for autocorrelation.
Source: https://asq.org/quality-resources/scatter-diagram
Scatter Diagram Considerations
• Even if the scatter diagram shows a relationship, do not assume that one
variable caused the other. Both may be influenced by a third variable.
• When the data are plotted, the more the diagram resembles a straight line,
the stronger the relationship.
• If a line is not clear, statistical measures determine whether there is
reasonable certainty that a relationship exists.
• If the statistics say that no relationship exists, the pattern could have
occurred by random chance.
• If the diagram shows no relationship, consider whether the independent
(𝑥-axis) variable has been varied widely.
• Sometimes a relationship is not apparent because the data do not cover a
wide enough range.
Source: https://asq.org/quality-resources/scatter-diagram
Examples: Scatter Diagram
Source: https://techqualitypedia.com/scatter-diagram/
Normal
Distribution
Distributions
• A distribution is simply a collection of data or scores (e.g. z-scores,
t-scores) of a variable.
• The values of a distribution are commonly ordered (e.g. from smallest
to largest).
• Distributions are commonly depicted using data visualisation tools
(e.g. charts).
• A probability distribution is a mathematical function that calculates the
probability of possible outcomes.
• Real-world data may or may not follow a particular established
theoretical distribution (i.e. theoretical distribution vs data
distribution).
A Simplified Map of Popular Distributions
Source: https://medium.com/mytake/understanding-different-types-of-distributions-you-will-encounter-as-a-data-scientist-27ea4c375eec
Normal Distribution
• The Normal (also known as the Gaussian) probability distribution is the most
important distribution for describing a continuous random variable in
statistics.
• It plays a crucial role in the theory of sampling and is widely used in statistical
inference.
• Many natural phenomena have patterns that resemble the normal
distribution (e.g. body weight, shoe size, IQ, etc).
• Many statistics are based on the assumption of normality.
• In terms of parameters, the Normal distribution has a mean 𝜇 and
a variance 𝜎² (and, consequently, a standard deviation 𝜎), determining
the centre and width of the distribution.
• The highest point on the Normal curve is at the mean, which is also the
median and the mode.
• The standard deviation determines the width of the curve.
Thus, larger values result in wider, flatter curves.
• The Normal curve is symmetric. Therefore, 0.5 probability to the left of the
mean and 0.5 probability to the right.
Normal Distribution (Cont.)
Graph of the Normal Distribution
• The shape of the Normal distribution resembles a shape of a bell (i.e. bell-
shaped curve).
[Figure: bell-shaped Normal curve, 𝑓(𝑥) against 𝑥, centred at 𝜇 = mean]
Standard Normal Distribution
• A random variable that has a normal distribution with a mean of zero
and a standard deviation of one is said to have a standard normal
probability distribution.
• The letter z is commonly used to designate the standard normal
random variable. More specifically, a z-score.
• We calculate z-scores for a Normal distribution as follows:
𝑧 = (𝑥 − 𝜇) / 𝜎
• Intuitively, we may think of z as a measure of the number of standard
deviations that 𝑥 is distant from 𝜇.
Normal Table Applications
• We may use the standard Normal distribution in two ways, namely
forward and in reverse.
• Forward:
▪ For a given data value 𝑥, calculate 𝑧 and find the probability, or
area, associated with 𝑧.
• In reverse:
▪ For a given probability or area, find 𝑧 and then calculate the data
value 𝑥 associated with that area using the following formula:
𝑥 = 𝜇 + 𝑧𝜎
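Both directions can be sketched in a few lines. The numbers below (𝜇 = 100, 𝜎 = 10, 𝑥 = 112) are assumed purely to reproduce the 𝑧 = 1.2 example on the next slides, and the CDF is computed from the error function rather than read off a printed table:

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Forward: for a data value x, compute z and look up its probability
mu, sigma, x = 100, 10, 112      # hypothetical example values
z = (x - mu) / sigma             # 1.2
print(round(phi(z), 4))          # 0.8849, matching the table entry for z = 1.2

# In reverse: the data value sitting at z = 1.2 is recovered with x = mu + z*sigma
print(mu + 1.2 * sigma)          # 112.0
```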
Standard Normal Distribution (𝒛 of +1.2)
[Figure: standard Normal curve with the area 0.8849 shaded to the left of 𝑧 = 1.2]
Standard Normal Table
𝑧 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
1.0 0.841 0.844 0.846 0.849 0.851 0.853 0.855 0.858 0.86 0.862
1.1 0.864 0.867 0.869 0.871 0.873 0.875 0.877 0.879 0.881 0.883
1.2 0.885 0.887 0.889 0.891 0.893 0.894 0.896 0.898 0.9 0.902
1.3 0.903 0.905 0.907 0.908 0.91 0.912 0.913 0.915 0.916 0.918
1.4 0.919 0.921 0.922 0.924 0.925 0.927 0.928 0.929 0.931 0.932
1.5 0.933 0.935 0.936 0.937 0.938 0.939 0.941 0.942 0.943 0.944
1.6 0.945 0.946 0.947 0.948 0.95 0.951 0.952 0.953 0.954 0.955
1.7 0.955 0.956 0.957 0.958 0.959 0.96 0.961 0.962 0.963 0.963
1.8 0.964 0.965 0.966 0.966 0.967 0.968 0.969 0.969 0.97 0.971
1.9 0.971 0.972 0.973 0.973 0.974 0.974 0.975 0.976 0.976 0.977
• This means that 0.885 (or 88.5%) of the data falls between 𝑧 = −∞ and
𝑧 = 1.2.
Standard Normal Distribution (𝒛 of +1.2) (Cont.)
[Figure: standard Normal curve with the area to the right of 𝑧 = 1.2 shaded orange]
• It also means that 1 − 0.885 (or 11.5%) of the data falls between 𝑧 =
1.2 and 𝑧 = +∞ (i.e. orange area in the figure above).
Traditional Standard Normal Table
Source: https://itfeature.com/statistical-tables/standard-normal-table
Rule 68-95-99.7 (or Empirical Rule)
• The Rule 68-95-99.7 (or Empirical Rule) is used to remember the
percentage of values that lie within an interval estimate of the Normal
distribution.
• This rule applies only to the Normal distribution.
• Approximately 68.3% of the data values will be within one standard
deviation of the mean.
• Approximately 95.5% of the data values will be within two standard
deviations of the mean.
• Approximately 99.7% (i.e. almost all) of the data values will be within
three standard deviations of the mean.
Rule 68-95-99.7 (or Empirical Rule) (Cont.)
[Figure: Normal curve annotated with the 68%, 95%, and 99.7% intervals]
Source: https://graphworkflow.com/eda/normality/
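The three percentages of the empirical rule can be recovered from the standard Normal CDF. A short Python check (using the exact CDF rather than a table):

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```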
Basics of Data
• Statistics is the science of data.
• Data consist of the facts or figures that are the subject of
summarisation, analysis, modelling, and presentation.
• A dataset is a collection of data with some common connection.
For instance, the GDP of European countries from 2010 to 2020.
• A variable is a particular characteristic of interest within a group of
observations.
For instance, the GDP of Germany.
• An observation (observational unit or case) is a particular value
comprising a variable.
An example can be the GDP of Germany in 2020.
Central Limit
Theorem (CLT)
• This is one of the most important theorems in statistics.
• A sample size of 𝑛 ≥ 30 is considered large.
• Whenever the population has a normal distribution, the sampling
distribution of the sample mean has a normal distribution for any
sample size.
As sample size increases, the sampling distribution of the
sample mean rapidly approaches the bell shape of a normal
distribution, regardless of the shape of the parent population.
In small sample cases (𝑛 < 30), the sampling distribution of
sample mean will be normal so long as the parent
population is normal.
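The theorem can also be seen empirically. The sketch below (Python for illustration; the heavily skewed exponential parent population is an arbitrary choice) draws many samples of size 𝑛 = 30 and inspects the distribution of their means:

```python
import random
from statistics import mean, stdev

random.seed(42)  # arbitrary seed, for reproducibility

# Parent population: exponential with mean 1 -- heavily skewed, far from Normal
def draw_sample_mean(n):
    return mean(random.expovariate(1.0) for _ in range(n))

# Distribution of 2000 sample means, each from a sample of size n = 30
means = [draw_sample_mean(30) for _ in range(2000)]
print(round(mean(means), 2))   # close to the population mean of 1
print(round(stdev(means), 2))  # close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18
```

A histogram of `means` would already look bell-shaped, even though the parent population is strongly skewed.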
Central Limit Theorem (CLT)
[Figure: various population shapes and the corresponding sampling distributions
of the sample mean for sample sizes n = 2, n = 5, and n = 30]
Central Limit Theorem (CLT) (Cont.)
[Figure: population distribution and sampling distributions of the sample mean
for samples of size n = 2, n = 8, and n = 20 selected from the same population;
the shape of the sampling distribution approaches the Normal when the sample
size is large (n > 30)]
Law of Large
Numbers (LLN)
Law of Large Numbers (LLN)
• The law of large numbers (LLN) is a theorem that describes the result
of performing the same experiment a large number of times.
• According to the LLN, the mean of the results obtained from a large
number of trials should be close to the expected value and tends
towards the expected value as more trials are performed.
• The LLN is relevant because it guarantees stable long-term results for
the averages of some random events.
• For example, while a casino may lose money in a single spin of the
roulette wheel, its earnings will tend towards a predictable percentage
over a large number of spins.
Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
Law of Large Numbers (LLN) (Cont.)
• Any winning streak by a player will eventually be overcome by the
parameters of the game.
• Importantly, the LLN only applies when a large number of observations
is considered.
• There is no principle that a small number of observations will coincide
with the expected value or that a streak of one value will immediately
be “balanced” by the others (e.g. gambler’s fallacy).
• The LLN only applies to the mean value, as follows:
lim𝑛→∞ [(1/𝑛) Σ 𝑋𝑖 − 𝑋̄] = 0
where 𝑋̄ denotes the expected value.
Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
Law of Large Numbers (LLN) (Cont.)
• An illustration of the law of large
numbers using a particular run
of rolls of a single die.
• As the number of rolls in this run
increases, the average of the
values of all the results
approaches 3.5.
• Although each run would show a
distinctive shape over a small
number of throws (at the left),
over a large number of rolls (to
the right) the shapes would be
extremely similar.
Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
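The die-roll illustration described above is easy to reproduce. A minimal simulation (Python for illustration; the seed is arbitrary):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

def running_average(n):
    """Average of n rolls of a fair die (expected value 3.5)."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, running_average(n))
# the averages drift toward 3.5 as n grows, as the LLN predicts
```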
Statistical
Hypothesis Testing
• In hypothesis testing, a statement – call it a hypothesis – is made
about some characteristic of a particular population.
• A sample is then taken in an effort to establish whether or not
the statement is true.
• If the sample produces results that would be highly unlikely
under an assumption that the statement is true, then we’ll
conclude that the statement is false.
▪ The null hypothesis H0 is the statement to be tested.
▪ The alternative hypothesis Ha is the opposite of what is
stated in the null hypothesis.
The Nature of Hypothesis Testing
• The status quo or “if-it’s-not-broken-don’t-fix-it”
approach:
• Here the status quo (no change) position serves as the
null hypothesis.
• Compelling sample evidence to the contrary would
have to be produced before we’d conclude that a
change in prevailing conditions has occurred.
• This approach usually involves a decision that needs to
be made if the null hypothesis is rejected.
• Example:
• H0: The machine continues to function properly.
• Ha: The machine is not functioning properly.
Establishing the Hypotheses
• The skeptic’s approach:
• Here, in order to test claims of “new and improved” or
“better than” or “different from” what is currently the
case, the null hypothesis would reflect the skeptic’s
view which essentially says that “new is no better
than old.”
• This testing is essentially proof by contradiction.
• Example:
• H0: A proposed new headache remedy is no faster
than other commonly used treatments.
• Ha: A proposed new headache remedy is faster than
other commonly used treatments.
Establishing the Hypotheses (Cont.)
Standard Forms for the Null and Alternative Hypotheses
A hypothesis test for a population mean 𝜇 will take one of the following
three forms (where A represents the boundary value for the null
position):
H0: 𝜇 ≥ A        H0: 𝜇 ≤ A        H0: 𝜇 = A
Ha: 𝜇 < A        Ha: 𝜇 > A        Ha: 𝜇 ≠ A
One-tailed       One-tailed       Two-tailed
(lower tail)     (upper tail)
Establishing the Hypotheses (Cont.)
Developing a One-
tailed Test
• If we’re going to set a boundary to separate “likely” from
“unlikely” sample results in the null sampling
distribution, we’ll need to define just what is meant by
“unlikely.”
• The value we choose, most commonly 0.05 or 5%, we'll
label 𝛼 and refer to it as the significance level of the test
A significance level 𝛼 is the probability value that
defines just what we mean by unlikely sample
results under an assumption that the null
hypothesis is true (as an equality)
Choosing a Significance Level
• Use the standard Normal table to find the 𝑧 value with an
area of 𝛼 in the lower (or upper) tail of the distribution.
• The value of 𝑧 that establishes the boundary of the reject-H0
region is called the critical value for the test, 𝑧𝑐.
• To conduct the test, calculate the test statistic, 𝑧stat.
• Decision rule (one-tail):
• Lower tail: Reject H0 if 𝑧stat < −𝑧𝑐
• Upper tail: Reject H0 if 𝑧stat > 𝑧𝑐
Establishing a Decision Rule
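The decision rule can be walked through end-to-end. All numbers below are hypothetical, chosen to give a lower-tail example at 𝛼 = 0.05 (Python for illustration; the module itself uses R):

```python
from math import sqrt

# Hypothetical lower-tail test: H0: mu >= 500 vs Ha: mu < 500, alpha = 0.05
mu0, sigma, n, xbar = 500, 20, 36, 492

z_stat = (xbar - mu0) / (sigma / sqrt(n))
z_c = 1.645  # critical value for alpha = 0.05 (lower tail rejects below -z_c)

print(round(z_stat, 2))  # -2.4
print("reject H0" if z_stat < -z_c else "fail to reject H0")  # reject H0
```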
One-tailed (Lower) Test about a Population Mean
[Figure: for 𝛼 = 0.05, standard Normal curve with the rejection region
(area 𝛼 = 0.05) to the left of 𝑧𝑐 = −1.65]
One-tailed (Upper) Test about a Population Mean
[Figure: for 𝛼 = 0.05, standard Normal curve with the rejection region
(area 𝛼 = 0.05) to the right of 𝑧𝑐 = +1.65]
• Failing to reject a null hypothesis shouldn’t be taken to
mean that we necessarily agree that the claim is true.
• We’re simply concluding that there’s not enough sample
evidence to convince us that it’s false.
• It’s for this reason we’ve chosen to use the phrase “fail to
reject” rather than “accept” the claim.
• The court system gives us a good example of this
distinction.
Failing to convict a defendant doesn’t necessarily
mean that the jury believes the defendant is innocent.
It simply means that, in the jury’s judgment, there’s
not strong enough evidence to convince them to reject
that possibility.
Accepting vs Failing to Reject the Null Hypothesis
p-values
The p-value can be used to make the decision in a hypothesis test.
The p-value measures the probability that, if the
null hypothesis is true (as an equality), we would
randomly produce a sample result at least as
unlikely as the sample result that we actually
produce
p-value Decision Rule
If the p-value is less than 𝛼, reject the null hypothesis
P-values
Step 1: State the null and alternative hypotheses.
Step 2: Choose a test statistic and a significance level for the test.
Step 3: Compute the value of the test statistic from your sample data.
Step 4: Apply the appropriate decision rule and make your decision.
Critical value version: Use the significance level to establish the
critical value for the test statistic. If the test statistic is beyond the
critical value, reject the null hypothesis.
p-value version: Use the test statistic to determine the p-value for
the sample result. If the p-value is less than 𝛼, the significance level
of the test, reject the null hypothesis.
Generalising the Test Procedure
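The four steps can be sketched for the p-value version of the procedure. The hypotheses and numbers are invented for illustration (Python rather than the module's R):

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Step 1 (hypothetical): H0: mu <= 50 vs Ha: mu > 50 (upper tail)
mu0, sigma, n, xbar, alpha = 50, 8, 64, 52.5, 0.05

# Steps 2-3: z test statistic at significance level alpha
z_stat = (xbar - mu0) / (sigma / sqrt(n))  # 2.5

# Step 4, p-value version: probability of a result at least this extreme under H0
p_value = 1 - phi(z_stat)
print(round(z_stat, 2), round(p_value, 4))  # 2.5 0.0062
print("reject H0" if p_value < alpha else "fail to reject H0")  # reject H0
```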
Type I and II Errors
Whenever we make a judgment about a population parameter
based on sample information, there’s a chance we could be
wrong. In hypothesis testing, in fact, we can identify two
types of potential errors.
𝛼, the significance level, measures the maximum
probability of making a Type I error.
Type I Error: Rejecting a true null hypothesis
Type II Error: Accepting a false null hypothesis
The Possibility of Error
• Type I Error: In hypothesis testing, we control the risk of making a
Type I error when we set the value of 𝛼.
• Type II Error: Measuring and controlling the risk of making a Type II
error, denoted by 𝛽, is more difficult.
• Statisticians avoid committing to a Type II error by saying “do not
reject H0” instead of “accept H0”.
• Choosing a Significance Level: If the cost of a Type I error is high, we'll
want to use a relatively small 𝛼 in order to keep the risk low.
The Possibility of Error (Cont.)
Source 1: https://stats.stackexchange.com/questions/471603/type-1-and-type-2-error
Source 2: https://corporatefinanceinstitute.com/resources/knowledge/other/hypothesis-testing/
Visualisation: Type I and II Errors
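The meaning of 𝛼 as the maximum Type I error probability can be checked by simulation: when H0 is true, a test at 𝛼 = 0.05 should reject about 5% of the time. A minimal sketch (population parameters chosen arbitrarily):

```python
import random
from math import sqrt
from statistics import mean

random.seed(7)  # arbitrary seed, for reproducibility
alpha, z_c, n, trials = 0.05, 1.96, 25, 4000
rejections = 0

# H0 is TRUE in every trial (mu really is 0), so each rejection is a Type I error
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z_stat = mean(sample) / (1 / sqrt(n))
    if abs(z_stat) > z_c:  # two-tailed test at alpha = 0.05
        rejections += 1

print(rejections / trials)  # close to alpha = 0.05
```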
Two-tailed Tests
Hypotheses: H0: 𝜇 = A
Ha: 𝜇 ≠ A
• Level of significance: split 𝛼 into two areas and put
𝛼/2 in each tail
• Critical values: both an upper and a lower value, 𝑧𝑐𝑢 and 𝑧𝑐𝑙
• Test statistic: same as before
• p-value: once found in the Normal table, multiply by 2 to get
the correct value
• Rejection rule:
Reject H0 if |𝑧stat| > 𝑧𝛼/2 or p-value < 𝛼
Two-tailed Tests
[Figure: two-tailed test; standard Normal curve centred at 𝜇 = A with
rejection regions of area 𝛼/2 in each tail, beyond 𝑧𝑐𝑙 and 𝑧𝑐𝑢]
Two-tailed Tests (Cont.)
• Form of hypotheses:
H0: 𝜇 = A
Ha: 𝜇 ≠ A
• We can conduct a two-tailed test of a population mean
simply by constructing a confidence interval around the
mean of a sample.
• If the confidence interval contains the hypothesised
value for 𝜇, do not reject H0. Otherwise, reject H0.
Two-tailed Tests and Interval Estimation
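A minimal sketch of this equivalence (Python for illustration, with hypothetical figures x̄ = 52, μ0 = 50, σ = 8, n = 64): build the confidence interval and reject H0 exactly when the hypothesised mean falls outside it.

```python
import math
from statistics import NormalDist

# Hypothetical sample figures (not from the slides)
x_bar, mu0, sigma, n = 52.0, 50.0, 8.0, 64
alpha = 0.05

z_half = NormalDist().inv_cdf(1 - alpha / 2)        # z_(alpha/2), about 1.96
margin = z_half * sigma / math.sqrt(n)
ci_low, ci_high = x_bar - margin, x_bar + margin    # 95% confidence interval

# Decision rule: reject H0 exactly when mu0 lies outside the interval
reject = not (ci_low <= mu0 <= ci_high)
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}), reject H0: {reject}")
```

The interval (about 50.04 to 53.96) does not contain 50, so H0 is rejected, the same decision the z-test gives for these figures.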
• Test Statistic when s replaces σ:
tstat = (x̄ − μ) / (s / √n)
This test statistic has a t distribution with n − 1 degrees of
freedom (used for small samples).
• Rejection Rule
One-Tailed:
(1) H0: μ ≤ A — Reject H0 if tstat > tc
(2) H0: μ ≥ A — Reject H0 if tstat < −tc
Two-Tailed:
(3) H0: μ = A — Reject H0 if |tstat| > tα/2
Using the t Distribution
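A small worked sketch of the t-based rule (Python for illustration; the sample figures and the table value t(0.025, 24) ≈ 2.064 are illustrative assumptions):

```python
import math

# Hypothetical small-sample figures (not from the slides)
x_bar, mu0, s, n = 52.0, 50.0, 8.0, 25
df = n - 1                          # degrees of freedom

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
t_crit = 2.064                      # t_(0.025, 24), taken from a standard t table

reject = abs(t_stat) > t_crit       # two-tailed rejection rule
print(f"t = {t_stat:.2f} on {df} df, reject H0: {reject}")
```

Here t = 1.25 does not exceed 2.064, so H0 is not rejected. In practice, software returns the exact p-value directly, e.g. `2 * pt(-abs(t_stat), df)` in R or `2 * scipy.stats.t.sf(abs(t_stat), df)` in Python.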
• The t distribution table in most statistics books does not
have sufficient detail to determine the exact p-value for a
hypothesis test.
• We could use the t distribution table to identify an
approximate p-value.
• Computer software packages can provide the exact p-value
for the t distribution.
P-values and the t Distribution
• Relying only on data summaries may be misleading. Always visually explore your data.
• Widely used measures of association include covariance and correlation.
• Correlation does not imply causation!
• The Normal distribution plays a crucial role in the theory of sampling and is widely used in statistical
inference.
• The 68-95-99.7 Rule helps us remember the percentage of values that lie within one, two, and three
standard deviations of the mean of a Normal distribution.
• The Central Limit Theorem (CLT) establishes that, provided the sample size is large enough, the
sampling distribution of the sample mean will be approximately Normal, regardless of the shape of the parent population.
• According to the Law of Large Numbers (LLN), as the sample size increases, the sampling error tends
to decrease.
• The p-value measures the probability that, if the null hypothesis H0 were true, we would randomly
obtain a sample result at least as extreme as the one actually observed.
• If the p-value is less than α, we reject the null hypothesis.
• A Type I Error occurs when a true null hypothesis is rejected; a Type II Error occurs when a false null
hypothesis is not rejected (informally, “accepted”).
Takeaways
References
• Brooks, C. (2019). Introductory Econometrics for Finance. Cambridge
University Press.
• Evans, J. R., & Olson, D. L. (2007). Statistics, Data Analysis,
and Decision Modeling. New Jersey: Pearson/Prentice Hall.
• Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business
Statistics. Wiley Global Education.
• Render, B., Stair Jr, R. M., Hanna, M. E., & Hale, T. S. (2018). Quantitative
Analysis for Management (13th ed.). Prentice Hall.
Any Questions?
Thank You!

Mais conteúdo relacionado

Semelhante a Week_2_Lecture.pdf

Final spss hands on training (descriptive analysis) may 24th 2013
Final spss  hands on training (descriptive analysis) may 24th 2013Final spss  hands on training (descriptive analysis) may 24th 2013
Final spss hands on training (descriptive analysis) may 24th 2013
Tin Myo Han
 
Statics for the management
Statics for the managementStatics for the management
Statics for the management
Rohit Mishra
 

Semelhante a Week_2_Lecture.pdf (20)

Introduction to statistics.pptx
Introduction to statistics.pptxIntroduction to statistics.pptx
Introduction to statistics.pptx
 
Data analysis and Interpretation
Data analysis and InterpretationData analysis and Interpretation
Data analysis and Interpretation
 
QUANTITATIVE-DATA.pptx
QUANTITATIVE-DATA.pptxQUANTITATIVE-DATA.pptx
QUANTITATIVE-DATA.pptx
 
Intro scikitlearnstatsmodels
Intro scikitlearnstatsmodelsIntro scikitlearnstatsmodels
Intro scikitlearnstatsmodels
 
Data science 101
Data science 101Data science 101
Data science 101
 
Presentation of Project and Critique.pptx
Presentation of Project and Critique.pptxPresentation of Project and Critique.pptx
Presentation of Project and Critique.pptx
 
analysis plan.ppt
analysis plan.pptanalysis plan.ppt
analysis plan.ppt
 
Chapter 7 Knowing Our Data
Chapter 7 Knowing Our DataChapter 7 Knowing Our Data
Chapter 7 Knowing Our Data
 
Final spss hands on training (descriptive analysis) may 24th 2013
Final spss  hands on training (descriptive analysis) may 24th 2013Final spss  hands on training (descriptive analysis) may 24th 2013
Final spss hands on training (descriptive analysis) may 24th 2013
 
Week11-EvaluationMethods.ppt
Week11-EvaluationMethods.pptWeek11-EvaluationMethods.ppt
Week11-EvaluationMethods.ppt
 
Statics for the management
Statics for the managementStatics for the management
Statics for the management
 
Statics for the management
Statics for the managementStatics for the management
Statics for the management
 
Data analysis
Data analysisData analysis
Data analysis
 
Session 1 and 2.pptx
Session 1 and 2.pptxSession 1 and 2.pptx
Session 1 and 2.pptx
 
Chapter-Four.pdf
Chapter-Four.pdfChapter-Four.pdf
Chapter-Four.pdf
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1
 
Introduction to biostatistics
Introduction to biostatisticsIntroduction to biostatistics
Introduction to biostatistics
 
Nursing Data Analysis.pptx
Nursing Data Analysis.pptxNursing Data Analysis.pptx
Nursing Data Analysis.pptx
 

Último

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
Lokesh Kothari
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 

Último (20)

9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 

Week_2_Lecture.pdf

  • 1. Fundamentals of Quantitative Research Methods and Data Analytics BUS159 Week 2
  • 2. 1. Data visualisation 2. Covariance and correlation 3. Normal distribution 4. Central limit theorem (CLT) 5. Law of large numbers (LLN) 6. Hypothesis testing Content
  • 3. Assessment Profile Mid-term Coursework Assignment: 30% of the overall mark ▪ List of five exercises to be performed remotely within a 24-hours period. ▪ Deadline: 18/03/2022 at 10:00am Final Coursework Assignment: 70% of the overall mark ▪ Report showing a competent application of quantitative methods and data analysis concepts learned in our module, exploring a topic of your own interest. ▪ Word limit: 2000 words. ▪ Deadline: 22/04/2022 at 10:00am
  • 4. Final Coursework Assignment 70% 1. Instructions and Guidance • In this Report (2000 words), please proceed as follows: • Select a topic which you are really interested exploring. In the case you want me to select your, that is completely fine, please just inform me about this and I will provide you a topic to be explored. • Decide which research question are you going to address. • Collect data related to your topic and research question. • Decide which quantitative research method(s) are you going to adopt. • Perform data analyses applying quantitative research method(s) learnt in this module on your data using R/ R Studio. • Detail the method(s) adopted and discuss your findings in your individual Report.
  • 5. Final Coursework Assignment 70% (Cont.) The structure of this Report should consist of the following brief sections: • Section 1. Introduction: Briefly mention your topic, question, input data, and analyses performed; • Section 2. Data: Detail your dataset, including data source, temporal coverage, sample size; • Section 3. Results: Describe the quantitative research methods adopted and data analyses performed, reporting your results using a complementary chart and table, discussing your findings; • Section 4. Conclusion: Summarise your Report, briefly describing the main quantitative research method adopted as well as your most relevant/ interesting finding. • Appendix. Attach an image/ figure (e.g. code print screen) evidencing that you performed your data analyses using R/ R Studio.
  • 6. Final Coursework Assignment 70% (Cont.) 2. Assessment Rubric with Weighted Criteria • Following the structure of the Report, five rubrics are assessed, each item contributing with its respective weight to this coursework assignment overall mark (totalling 100 points), as follows: • Section 1. Introduction – weight: 15% of the coursework assignment overall mark; • Section 2. Data – weight: 20% of the coursework assignment overall mark; • Section 3. Results – weight: 40% of the coursework assignment overall mark; • Section 4. Conclusion – weight: 15% of the coursework assignment overall mark; • Appendix – weight: 10% of the coursework assignment overall mark.
  • 7. Final Coursework Assignment 70% (Cont.) 3. Assessment Criterion This Report adopts the following undergraduate (UG) performance thresholds: • “Exceeds expectations” at equivalent of 60 or more points; • “Meets expectations” at equivalent between 40 and 59 points; • “Does not meet expectations” at equivalent of 39 or less points.
  • 8. Defining Statistics • Statistics has the power to turn raw data into information which may effectively support the decision-making process. In fact, this is the life’s blood of any modern business. • It is a crucial linking the chain connecting data to information, information to knowledge, and knowledge to action/ decision. They are all part of an informed decision process. • “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write” (H. G. Wells) Statistics is the art and science of collecting, analysing, interpreting and presenting data aiming at transforming that data into useful information Stats Recap
  • 9. Stats Recap • Can you remember the following basic statistics concepts? ▪ p-value? ▪ Hypothesis testing? ▪ Type I and II errors? ▪ Normal distribution? ▪ Central limit theorem? ▪ Covariance and correlation? • Understanding elementary statistics is crucial to navigate through most of quantitative research methods and data analytics. • That is the reason we are reviewing some key concepts in statistics.
  • 10. Defining Statistics • Statistics has the power to turn raw data into information which may effectively support the decision-making process. In fact, this is the life’s blood of any modern business. • It is a crucial linking the chain connecting data to information, information to knowledge, and knowledge to action/ decision. They are all part of an informed decision process. • “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write” (H. G. Wells) Statistics is the art and science of collecting, analysing, interpreting and presenting data aiming at transforming that data into useful information Data Visualisation
  • 11. Data Visualisation • Never trust summary statistics alone. • Always visually explore your data. • Relying only on data summaries (e.g. mean, standard deviation, correlations) may be misleading because wildly different datasets may give similar – if not identical – results. • This is a principle that has been demonstrated for decades, for instance through the Anscombe’s Quartet (1973). Source 1: https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html Source 2: http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
  • 12. • Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. • Each dataset consists of eleven (𝑥, 𝑦) points. • They were constructed to demonstrate both the importance of graphing data before analysing it. • It was created in to oppose to the impression “numerical calculations are exact, but graphs are rough”. Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet Data Visualisation (Cont.)
  • 13. Dataset I Dataset II Dataset III Dataset IV x y x y x y x y 10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58 8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76 13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71 9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84 11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47 14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04 6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25 4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50 12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56 7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91 5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89 Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet Data Visualisation (Cont.)
  • 14. Statistic/ Property Value Accuracy Mean of 𝑥 9 Exact Variance of 𝑥: 𝑠𝑥 2 11 Exact Mean of 𝑦 7.50 Two decimal places Variance of 𝑦: 𝑠𝑦 2 4.125 ±0.003 Correlation between 𝑥 and 𝑦 0.816 Three decimal places Linear regression line 𝑦 = 3.00 + 0.500𝑥 Two and three decimal places, respectively Coefficient of determination: 𝑅2 0.67 Two decimal places Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet Data Visualisation (Cont.) • All four Anscombe’s quartet datasets yield the following statistical measures:
  • 15. • Identical summary statistics but radically different charts: Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet Data Visualisation (Cont.)
  • 16. • In addition, it is also possible to generate bivariate data with a given mean, median, and correlation in virtually any shape, from a circle to a star to a dinosaur, as follows: Source 1: https://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html Source 2: http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html Data Visualisation (Cont.)
  • 17. Defining Statistics • Statistics has the power to turn raw data into information which may effectively support the decision-making process. In fact, this is the life’s blood of any modern business. • It is a crucial linking the chain connecting data to information, information to knowledge, and knowledge to action/ decision. They are all part of an informed decision process. • “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write” (H. G. Wells) Statistics is the art and science of collecting, analysing, interpreting and presenting data aiming at transforming that data into useful information Covariance and Correlation
  • 18. • Measures of association and related data visualisation techniques include the following: • Covariance • Correlation • Scatter diagram and trendline • These show the degree of association or relationship between two variables, but do not imply causation. • The behaviour of one does not necessarily cause the behaviour of the other. Measures of Association
  • 19. Basics of Data • Statistics is the science of data. • Data consist of the facts or figures that are the subject of summarisation, analysis, modelling, and presentation. • A dataset is a collection of data with some common connection. For instance, the GDP of European countries from 2010 to 2020. • A variable is a particular characteristic of interest within a group of observations. For instance, the GDP of Germany. • An observation (observational unit or case) is a particular value comprising a variable. An example can be the GDP of Germany in 2020. Covariance
  • 20. Covariance • Positive values indicate a positive linear relationship • Negative values indicate a negative linear relationship • If the dataset refers to a sample, the covariance is denoted by Cov 𝑥, 𝑦 • The covariance may be calculated as follows: Cov 𝑥, 𝑦 = σ 𝑥𝑖 − ҧ 𝑥 𝑦𝑖 − ത 𝑦 𝑛 − 1 • If the dataset refers to a population, the covariance is then calculated as follows: Cov 𝑥, 𝑦 = σ 𝑥𝑖 − 𝜇𝑥 𝑦𝑖 − 𝜇𝑦 𝑁
  • 21. Covariance (Cont.) • In (a), an upward sloping line best describes the points, indicating a positive covariance. • In (b), the downward sloping line implies a negative covariance. • In (c), the line has 0 slope, which means a covariance of 0. x y (a) Positive x y (b) Negative x y (c) Zero
  • 22. Basics of Data • Statistics is the science of data. • Data consist of the facts or figures that are the subject of summarisation, analysis, modelling, and presentation. • A dataset is a collection of data with some common connection. For instance, the GDP of European countries from 2010 to 2020. • A variable is a particular characteristic of interest within a group of observations. For instance, the GDP of Germany. • An observation (observational unit or case) is a particular value comprising a variable. An example can be the GDP of Germany in 2020. Correlation
  • 23. Correlation • The coefficient is a standardized measure (no units) and takes on values between −1 and +1. • Values near −1 suggest a strong negative linear relationship. • Values near +1 suggest a strong positive linear relationship. • If the datasets are samples, the coefficient is denoted by 𝑟𝑥𝑦, as follows: 𝑟𝑥𝑦 = Cov 𝑥, 𝑦 𝑠𝑥𝑠𝑦 • If the datasets are populations, the coefficient is denoted by 𝑟𝑥𝑦, as follows: 𝑟𝑥𝑦 = Cov 𝑥, 𝑦 𝜎𝑥𝜎𝑦
  • 24. Correlation (Cont.) • The formula 𝑟𝑥𝑦 = 𝐶𝑜𝑣 𝑥,𝑦 𝑠𝑥𝑠𝑦 may be alternatively understood as follows: 𝑟𝑥𝑦 = Amount that 𝑥 and 𝑦 vary together Total variability in 𝑥 and 𝑦 • Correlation measures the strength of the relationship between two variables. • It aims answering the following question: ▪ When 𝑥 gets larger, does 𝑦 consistently get larger (or smaller)? • Often measured with Pearson’s correlation coefficient: ▪ Commonly called “correlation coefficient” or even only “correlation” ▪ Almost always represented with the letter 𝑟
  • 29. Scatter Diagram • Also known as scatter plot or 𝑥-𝑦 graph. • Scatter diagram graphs pair numerical data, with one variable on each axis, to verify a relationship between them. • If the variables are correlated, the points will fall along a line or curve. • The better the correlation, the tighter the points will hug the line. Source: http://www.tylervigen.com/spurious-correlations
  • 30. Scatter Diagram: When to Use • When we have paired numerical data • When the dependent variable may have multiple values for each value of the independent variable. • When trying to determine whether the two variables are related, such as: ▪ When trying to identify potential root causes of problems. ▪ After considering causes and effects to determine objectively whether a particular cause and effect are related. ▪ When determining whether two effects that appear to be related both occur with the same cause. ▪ When checking for autocorrelation. Source: https://asq.org/quality-resources/scatter-diagram
  • 31. Scatter Diagram Considerations • Even if the scatter diagram shows a relationship, do not assume that one variable caused the other. Both may be influenced by a third variable. • When the data are plotted, the more the diagram resembles a straight line, the stronger the relationship. • If a line is not clear, statistical measures determine whether there is reasonable certainty that a relationship exists. • If the statistics say that no relationship exists, the pattern could have occurred by random chance. • If the diagram shows no relationship, consider whether the independent (𝑥-axis) variable has been varied widely. • Sometimes a relationship is not apparent because the data do not cover a wide enough range. Source: https://asq.org/quality-resources/scatter-diagram
  • 32. Examples: Scatter Diagram Source: https://techqualitypedia.com/scatter-diagram/
  • 33. Defining Statistics • Statistics has the power to turn raw data into information which may effectively support the decision-making process. In fact, this is the life’s blood of any modern business. • It is a crucial linking the chain connecting data to information, information to knowledge, and knowledge to action/ decision. They are all part of an informed decision process. • “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write” (H. G. Wells) Statistics is the art and science of collecting, analysing, interpreting and presenting data aiming at transforming that data into useful information Normal Distribution
  • 34. Distributions • A distribution is simply a collection of data or scores (e.g. z-score, t score) of a variable. • The values of a distribution are commonly ordered (e.g. from smallest to largest). • Distributions are commonly depicted using data visualisation tools (e.g. charts). • A probability distribution is a mathematical function that calculates the probability of possible outcomes. • Real-world data can or cannot follow a particular established theoretical distribution (i.e. theoretical distribution vs data distribution).
  • 35. A Simplified Map of Popular Distributions Source: https://medium.com/mytake/understanding-different-types-of-distributions-you-will-encounter-as-a-data-scientist-27ea4c375eec
  • 36. Normal Distribution • The Normal (also know and Gaussian) probability distribution is the most important distribution for describing a continuous random variable in statistics. • It plays a crucial role in the theory of sampling and is widely used in statistical inference. • Many natural phenomena have patterns that resemble the normal distribution (e.g. body weight, shoe size, IQ, etc). • Many statistics are based on the assumption of normality.
  • 37. • In terms of parameters, the Normal distribution contains a mean 𝜇 and a variance 𝜎2 (and, consequently, standard deviation 𝜎), determining the centre and width of the distribution. • The highest point on the Normal curve is at the mean, which is also the median and the mode. • The standard deviation determines the width of the curve. Thus, larger values result in wider, flatter curves. • The Normal curve is symmetric. Therefore, 0.5 probability to the left of the mean and 0.5 probability to the right. Normal Distribution (Cont.)
  • 38. Graph of the Normal Distribution • The shape of the Normal distribution resembles a shape of a bell (i.e. bell- shaped curve). 𝜇 = mean x f(x)
  • 39. Standard Normal Distribution • A random variable that has a normal distribution with a mean of zero and a standard deviation of one is said to have a standard normal probability distribution. • The letter z is commonly used to designate the standard normal random variable. More specifically, a z-score. • We calculate z-scores for a Normal distribution as follows: 𝑧 = 𝑥 − 𝜇 𝜎 • Intuitively, we may think of z as a measure of the number of standard deviations that 𝑥 is distant from 𝜇.
  • 40. Normal Table Applications • We may use the standard Normal distribution in two ways, namely forward and in reverse. • Forward: ▪ For a given data value 𝑥, calculate 𝑧 and find the probability, or area, associated with 𝑧. • In reverse: ▪ For a given probability or area, find 𝑧 and then calculate the data value 𝑥 associated with that area using the following formula: 𝑥 = 𝜇 + 𝑧𝜎
  • 41. Standard Normal Distribution (𝒛 of +1.2) 0 z 0.8849 0.8849 1.2
  • 42. Standard Normal Table 𝑧 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 1.0 0.841 0.844 0.846 0.849 0.851 0.853 0.855 0.858 0.86 0.862 1.1 0.864 0.867 0.869 0.871 0.873 0.875 0.877 0.879 0.881 0.883 1.2 0.885 0.887 0.889 0.891 0.893 0.894 0.896 0.898 0.9 0.902 1.3 0.903 0.905 0.907 0.908 0.91 0.912 0.913 0.915 0.916 0.918 1.4 0.919 0.921 0.922 0.924 0.925 0.927 0.928 0.929 0.931 0.932 1.5 0.933 0.935 0.936 0.937 0.938 0.939 0.941 0.942 0.943 0.944 1.6 0.945 0.946 0.947 0.948 0.95 0.951 0.952 0.953 0.954 0.955 1.7 0.955 0.956 0.957 0.958 0.959 0.96 0.961 0.962 0.963 0.963 1.8 0.964 0.965 0.966 0.966 0.967 0.968 0.969 0.969 0.97 0.971 1.9 0.971 0.972 0.973 0.973 0.974 0.974 0.975 0.976 0.976 0.977 • This means that 0.885 (or 88.5%) of the data falls between 𝑧 = −∞ and 𝑧 = 1.2.
  • 43. Standard Normal Distribution (𝒛 of +1.2) 0 z 0.8849 0.8849 1.2 • It also means that 1 − 0.885 (or 11.5%) of the data falls between 𝑧 = 1.2 and 𝑧 = +∞ (i.e. orange area in the figure above).
  • 44. Traditional Standard Normal Table Source: https://itfeature.com/statistical-tables/standard-normal-table
  • 45. Rule 68-95-99.7 (or Empirical Rule) • The Rule 68-95-99.7 (or Empirical Rule) is applied to remember the percentage of values that lie within an interval estimate of the Normal distribution. • This rule works only with the Normal distribution. • Approximately 68.3% of the data values will be within one standard deviation of the mean. • Approximately 95.5% of the data values will be within two standard deviations of the mean. • Approximately 99.7% (i.e. almost all) of the data values will be within three standard deviations of the mean.
  • 46. Rule 68-95-99.7 (or Empirical Rule) Source: https://graphworkflow.com/eda/normality/
  • 47. Basics of Data • Statistics is the science of data. • Data consist of the facts or figures that are the subject of summarisation, analysis, modelling, and presentation. • A dataset is a collection of data with some common connection. For instance, the GDP of European countries from 2010 to 2020. • A variable is a particular characteristic of interest within a group of observations. For instance, the GDP of Germany. • An observation (observational unit or case) is a particular value comprising a variable. An example can be the GDP of Germany in 2020. Central Limit Theorem (CLT)
  • 48. • This is one of the most important theorems in statistics. • A sample size of 𝑛 ≥ 30 is considered large. • Whenever the population has a normal distribution, the sampling distribution of the sample mean has a normal distribution for any sample size. As sample size increases, the sampling distribution of the sample mean rapidly approaches the bell shape of a normal distribution, regardless of the shape of the parent population. In small sample cases (𝑛 < 30), the sampling distribution of sample mean will be normal so long as the parent population is normal. Central Limit Theorem (CLT)
  • 49. [Figure] Population shapes and the sampling distribution of the sample mean for n = 2, n = 5, and n = 30. Central Limit Theorem (CLT) (Cont.)
  • 50. [Figure] Sampling distribution of the sample mean for samples of size n = 2, n = 8, and n = 20 selected from the same population. Sampling Distribution of the Sample Mean
  • 51. [Figure] Shape of the sampling distribution of the sample mean when the sample size is large (n > 30).
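The CLT can be illustrated by simulation. A hedged Python sketch (the module uses R; the exponential parent population below is an arbitrary choice of a clearly non-normal, right-skewed shape):

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

# Parent population: exponential with mean 1 and sd 1 (strongly right-skewed).
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 30  # "large" by the n >= 30 rule of thumb
means = [sample_mean(n) for _ in range(10_000)]

# CLT prediction: the sample means centre on the population mean (1.0)
# with standard deviation sigma / sqrt(n) = 1 / sqrt(30), about 0.183.
print(statistics.fmean(means))
print(statistics.stdev(means))
```

A histogram of `means` would show the bell shape of the figures above, even though the parent population looks nothing like a bell curve.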
  • 52. Law of Large Numbers (LLN)
  • 53. Law of Large Numbers (LLN) • The law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. • According to the LLN, the mean of the results obtained from a large number of trials should be close to the expected value, and it tends towards the expected value as more trials are performed. • The LLN is relevant because it guarantees stable long-term results for the averages of some random events. • For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
  • 54. Law of Large Numbers (LLN) (Cont.) • Any winning streak by a player will eventually be overcome by the parameters of the game. • Importantly, the LLN only applies when a large number of observations is considered. • There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be “balanced” by the others (e.g. gambler’s fallacy). • The LLN only applies to the mean value, as follows: lim_{n→∞} ( (1/n) ∑_{i=1}^{n} X_i − X̄ ) = 0, where X̄ denotes the expected value. Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
  • 55. Law of Large Numbers (LLN) (Cont.) • An illustration of the law of large numbers using a particular run of rolls of a single die. • As the number of rolls in this run increases, the average of the values of all the results approaches 3.5. • Although each run would show a distinctive shape over a small number of throws (at the left), over a large number of rolls (to the right) the shapes would be extremely similar. Source: https://en.wikipedia.org/wiki/Law_of_large_numbers
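The die-rolling illustration above is easy to reproduce. A small Python sketch (illustrative only; the module itself uses R):

```python
import random

random.seed(7)  # fixed seed for a reproducible run
rolls = [random.randint(1, 6) for _ in range(100_000)]

# Running averages drift towards the expected value E[X] = 3.5 as n grows:
# early averages wander, but by 100,000 rolls the average is very close to 3.5.
for n in (10, 1_000, 100_000):
    print(n, sum(rolls[:n]) / n)
```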
  • 56. Statistical Hypothesis Testing
  • 57. • In hypothesis testing, a statement – call it a hypothesis – is made about some characteristic of a particular population. • A sample is then taken in an effort to establish whether or not the statement is true. • If the sample produces results that would be highly unlikely under an assumption that the statement is true, then we’ll conclude that the statement is false. ▪ The null hypothesis H0 is the statement to be tested. ▪ The alternative hypothesis Ha is the opposite of what is stated in the null hypothesis. The Nature of Hypothesis Testing
  • 58. • The status quo or “if-it’s-not-broken-don’t-fix-it” approach: • Here the status quo (no change) position serves as the null hypothesis. • Compelling sample evidence to the contrary would have to be produced before we’d conclude that a change in prevailing conditions has occurred. • This approach usually involves a decision that needs to be made if the null hypothesis is rejected. • Example: • H0: The machine continues to function properly. • Ha: The machine is not functioning properly. Establishing the Hypotheses
  • 59. • The skeptic’s approach: • Here, in order to test claims of “new and improved” or “better than” or “different from” what is currently the case, the null hypothesis would reflect the skeptic’s view which essentially says that “new is no better than old.” • This testing is essentially proof by contradiction. • Example: • H0: A proposed new headache remedy is no faster than other commonly used treatments. • Ha: A proposed new headache remedy is faster than other commonly used treatments. Establishing the Hypotheses (Cont.)
  • 60. Standard Forms for the Null and Alternative Hypotheses A hypothesis test for a population mean μ will take one of the following three forms (where A represents the boundary value for the null position): H0: μ ≥ A, Ha: μ < A (one-tailed, lower tail); H0: μ ≤ A, Ha: μ > A (one-tailed, upper tail); H0: μ = A, Ha: μ ≠ A (two-tailed). Establishing the Hypotheses (Cont.)
  • 61. Developing a One-tailed Test
  • 62. • If we’re going to set a boundary to separate “likely” from “unlikely” sample results in the null sampling distribution, we’ll need to define just what is meant by “unlikely.” • The value we choose, most commonly .05 or 5%, we’ll label α and refer to it as the significance level of the test. A significance level α is the probability value that defines just what we mean by unlikely sample results under an assumption that the null hypothesis is true (as an equality). Choosing a Significance Level
  • 63. • Use the standard normal table to find the z value with an area of α in the lower (or upper) tail of the distribution. • The value of z that establishes the boundary of the reject-H0 region is called the critical value for the test, zc. • To conduct the test, calculate the test statistic, zstat. • Decision rule (one-tail): • Lower tail: Reject H0 if zstat < -zc • Upper tail: Reject H0 if zstat > zc Establishing a Decision Rule
  • 64. [Figure] One-tailed (lower) test about a population mean: for α = 0.05, the critical value is zc = −1.65; reject H0 in the lower tail (zstat < −1.65), do not reject otherwise.
  • 65. [Figure] One-tailed (upper) test about a population mean: for α = 0.05, the critical value is zc = 1.65; reject H0 in the upper tail (zstat > 1.65), do not reject otherwise.
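The critical-value decision rule can be put together numerically. The numbers below (μ0 = 500, σ = 20, n = 36, x̄ = 494) are made up for illustration and do not come from the slides; Python stands in for the module's R:

```python
from math import sqrt

# Lower-tail test of H0: mu >= 500 against Ha: mu < 500 at alpha = 0.05.
mu0, sigma, n = 500.0, 20.0, 36   # hypothetical population claim and sample size
x_bar = 494.0                     # hypothetical observed sample mean

z_stat = (x_bar - mu0) / (sigma / sqrt(n))  # standardised sample result
z_c = -1.65                                 # lower-tail critical value from the slides
print(round(z_stat, 2))   # -1.8
print(z_stat < z_c)       # True -> reject H0
```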
  • 66. • Failing to reject a null hypothesis shouldn’t be taken to mean that we necessarily agree that the claim is true. • We’re simply concluding that there’s not enough sample evidence to convince us that it’s false. • It’s for this reason we’ve chosen to use the phrase “fail to reject” rather than “accept” the claim. • The court system gives us a good example of this distinction. Failing to convict a defendant doesn’t necessarily mean that the jury believes the defendant is innocent. It simply means that, in the jury’s judgment, there’s not strong enough evidence to convince them to reject that possibility. Accepting vs Failing to Reject the Null Hypothesis
  • 67. p-values
  • 68. The p-value can be used to make the decision in a hypothesis test. The p-value measures the probability that, if the null hypothesis is true (as an equality), we would randomly produce a sample result at least as unlikely as the sample result that we actually produce. p-value Decision Rule: If the p-value is less than α, reject the null hypothesis. p-values
  • 69. Step 1: State the null and alternative hypotheses. Step 2: Choose a test statistic and a significance level for the test. Step 3: Compute the value of the test statistic from your sample data. Step 4: Apply the appropriate decision rule and make your decision. Critical value version: Use the significance level to establish the critical value for the test statistic. If the test statistic is beyond the critical value, reject the null hypothesis. p-value version: Use the test statistic to determine the p-value for the sample result. If the p-value is less than α, the significance level of the test, reject the null hypothesis. Generalising the Test Procedure
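The p-value version of the decision can be computed directly from the standard normal CDF. A hedged Python sketch (the test statistic of −1.8 is hypothetical, not a slide result):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z_stat = -1.8   # hypothetical lower-tail test statistic
alpha = 0.05

p_value = phi(z_stat)      # lower-tail test: area to the left of z_stat
print(round(p_value, 4))   # 0.0359
print(p_value < alpha)     # True -> reject H0
```

Both versions always agree: the p-value falls below α exactly when the test statistic falls beyond the critical value.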
  • 70. Type I and II Errors
  • 71. Whenever we make a judgment about a population parameter based on sample information, there’s a chance we could be wrong. In hypothesis testing, in fact, we can identify two types of potential errors. • Type I Error: Rejecting a true null hypothesis. • Type II Error: Accepting a false null hypothesis. α, the significance level, measures the maximum probability of making a Type I error. The Possibility of Error
  • 72. • Type I Error: In hypothesis testing, we control the risk of making a Type I error when we set the value of α. • Type II Error: Measuring and controlling the risk of making a Type II error, denoted by β, is more difficult. • Statisticians avoid the risk of making a Type II error by using “do not reject H0” instead of “accept H0”. • Choosing a Significance Level: If the cost of a Type I error is high, we’ll want to use a relatively small α in order to keep the risk low. The Possibility of Error (Cont.)
  • 73. Source 1: https://stats.stackexchange.com/questions/471603/type-1-and-type-2-error Source 2: https://corporatefinanceinstitute.com/resources/knowledge/other/hypothesis-testing/ Visualisation: Type I and II Errors
  • 74. Two-tailed Tests
  • 75. Hypotheses H0: μ = A Ha: μ ≠ A • Level of significance: need to split into two areas and put α/2 in each tail • Critical values: have both an upper and lower value, zcu and zcl • Test statistic: same as before • p-value: once found in the normal table, multiply by 2 to get the correct value • Rejection rule: Reject H0 if |zstat| > zα/2 or p-value < α Two-tailed Tests
  • 76. [Figure] Two-tailed test: rejection regions of area α/2 in each tail, bounded by the critical values zcL and zcU. Two-tailed Tests (Cont.)
  • 77. • Form of hypotheses: H0: μ = A Ha: μ ≠ A • We can conduct a two-tailed test of a population mean simply by constructing a confidence interval around the mean of a sample. • If the confidence interval contains the hypothesized value for μ, do not reject H0. Otherwise, reject H0. Two-tailed Tests and Interval Estimation
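Doubling the tail area, as described above, sketched in Python (the test statistic of 2.1 is hypothetical; the module itself uses R):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z_stat = 2.1    # hypothetical two-tailed test statistic
alpha = 0.05

# Two-tailed p-value: the one-tail area beyond |z_stat|, multiplied by 2.
p_value = 2.0 * (1.0 - phi(abs(z_stat)))
print(round(p_value, 4))   # 0.0357
print(p_value < alpha)     # True -> reject H0
```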
  • 78. • Test statistic when s replaces σ: tstat = (x̄ − μ) / (s / √n). This test statistic has a t distribution with n − 1 degrees of freedom (used for small samples). • Rejection rule: One-tailed: (1) H0: μ ≤ A: Reject H0 if tstat > tc; (2) H0: μ ≥ A: Reject H0 if tstat < −tc. Two-tailed: (3) H0: μ = A: Reject H0 if |tstat| > tα/2. Using the t Distribution
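The t statistic formula above, evaluated on a hypothetical small sample (the six data points and μ0 = 10 are made up for illustration; Python's statistics module stands in for R):

```python
from math import sqrt
import statistics

data = [9.1, 10.4, 9.8, 10.9, 8.7, 9.5]   # hypothetical small sample (n = 6)
mu0 = 10.0                                 # hypothesised population mean

n = len(data)
x_bar = statistics.fmean(data)
s = statistics.stdev(data)                 # sample sd, n - 1 in the divisor

t_stat = (x_bar - mu0) / (s / sqrt(n))     # has n - 1 = 5 degrees of freedom
print(round(t_stat, 3))   # -0.8
```

This t_stat would then be compared against a critical value from a t table with 5 degrees of freedom, exactly as in the z case but with the wider t distribution.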
  • 79. • The t distribution table in most statistics books does not have sufficient detail to determine the exact p-value for a hypothesis test. • We could use the t distribution table to identify an approximate p-value. • Computer software packages can provide the exact p-value for the t distribution. p-values and the t Distribution
  • 80. • Relying only on data summaries may be misleading. Always visually explore your data. • Widely used measures of association include covariance and correlation. • Correlation does not imply causation! • The Normal distribution plays a crucial role in the theory of sampling and is widely used in statistical inference. • The Rule 68-95-99.7 is used to remember the percentage of values that lie within an interval estimate of the Normal distribution. • The Central Limit Theorem (CLT) establishes that, if the sample size is large enough, the sampling distribution of the sample mean will be approximately Normal. • According to the Law of Large Numbers (LLN), as the sample size increases, the sampling error tends to decrease. • The p-value measures the probability that, if the null hypothesis H0 is true, we would randomly produce a sample result at least as unlikely as the sample result that we actually produce. • If the p-value is less than 𝛼, we then reject the null hypothesis. • A Type I Error occurs when a true null hypothesis is rejected, and a Type II Error occurs when a false null hypothesis is accepted. Takeaways
  • 81. References • Brooks, C. (2019). Introductory Econometrics for Finance. Cambridge University Press. • Evans, J. R., Olson, D. L., & Olson, D. L. (2007). Statistics, Data Analysis, and Decision Modeling. New Jersey: Pearson/Prentice Hall. • Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business Statistics. Wiley Global Education. • Render, B., Stair Jr, R. M., Hanna, M. E., & Hale, T. S. (2018). Quantitative Analysis for Management, 13e. Prentice Hall.
  • 82. Any Questions?
  • 83. Thank You!