Descriptive statistics

Descriptive Statistics
• Descriptive statistics are used to describe the
basic features of the data in a study.
• They provide simple summaries about the
sample and the measures.
• Descriptive statistics are typically
distinguished from inferential statistics.
• With descriptive statistics you are simply
describing what is or what the data shows.
• With inferential statistics, you are trying to
reach conclusions that extend beyond the
immediate data alone.

• We use descriptive statistics simply to
describe what's going on in our data.
• Descriptive Statistics are used to present
quantitative descriptions in a manageable form.
• Descriptive statistics help us to simplify large
amounts of data in a sensible way.
• Descriptive statistics aim to summarize
a sample, rather than use the data to learn about
the population that the sample of data is thought
to represent.

• Even when a data analysis draws its main
conclusions using inferential statistics, descriptive
statistics are generally also presented.
• For example, in papers reporting on human
subjects, typically a table is included giving the
overall sample size, sample sizes in important
subgroups (e.g., for each treatment or exposure
group), and demographic or clinical characteristics
such as the average age, the proportion of subjects
of each sex, the proportion of subjects with
related comorbidities, etc.

• Some measures that are commonly used to
describe a data set are measures of
• central tendency and
• variability.
• Measures of central tendency include
the mean, median and mode.
• Measures of variability include the standard
deviation (or variance) and the minimum and
maximum values of the variables.
• Kurtosis and skewness are often reported
alongside them; they describe the shape of the
distribution.

Measures of Central Tendency
1. Mean
2. Median
3. Mode
Measures of Variability
1. Range
2. Variance
3. Quartile
4. Standard Deviation

Measures of Central Tendency
Introduction
• A measure of central tendency is a single
value that attempts to describe a set of data by
identifying the central position within that set of
data.
• Measures of central tendency are
sometimes called measures of central location.
• They are also called summary statistics.

Introduction
• The mean (often called the average) is probably
the measure of central tendency that you
are most familiar with, but there are others, such
as the median and the mode.
• The mean, median and mode are all valid
measures of central tendency, but under different
conditions, some measures of central tendency
become more appropriate to use than others.

Mean (Arithmetic)
• The mean (or average) is the most
popular and well-known measure of central
tendency.
• It can be used with both discrete and
continuous data, although it is most
often used with continuous data.
• The mean is equal to the sum of all the
values in the data set divided by the number
of values in the data set.

Mean (Arithmetic)
• If we have n values in a data set and
they have values x1, x2, ..., xn, the sample
mean, usually denoted by x̄ (pronounced "x
bar"), is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Mean (Arithmetic)
• This formula is usually written in a
slightly different manner using the Greek
capital letter Σ, pronounced "sigma",
which means "sum of...":
$$\bar{x} = \frac{\sum x}{n}$$

• Why have we called it a sample mean?
This is because, in statistics, samples and
populations have very different meanings
and these differences are very important,
even if, in the case of the mean, they are
calculated in the same way.
• To acknowledge that we are calculating
the population mean and not the sample
mean, we use the Greek lower case letter
"mu", denoted as µ:
$$\mu = \frac{\sum x}{n}$$
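The calculation described above can be sketched in a few lines of Python. This is only an illustration: the data are the marks used in the median example later in these slides, and the variable names are assumptions made for the sketch.

```python
from statistics import mean

# Marks reused from the median example in these slides (illustration only).
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Sum of all the values divided by the number of values.
manual_mean = sum(marks) / len(marks)

# The standard library function performs the same calculation.
library_mean = mean(marks)

print(manual_mean, library_mean)
```

The same arithmetic applies whether the values are treated as a sample or as a whole population; only the symbol (x̄ or µ) changes.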

Median
• The median is the middle score for a
set of data that has been arranged in order
of magnitude.
• The median is less affected by outliers
and skewed data than the mean. In order to
calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92

Median
• We first need to rearrange that data
into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
• Our median mark is the middle mark, in
this case 56 (the sixth value). It is the
middle mark because there are 5 scores
before it and 5 scores after it.
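The sort-then-pick-the-middle procedure can be sketched in Python as follows. The variable names are assumptions for illustration, and the manual indexing shown here only works for an odd number of values, as in this example.

```python
from statistics import median

# The marks from the slide, in their original order.
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Step 1: arrange the data in order of magnitude (smallest first).
ordered = sorted(marks)

# Step 2: take the middle value. With 11 values the middle index is 5,
# leaving 5 scores before it and 5 scores after it.
middle_index = len(ordered) // 2
manual_median = ordered[middle_index]

# The standard library function agrees with the manual result.
library_median = median(marks)

print(ordered)
print(manual_median, library_median)
```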

Mode
• The mode is the most frequent score in
our data set.
• On a bar chart or histogram it corresponds
to the highest bar.
• You can, therefore, sometimes consider
the mode as being the most popular option.

Mode
An example of a mode is presented below:

Mode
Normally, the mode is used for categorical data where we wish to
know which is the most common category, as illustrated below:

Mode
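• A problem with the mode is that it is not always unique. When
two or more values share the highest frequency, we are stuck as
to which mode best describes the central tendency of the data.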
• This is particularly problematic when we have continuous
data because we are more likely not to have any one value that is
more frequent than the other.
• For example, consider measuring the weight of 30 people (to
the nearest 0.1 kg). How likely is it that we will find two or more
people with exactly the same weight (e.g., 67.4 kg)? The answer is
that it is probably very unlikely. Many people might be close, but
with such a small sample (30 people) and a large range of possible
weights, you are unlikely to find two people with exactly the same
weight to the nearest 0.1 kg. This is why the mode is very rarely
used with continuous data.
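Both problems can be seen in a short Python sketch. The marks are the ones from the median example in these slides; the list of weights is hypothetical data invented for illustration, and `multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# Marks from the median example: 55 and 56 each occur twice.
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Count how often each mark occurs.
counts = Counter(marks)

# multimode returns every value that shares the highest frequency.
modes = multimode(marks)

# Hypothetical weights in kg (to the nearest 0.1 kg); no value repeats.
weights = [67.4, 71.2, 58.9, 80.3, 64.5, 75.1]

# When every value occurs once, every value is returned as a "mode",
# which tells us nothing about the centre of the data.
weight_modes = multimode(weights)

print(modes)
print(weight_modes)
```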

Summary of when to use the mean, median and mode
• The following table summarizes the best measure of central
tendency for each type of variable.
Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

Measures of Variability (Spread or Dispersion)
• These are ways of summarizing a group of data by
describing how spread out the scores are.
• For example, the mean score of our 100 students may
be 65 out of 100. However, not all students will have
scored 65 marks. Rather, their scores will be spread out.
• Some will be lower and others higher.
• Measures of spread help us to summarize how spread
out these scores are.
• To describe this spread, a number of statistics are
available to us, including the range, quartiles, absolute
deviation, variance and standard deviation.

• Variability is the extent to which data points in a
statistical distribution or data set diverge from the
average, or mean, value as well as the extent to which
these data points differ from each other.

• The simplest measure of dispersion is the range.
• This tells us how spread out our data is.
• In order to calculate the range, you subtract the
smallest number from the largest number. Just like the
mean, the range is very sensitive to outliers.
• The variance is a measure of the average squared
distance of the data values from their mean.
• The variance is not a stand-alone statistic.
• It is typically used in order to calculate other
statistics, such as the standard deviation.
• The higher the variance, the more spread out your
data are.
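The range calculation described above, and its sensitivity to outliers, can be sketched in Python. The marks are reused from the median example in these slides; the extra value of 250 is a hypothetical outlier added purely for illustration.

```python
# Marks reused from the median example in these slides.
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Range: subtract the smallest number from the largest number.
data_range = max(marks) - min(marks)

# Add one hypothetical extreme value to show how the range reacts.
marks_with_outlier = marks + [250]
range_with_outlier = max(marks_with_outlier) - min(marks_with_outlier)

print(data_range, range_with_outlier)
```

A single extreme value changes the range substantially, which is why the range is usually reported together with other measures of spread.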

• There are four steps to calculate the variance:
1. Calculate the mean.
2. Subtract the mean from each data value. This
tells you how far each value lies from the mean.
3. Square each of these differences so that you now
have all positive values, then find the sum of the
squares.
4. Divide the sum of the squares by the total
number of values in the set. (This gives the
population variance; for a sample variance, divide
by one less than the number of values.)
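The four steps can be followed one at a time in Python. This is a minimal sketch using the marks from the median example in these slides; the variable names are assumptions. Dividing by the number of values gives the population variance, and the sample variance (dividing by one less) is shown alongside for comparison.

```python
from statistics import pvariance, variance

# Marks reused from the median example in these slides.
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Step 1: calculate the mean.
mean_value = sum(marks) / len(marks)

# Step 2: subtract the mean from each data value.
deviations = [x - mean_value for x in marks]

# Step 3: square each difference and sum the squares.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by the number of values (population variance).
population_variance = sum_of_squares / len(marks)

# Variant: divide by n - 1 instead for the sample variance.
sample_variance = sum_of_squares / (len(marks) - 1)

# The standard library provides both versions.
print(population_variance, pvariance(marks))
print(sample_variance, variance(marks))
```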

• The standard deviation is the most popular measure
of dispersion.
• It describes the typical distance of the data
values from the mean.
• Like the variance, the higher the standard deviation,
the more spread out your data are.
• Unlike the variance, the standard deviation is
measured in the same unit as the original data, which
makes it easier to interpret.
• It is calculated by finding the square root of the
variance.
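As a short sketch of this last step, the standard deviation can be obtained in Python by taking the square root of the variance. The marks are reused from the median example in these slides, and the population form of the variance (dividing by the number of values) is assumed here.

```python
import math
from statistics import pstdev

# Marks reused from the median example in these slides.
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Population variance: average squared distance from the mean.
mean_value = sum(marks) / len(marks)
population_variance = sum((x - mean_value) ** 2 for x in marks) / len(marks)

# Standard deviation: the square root of the variance,
# expressed in the same unit as the original marks.
population_sd = math.sqrt(population_variance)

# The standard library function gives the same value.
print(population_sd, pstdev(marks))
```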
