Fundamentals of some Basic Statistical Definitions

Basic Definitions

Statistical Inference

Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

Experiment

An experiment is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.

Example

Before introducing a new drug treatment to reduce high blood pressure, the manufacturer carries out an experiment to compare the effectiveness of the new drug with that of one currently prescribed. Newly diagnosed subjects are recruited from a group of local general practices. Half of them are chosen at random to receive the new drug, the remainder receiving the present one. So, the researcher has control over the type of subject recruited and the way in which they are allocated to treatment.

Experimental (or Sampling) Unit

A unit is a person, animal, plant or thing which is actually studied by a researcher; the basic objects upon which the study or experiment is carried out. For example, a person; a monkey; a sample of soil; a pot of seedlings; a postcode area; a doctor's practice.

Sample

A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.

A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling. Also, before collecting the sample, it is important that the researcher carefully and completely defines the population, including a description of the members to be included.

Example

The population for a study of infant health might be all children born in the UK in the 1980s. The sample might be all babies born on 7th May in any of the years.

Parameter

A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average value of a quantity.

Within a population, a parameter is a fixed value which does not vary. Each sample drawn from the population has its own value of any statistic that is used to estimate this parameter. For example, the mean of the data in a sample is used to give information about the overall mean in the population from which that sample was drawn.

Parameters are often assigned Greek letters (e.g. σ), whereas statistics are assigned Roman letters (e.g. s).
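
The difference between a fixed parameter and a statistic that varies from sample to sample can be illustrated with a small simulation. This is only a sketch: the population (the integers 1 to 100), the sample size, the number of samples and the seed are all assumptions chosen for illustration, and `draw_sample_means` is a hypothetical helper name.

```python
import random
import statistics

# Hypothetical population: the integers 1..100, so the parameter
# (the population mean) is known exactly and never changes.
population = list(range(1, 101))
population_mean = statistics.mean(population)  # 50.5


def draw_sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


sample_means = draw_sample_means(population, sample_size=10, n_samples=1000)

print("parameter (population mean):", population_mean)
print("first five statistics (sample means):", sample_means[:5])
print("average of all sample means:", statistics.mean(sample_means))
```

Each printed sample mean is a different statistic, while the population mean stays at 50.5 throughout.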

K.MANOJ.M.Sc.,M.phil.,D.C.A.,                                                       Page 1

Statistic

A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.

It is possible to draw more than one sample from the same population and the value of a statistic will in general vary from sample to sample. For example, the average value in a sample is a statistic. The average values in more than one sample, drawn from the same population, will not necessarily be equal.

Statistics are often assigned Roman letters (e.g. m and s), whereas the equivalent unknown values in the population (parameters) are assigned Greek letters (e.g. µ and σ).

Sampling Distribution

The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.

The sampling distribution is the probability distribution or probability density function of the statistic.

Derivation of the sampling distribution is the first step in calculating a confidence interval or carrying out a hypothesis test for a parameter.

Example

Suppose that x1, ..., xn are a simple random sample from a normally distributed population with expected value µ and known variance σ². Then the sample mean x̄ is a statistic used to give information about the population parameter µ; x̄ is normally distributed with expected value µ and variance σ²/n.

Estimate

An estimate is an indication of the value of an unknown quantity based on observed data.

More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.

Example

Suppose the manager of a shop wanted to know the mean expenditure of customers in her shop in the last year. She could calculate the average expenditure of the hundreds (or perhaps thousands) of customers who bought goods in her shop, that is, the population mean. Instead she could use an estimate of this population mean by calculating the mean of a representative sample of customers. If this value was found to be £25, then £25 would be her estimate.

Estimator

An estimator is any quantity calculated from the sample data which is used to give information about an unknown quantity in the population. For example, the sample mean is an estimator of the population mean.

Estimators of population parameters are sometimes distinguished from the true value by using the symbol 'hat'. For example,


       σ = true population standard deviation
       σ̂ = estimated (from a sample) population standard deviation

Example

The usual estimator of the population mean is

       µ̂ = (X1 + X2 + X3 + ... + Xn) / n

where n is the size of the sample and X1, X2, X3, ..., Xn are the values of the sample.

If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.

Estimation

Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population.

Results of estimation can be expressed as a single value, known as a point estimate, or a range of values, known as a confidence interval.

Discrete Data

A set of data is said to be discrete if the values / observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Compare continuous data.

Categorical Data

A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category. Each value is chosen from a set of non-overlapping categories. For example, shoes in a cupboard can be sorted according to colour: the characteristic 'colour' can have non-overlapping categories 'black', 'brown', 'red' and 'other'. People have the characteristic of 'gender' with categories 'male' and 'female'.

Categories should be chosen carefully since a bad choice can prejudice the outcome of an investigation. Every value should belong to one and only one category, and there should be no doubt as to which one.

Nominal Data

A set of data is said to be nominal if the values / observations belonging to it can be assigned a code in the form of a number where the numbers are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0, females as 1; marital status of an individual could be coded as Y if married, N if single.

Ordinal Data

A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.


The categories for an ordinal set of data have a natural order. For example, suppose a group of people were asked to taste varieties of biscuit and classify each biscuit on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A rating of 5 indicates more enjoyment than a rating of 4, so such data are ordinal.

However, the distinction between neighbouring points on the scale is not necessarily always the same. For instance, the difference in enjoyment expressed by giving a rating of 2 rather than 1 might be much less than the difference in enjoyment expressed by giving a rating of 4 rather than 3.

Interval Scale

An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. For example, the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The zero point, year 1 AD, is arbitrary; time did not begin then. Other examples of interval scales include the heights of tides, and the measurement of longitude.

Continuous Data

A set of data is said to be continuous if the values / observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. Examples include height, weight, temperature, the amount of sugar in an orange, and the time required to run a mile.

Compare discrete data.

Frequency Table

A frequency table is a way of summarising a set of data. It is a record of how often each value (or set of values) of the variable in question occurs. It may be enhanced by the addition of percentages that fall into each category.

A frequency table is used to summarise categorical, nominal, and ordinal data. It may also be used to summarise continuous data once the data set has been divided up into sensible groups.

When we have more than one categorical variable in our data set, a frequency table is sometimes called a contingency table because the figures found in the rows are contingent upon (dependent upon) those found in the columns.

Example

Suppose that in thirty shots at a target, a marksman makes the following scores:

       5 2 2 3 4   4 3 2 0 3   0 3 2 1 5
       1 3 1 5 5   2 4 0 0 4   5 4 4 5 5

The frequencies of the different scores can be summarised as:




       Score   Frequency   Frequency (%)
         0         4           13%
         1         3           10%
         2         5           17%
         3         5           17%
         4         6           20%
         5         7           23%

Pie Chart

A pie chart is a way of summarising a set of categorical data. It is a circle which is divided into segments. Each segment represents a particular category. The area of each segment is proportional to the number of cases in that category.

Example

Suppose that, in the last year, a sportswear manufacturer has spent 6.5 million pounds on advertising its products: 3 million has been spent on television adverts, 2 million on sponsorship, 1 million on newspaper adverts, and half a million on posters. This spending can be summarised using a pie chart:

[Figure: pie chart of advertising spending by category]

Bar Chart

A bar chart is a way of summarising a set of categorical data. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It displays the data using a number of rectangles, of the same width, each of which represents a particular category (for example, an age group or a religious affiliation). The length (and hence area) of each rectangle is proportional to the number of cases in the category it represents.

Bar charts are used to summarise nominal or ordinal data.

Bar charts can be displayed horizontally or vertically and they are usually drawn with a gap between the bars (rectangles), whereas the bars of a histogram are drawn immediately next to each other.
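
As a sketch, the marksman's frequency table can be recomputed from the thirty raw scores, with a row of `#` characters standing in for the bar of a horizontal bar chart. The function name `frequency_table` and the text-based bars are illustrative choices, not part of the original material.

```python
from collections import Counter

# The thirty scores from the marksman example, as printed in the text.
scores_text = "52234 43203 03215 13155 24004 54455"
scores = [int(ch) for ch in scores_text if ch.isdigit()]


def frequency_table(values):
    """Return (value, frequency, rounded percentage) rows in value order."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, counts[value], round(100 * counts[value] / total))
        for value in sorted(counts)
    ]


print("Score  Frequency  Frequency (%)")
for value, freq, pct in frequency_table(scores):
    # The run of '#' is a crude bar whose length equals the frequency.
    print(f"{value:>5}  {freq:>9}  {pct:>12}%  {'#' * freq}")
```

The bar lengths are proportional to the frequencies, which is the defining property of a bar chart described in the text.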





Dot Plot

A dot plot is a way of summarising data, often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form.

For nominal or ordinal data, a dot plot is similar to a bar chart, with the bars replaced by a series of dots. Each dot represents a fixed number of individuals. For continuous data, the dot plot is similar to a histogram, with the rectangles replaced by dots.

A dot plot can also help detect any unusual observations (outliers), or any gaps in the data set.

Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height.

The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (>100 observations), when stem and leaf plots become tedious to construct. A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.

Compare bar chart.

Stem and Leaf Plot

A stem and leaf plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient and easily drawn form.

A stem and leaf plot is similar to a histogram but is usually a more informative display for relatively small data sets (<100 data points). It provides a table as well as a picture of the data and from it we can readily write down the data in order of magnitude, which is useful for many statistical procedures, e.g. in the skinfold thickness example below:

[Figure: stem and leaf plot of skinfold thickness data]
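
A minimal sketch of how a stem and leaf display can be built, assuming two-digit whole-number data with the tens digit as the stem and the units digit as the leaf. The measurements below are made-up illustrative values, not the skinfold thickness data referred to in the text, and `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted values into {stem: [leaves]} using tens and units digits."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical measurements, for illustration only.
data = [23, 41, 12, 30, 28, 15, 23, 34, 21]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")

# Reading the display row by row gives the data back in order of magnitude.
ordered = [stem * 10 + leaf for stem in sorted(plot) for leaf in plot[stem]]
```

Because the leaves are stored in sorted order, the display doubles as an ordered listing of the data, as the text notes.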


We can compare more than one data set by the use of multiple stem and leaf plots. By using a back-to-back stem and leaf plot, we are able to compare the same characteristic in two different groups, for example, pulse rate after exercise of smokers and non-smokers.

Box and Whisker Plot (or Boxplot)

A box and whisker plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A box plot (as it is often called) is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

5-Number Summary

A 5-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set. It consists of 5 values: the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A 5-number summary can be represented in a diagram known as a box and whisker plot. In cases where we have more than one data set to analyse, a 5-number summary is constructed for each, with corresponding multiple box and whisker plots.

Outlier

An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others.
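
The 5-number summary described above can be sketched in a few lines of Python. The data are the 21 values used in this document's median example. Quartiles can be defined in several slightly different ways; the `method="inclusive"` option of `statistics.quantiles` used here is one such convention and is an assumption of this sketch, as is the helper name `five_number_summary`.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
        4, 6, 13, 34, 74, 65, 42, 28, 54, 69]

minimum, q1, median, q3, maximum = five_number_summary(data)
print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
```

These five values are exactly what a box and whisker plot draws: the whiskers reach the minimum and maximum, the box spans the quartiles, and a line inside the box marks the median.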



An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.

If an outlier is a genuine result, it is important because it might indicate an extreme of behaviour of the process under study. For this reason, all outliers must be examined carefully before embarking on any formal analysis. Outliers should not routinely be removed without further justification.

Symmetry

Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.

Symmetrical data sets:

   a. are easily interpreted;
   b. allow a balanced attitude to outliers, that is, those above and below the middle value (median) can be considered by the same criteria;
   c. allow comparisons of spread or dispersion with similar data sets.

Many standard statistical techniques are appropriate only for a symmetric distributional form. For this reason, attempts are often made to transform skewed data so that they become roughly symmetric.

Skewness

Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side.

For skewed data, the usual measures of location will give different values; for example, mode < median < mean would indicate positive (or right) skewness.

Positive (or right) skewness is more common than negative (or left) skewness.

If there is evidence of skewness in the data, we can apply transformations, for example, taking logarithms of positively skewed data.

Compare symmetry.

Transformation to Normality

If there is evidence of marked non-normality then we may be able to remedy this by applying suitable transformations.

For data which are skewed to the right (positive skew), the more commonly used transformations, in order of increasing strength, are sqrt(x), log(x) and 1/x, where the x's are the data values.

For data which are skewed to the left (negative skew), the more commonly used transformations, in order of increasing strength, are squaring, cubing, and exp(x).
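
The effect of such transformations can be checked numerically. The sketch below measures skewness with the moment coefficient (third central moment divided by the 1.5 power of the second), which is one common definition and an assumption here, and applies the square root and logarithm to a made-up right-skewed sample. For this particular sample the square root reduces the skewness and the logarithm reduces it further.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (positive = right skew)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample: most values are small, one is very large.
data = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

raw_skew = skewness(data)
sqrt_skew = skewness([math.sqrt(v) for v in data])
log_skew = skewness([math.log(v) for v in data])

print(f"raw:  {raw_skew:.2f}")
print(f"sqrt: {sqrt_skew:.2f}")
print(f"log:  {log_skew:.2f}")
```

Note that log(x) and sqrt(x) require positive (or, for sqrt, non-negative) data values, so the sample is chosen accordingly.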


Scatter Plot

A scatterplot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line. It gives a good visual picture of the relationship between the two variables, and aids the interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatterplot, on which points are plotted but not joined. The resulting pattern indicates the type and strength of the relationship between the two variables.

Illustrations

   a. The more the points tend to cluster around a straight line, the stronger the linear relationship between the two variables (the higher the correlation).
   b. If the line around which the points tend to cluster runs from lower left to upper right, the relationship between the two variables is positive (direct).
   c. If the line around which the points tend to cluster runs from upper left to lower right, the relationship between the two variables is negative (inverse).
   d. If there exists a random scatter of points, there is no relationship between the two variables (very low or zero correlation).
   e. Very low or zero correlation could result from a non-linear relationship between the variables. If the relationship is in fact non-linear (points clustering around a curve, not a straight line), the correlation coefficient will not be a good measure of the strength.

A scatterplot will also show up a non-linear relationship between the two variables and whether or not there exist any outliers in the data.

More information can be added to a two-dimensional scatterplot; for example, we might label points with a code to indicate the level of a third variable.

If we are dealing with many variables in a data set, a way of presenting all possible scatter plots of two variables at a time is in a scatterplot matrix.

Sample Mean

The sample mean is an estimator available for estimating the population mean µ. It is a measure of location, commonly called the average, often symbolised x̄.

Its value depends equally on all of the data, which may include outliers. It may not appear representative of the central region for skewed data sets.



       It is especially useful as being                     57 55 85 24 33 49 94 2 8
representative of the whole sample for              Data    51 71 30 91 6 47 50 65 43
use in subsequent calculations.                             41 7
                                                            2 6 7 8 24 30 33 41 43 47
Example                                             Ordered
                                                            49 50 51 55 57 65 71 85
       Lets say our data set is: 5 3 54             Data
                                                            91 94
93 83 22 17 19.
       The sample mean is calculated by taking the sum of all the data values
and dividing by the total number of data values:

       \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

For the data set 5, 3, 54, 93, 83, 22, 17, 19 the sum is 296 and n = 8, so the
sample mean is 296 / 8 = 37.


Median

       The median is the value halfway through the ordered data set, below and
above which there lies an equal number of data values.

       It is generally a good descriptive measure of location and works well
for skewed data, or data with outliers.

The median is the 0.5 quantile.

Example

       With an odd number of data values, for example 21, we have:

       Data           96 48 27 72 39 70 7 68 99 36 95 4 6 13 34 74 65 42 28 54 69
       Ordered Data   4 6 7 13 27 28 34 36 39 42 48 54 65 68 69 70 72 74 95 96 99
       Median         48, leaving ten values below and ten values above

       With an even number of data values, for example 20, we have:

       Data           57 55 85 24 33 49 94 2 8 51 71 30 91 6 47 50 65 43 41 7
       Ordered Data   2 6 7 8 24 30 33 41 43 47 49 50 51 55 57 65 71 85 91 94
       Median         halfway between the two 'middle' data points, in this case
                      halfway between 47 and 49, so the median is 48


Mode

       The mode is the most frequently occurring value in a set of discrete data.
There can be more than one mode if two or more values are equally common.

Example

       Suppose the results of an end-of-term Statistics exam were distributed
as follows:

       Student   Score
          1        94
          2        81
          3        56
          4        90
          5        70
          6        65
          7        90
          8        90
          9        30

       Then the mode (most common score) is 90, and the median (middle score)
is 81.
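
The worked examples for the sample mean, median and mode can be checked with a short script. This is only a sketch, assuming Python 3.8 or later and its standard `statistics` module. The helper name `median_by_hand` is hypothetical and is included just to make the odd/even rule explicit. The data values are copied from the examples in this handout.

```python
from statistics import mean, median, multimode

# Data sets copied from the worked examples in this handout.
mean_data = [5, 3, 54, 93, 83, 22, 17, 19]
odd_data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
            4, 6, 13, 34, 74, 65, 42, 28, 54, 69]          # 21 values
even_data = [57, 55, 85, 24, 33, 49, 94, 2, 8, 51,
             71, 30, 91, 6, 47, 50, 65, 43, 41, 7]         # 20 values
exam_scores = [94, 81, 56, 90, 70, 65, 90, 90, 30]


def median_by_hand(values):
    """Hypothetical helper: median via the odd/even rule described in the text."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value of the ordered data.
        return ordered[mid]
    # Even count: halfway between the two 'middle' values.
    return (ordered[mid - 1] + ordered[mid]) / 2


sample_mean = mean(mean_data)              # sum of values divided by their count
odd_median = median_by_hand(odd_data)      # middle value of 21 ordered values
even_median = median_by_hand(even_data)    # halfway between 10th and 11th values
exam_modes = multimode(exam_scores)        # list of all most-common values
exam_median = median(exam_scores)          # library median for comparison

print(sample_mean, odd_median, even_median, exam_modes, exam_median)
```

`multimode` returns a list, which matches the remark that a data set can have more than one mode when two or more values are equally common.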


Book001(statweb.blogspot.com)
