A PROJECT REPORT
ON
“A STATISTICAL ANALYSIS OF SALES FROM SUPERMARKET”
SUBMITTED BY
CHAUDHARI SURAJKUMAR D.
HIRAPARA HIREN M.
MISTRY RADHESH S.
NADKARNI SAHIL K.
PASI VIPULKUMAR K.
PATEL JINAL D.
IN PARTIAL FULFILLMENT OF THE DEGREE
OF MASTER OF SCIENCE IN STATISTICS
GUIDED BY
Dr. ARTI RAJYAGURU
DEPARTMENT OF STATISTICS
VEER NARMAD SOUTH GUJARAT UNIVERSITY
SURAT
2021
Certificate
DEPARTMENT OF STATISTICS
VEER NARMAD SOUTH GUJARAT UNIVERSITY, SURAT
(Re-Accredited with 'Grade- A' by NAAC)
This is to certify that the project on "A Statistical Analysis of Sales from
Supermarket" submitted by Chaudhari Suraj D. (Roll No. 03),
Hirapara Hiren M. (Roll No. 05), Mistry Radhesh S. (Roll No. 06),
Nadkarni Sahil K. (Roll No. 07), Pasi Vipul K. (Roll No. 12) and
Patel Jinal D. (Roll No. 13), students of M.Sc. Statistics (Semester IV),
is submitted to the Department of Statistics, V. N. S. G. University,
Surat, for the academic year 2020-21 in partial fulfillment of the
degree of M.Sc. (Statistics).
PROF. & HEAD
Department of Statistics
Veer Narmad South Gujarat University
Surat-395007
THIS PROJECT IS
DEDICATED
TO
THE DEPARTMENT OF
STATISTICS
ALL OUR PROFESSORS
OUR GUIDE
AND
OUR GROUP
ACKNOWLEDGEMENT
We are highly grateful to the honorable Dr. A. J. Rajyaguru, Head of
the Department of Statistics, V.N.S.G.U., Surat, for her ever-helping attitude
and for encouraging us to excel in our studies. She has not only made us work
but has also guided us towards research.
We are also thankful to the entire staff of the department and to all
those who have helped or supported us, directly or indirectly.
This acknowledgment will not be complete until we pay our gratitude
to our families, whose enthusiasm to see this work completed was as infectious as
their inspiration.
(CHAUDHARI SURAJKUMAR D.) (HIRAPARA HIREN M.)
(MISTRY RADHESH S.) (NADKARNI SAHIL K.)
(PASI VIPULKUMAR K.) (PATEL JINAL D.)
DECLARATION
We, the students of M.Sc. (Statistics) [Semester IV] of the
Department of Statistics at VEER NARMAD SOUTH GUJARAT
UNIVERSITY, Surat, hereby declare that we have completed our
project entitled "A STATISTICAL ANALYSIS OF SALES FROM
SUPERMARKET" in the academic year 2020-21. The information
submitted herein is true and original to the best of our knowledge.
(CHAUDHARI SURAJKUMAR D.) (HIRAPARA HIREN M.)
(MISTRY RADHESH S.) (NADKARNI SAHIL K.)
(PASI VIPULKUMAR K.) (PATEL JINAL D.)
INDEX
SECTION   TITLE
1         INTRODUCTION
  1.0     INTRODUCTION OF SUPERMARKET
  1.1     OBJECTIVES OF THE STUDY
  1.2     DATA COLLECTION
  1.3     STATISTICAL TECHNIQUES
2         STATISTICAL ANALYSIS
  2.0     WHAT IS STATISTICAL ANALYSIS?
  2.1     ANALYSIS AND INTERPRETATION
  2.2     FINDINGS
  2.3     LIMITATIONS
•         REFERENCES
SECTION – 1
INTRODUCTION
Statistics plays a vital role in our day-to-day life. Statistics has been
defined by different authors in a variety of ways. The varied and
outstanding contributions of Prof. R. A. Fisher put the subject of
statistics on a very firm footing and earned it the status of a fully fledged
science.
According to Bowley, "Statistics are numerical statements of facts in
any department of enquiry placed in relation to each other."
"Statistics is the grammar of science."
"A research journal serves that narrow borderland which
separates the known from the unknown."
- Prasanta C. Mahalanobis
(1.0) Introduction of Supermarket:
What is a supermarket?
A supermarket is a self-service shop offering a wide variety of food,
beverages and household products, organized into sections. It is larger
and has a wider selection than earlier grocery stores, but is smaller
and more limited in the range of merchandise than a hypermarket or
big-box market.
In everyday Indian usage, however, "grocery store" is a synonym for
supermarket, and is not used to refer to other types of stores that sell
groceries.
Supermarkets typically are chain stores, supplied by the distribution
centers of their parent companies, thus increasing opportunities for
economies of scale. Supermarkets usually offer products at relatively
low prices by using their buying power to buy goods from
manufacturers at lower prices than smaller stores can. They also
minimize financing costs by paying for goods at least 30 days after
receipt, and some extract credit terms of 90 days or more from
vendors. Certain products (typically staple foods such as bread, milk
and sugar) are occasionally sold as loss leaders so as to attract
shoppers to the store. Supermarkets make up for their low margins
with a high volume of sales and with sales of higher-margin items bought by
the attracted shoppers. Self-service with shopping carts (trolleys) or
baskets reduces labor costs, and many supermarket chains are
attempting further reduction by shifting to self-service check-out.
Real-time data allows grocery stores and supermarkets to forecast the
potential sales and demand of their items through predictive analytics,
highlighting which items are in demand and those to discard.
Essentially, they use so-called Recency, Frequency, Value (or RFV)
analysis to look at the transactional behavior of their customers and to
score customers using a combination of how often they shop, how
many items they purchase and how much they spend.
Some well-known supermarket chains in India are 7-Eleven, Big Bazaar,
D-Mart, Easyday, Foodworld, HyperCity, Lulu Hypermarket and Maveli
Stores.
(1.1) Objectives of the study:
1) To visualize how the explanatory variables, i.e., Branch,
Customer type, Gender, Product line and Payment type,
affect the study variable, sales.
2) To check the main and interaction effects of the explanatory
variables on sales.
3) To fit an appropriate time series model and analyze the
study variable, sales.
(1.2) Data Collection:
We have used secondary data. The data on supermarket sales were
collected from the Kaggle website.
The data cover the period from 1 January 2019 to 30 March 2019. The
analysis of the supermarket is done on the basis of this secondary data.
Information about the Kaggle website:
Kaggle got its start in 2010 by offering machine learning competitions
and now also offers a public data platform, a cloud-based workbench
for data science, and artificial intelligence education. Its key
personnel were Anthony Goldbloom and Jeremy Howard. Nicholas
Gruen was the founding chair, succeeded by Max Levchin. Equity was
raised in 2011, valuing the company at $25 million. On 8 March 2017,
Google announced that it was acquiring Kaggle.
Kaggle, a subsidiary of Google LLC, is an online community of data
scientists and machine learning practitioners. Kaggle allows users to
find and publish data sets, explore and build models in a web-based
data science environment, work with other data scientists and
machine learning engineers, and enter competitions to solve data
science challenges.
Kaggle services:
Machine learning competitions: This was Kaggle's first product.
Companies post problems and machine learners compete to build the
best algorithm, typically for cash prizes.
Kaggle Kernels: a cloud-based workbench for data science and machine
learning that allows data scientists to share code and analysis in Python, R
and R Markdown. Over 150k "kernels" (code snippets) have been
shared on Kaggle, covering everything from sentiment analysis to object
detection.
Public datasets platform: community members share datasets with each
other, covering everything from bone x-rays to results from boxing bouts.
Kaggle Learn: a platform for AI education in manageable chunks.
Type: Subsidiary
Industry: Data science
Founded: April 2010
Founders: Anthony Goldbloom, Ben Hamner
Headquarters: San Francisco, United States
Key people: Anthony Goldbloom (CEO), Ben Hamner (CTO), Jeff Moser (Chief Architect)
Products: Competitions, Kaggle Kernels, Kaggle Datasets, Kaggle Learn
Owner: Alphabet Inc. (2017-present)
Parent: Google (2017-present)
Data attribute information:
Invoice id: computer-generated sales slip invoice identification number
Branch: branch of the supercenter (3 branches are available, identified by A, B and C)
City: location of the supercenter
Customer type: type of customer, recorded as Member for customers
using a member card and Normal for those without a member card
Gender: gender of the customer
Product line: general item categorization groups - Electronic
accessories, Fashion accessories, Food and beverages, Health and beauty,
Home and lifestyle, Sports and travel
Unit price: price of each product in $
Quantity: number of products purchased by the customer
Tax: 5% tax fee on the customer's purchase
Total: total price including tax
Date: date of purchase (records available from 1 January 2019 to 30 March 2019)
Time: purchase time (10am to 9pm)
Payment: payment method used by the customer for the purchase (3 methods
are available - Cash, Credit card and E-wallet)
Cogs: cost of goods sold
Gross margin percentage: gross margin percentage
Gross income: gross income
Rating: customer satisfaction rating of their overall shopping
experience (on a scale of 1 to 10)
Data Formation
Categorical data: Branch, City, Gender, Customer type, Product line, Payment type
Numerical data:   COGS (Cost of Goods Sold), Rating
Date/time data:   Date, Time
(1.3) Statistical Techniques:
"Statistics may be defined as the science of collection, presentation,
analysis and interpretation of numerical data."
Thus, we use statistical concepts to analyze the data. A brief
introduction to the several methods used to analyze the data follows:
(1.3.1) Descriptive Statistics:
Descriptive statistics are used to describe the basic features of the data
in the study. They provide simple summaries about the sample and
the measures, together with simple graphical analysis.
Descriptive statistics are typically distinguished from inferential
statistics. With descriptive statistics we simply describe what the data
show. We have used descriptive statistics such as frequency
distributions, crosstabs and charts, measures of central tendency,
measures of dispersion, skewness, kurtosis, range, etc.
Cross Tabulation:
A cross tabulation displays the joint distribution of two or more
variables. They are usually presented as a contingency table in a
matrix format which describes the distribution of two or more
variables simultaneously.
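As an illustration of how such a cross tabulation could be produced outside SPSS, the short sketch below uses Python's pandas library on a few made-up transactions; the column names (Branch, Payment, cogs) follow the attribute list in Section 1.2, but the records themselves are hypothetical.

import pandas as pd

# Hypothetical sample of supermarket transactions (not the real Kaggle records)
sales = pd.DataFrame({
    "Branch":  ["A", "A", "B", "C", "C", "B"],
    "Payment": ["Cash", "Ewallet", "Cash", "Credit card", "Cash", "Ewallet"],
    "cogs":    [120.5, 89.0, 230.1, 45.9, 310.2, 150.0],
})

# Joint frequency distribution of Branch and Payment type
counts = pd.crosstab(sales["Branch"], sales["Payment"], margins=True)

# Same layout, but summing cost of goods sold instead of counting rows
totals = pd.crosstab(sales["Branch"], sales["Payment"],
                     values=sales["cogs"], aggfunc="sum", margins=True)

print(counts)
print(totals)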
(1.3.2) Graphical Representation:
Charts:
Once the data have been collected, the crucial problem becomes
learning whatever we can from the data. A graph is a powerful tool for
describing a dataset. A large dataset needs to be presented in
graphical form so that the structure of the underlying data can be
captured. A quick glance at a picture elucidates the point more easily
than does a page filled with words and numbers. The term chart, as a
visual representation of data, has multiple meanings.
1. Pie Charts:
A pie chart is also called an "angular chart": a circle divided into portions
that represent the relative frequencies or percentages of different
categories or classes. This chart represents the values of the variable as
shares of 360°, with the circle divided into slices proportional to each
category.
2. Line Charts:
Line charts are used to show the pattern of changes over a period of
time, called the trend. For example, to observe the pattern of change in
the height of a child, the closing prices of stocks, or the GDP of a
country, line charts are used.
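As a brief sketch of these two chart types in Python (matplotlib), using the branch-wise COGS totals reported later in Section 2.1 for the pie chart and a made-up short series for the line chart:

import matplotlib.pyplot as plt

# Branch-wise sums of COGS (values taken from the summary table in Section 2.1)
labels = ["Branch A", "Branch B", "Branch C"]
shares = [101143.21, 101140.64, 105303.53]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.pie(shares, labels=labels, autopct="%.0f%%")   # each slice's angle is share/total * 360 degrees
ax1.set_title("Sum of COGS by branch")

# A line chart showing a made-up value changing over time (trend)
days = list(range(1, 11))
daily_sales = [160, 135, 129, 122, 108, 150, 170, 140, 155, 165]
ax2.plot(days, daily_sales, marker="o")
ax2.set_xlabel("Day")
ax2.set_ylabel("Daily sales")
ax2.set_title("Daily sales over time")

plt.tight_layout()
plt.show()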
(1.3.3) GLM Univariate Analysis:
The GLM Univariate procedure provides regression analysis and
analysis of variance for one dependent variable by one or more factors
and/or variables. The factor variables divide the population into
groups. Using this General Linear Model procedure, you can test null
hypotheses about the effects of other variables on the means of
various groupings of a single dependent variable. You can investigate
interactions between factors as well as the effects of individual
factors, some of which may be random. In addition, the effects of
covariates and covariate interactions with factors can be included. For
regression analysis, the explanatory (predictor) variables are specified
as covariates.
Both balanced and unbalanced models can be tested. A design is
balanced if each cell in the model contains the same number of cases.
In addition to testing hypotheses, GLM Univariate produces estimates
of parameters.
Commonly used a priori contrasts are available to perform hypothesis
testing. Additionally, after an overall F test has shown significance,
you can use post hoc tests to evaluate differences among specific
means. Estimated marginal means give estimates of predicted mean
values for the cells in the model, and profile plots (interaction plots)
of these means allow you to easily visualize some of the relationships.
Residuals, predicted values, Cook's distance, and leverage values can
be saved as new variables in your data file for checking assumptions.
WLS Weight allows you to specify a variable used to give
observations different weights for a weighted least-squares (WLS)
analysis, perhaps to compensate for a different precision of
measurement.
Example. Data are gathered for individual runners in the Chicago
marathon for several years. The time in which each runner finishes is
the dependent variable. Other factors include weather (cold, pleasant,
or hot), number of months of training, number of previous marathons,
and gender. Age is considered a covariate. You might find that gender
is a significant effect and that the interaction of gender with weather
is significant.
Methods. Type I, Type II, Type III, and Type IV sums of squares can
be used to evaluate different hypotheses. Type III is the default.
Statistics. Post hoc range tests and multiple comparisons: least
significant difference, Bonferroni, Sidak, Scheffé, Ryan-Einot-
Gabriel-Welsch multiple F, Ryan-Einot-Gabriel-Welsch multiple
range, Student-Newman-Keuls, Tukey's honestly significant
difference, Tukey's b, Duncan, Hochberg's GT2, Gabriel, Waller-
Duncan t test, Dunnett (one-sided and two-sided), Tamhane's T2,
Dunnett's T3, Games-Howell, and Dunnett's C. Descriptive statistics:
observed means, standard deviations, and counts for all of the
dependent variables in all cells. The Levene test for homogeneity of
variance.
Plots. Spread-versus-level, residual, and profile (interaction) plots.
➢ GLM Univariate Data Considerations
Data. The dependent variable is quantitative. Factors are categorical.
They can have numeric values or string values of up to eight
characters. Covariates are quantitative variables that are related to the
dependent variable.
Assumptions. The data are a random sample from a normal
population; in the population, all cell variances are the same. Analysis
of variance is robust to departures from normality, although the data
should be symmetric. To check assumptions, you can use
homogeneity of variances tests and spread-versus-level plots. You can
also examine residuals and residual plots.
➢ To Obtain GLM Univariate Tables
From the menus choose:
Analyze
General Linear Model
Univariate...
Select a dependent variable.
Select variables for Fixed Factor(s), Random Factor(s), and
Covariate(s), as appropriate for your data.
Optionally, you can use WLS Weight to specify a weight variable
for weighted least-squares analysis. If the value of the weighting
variable is zero, negative, or missing, the case is excluded from the
analysis. A variable already used in the model cannot be used as a
weighting variable.
▪ GLM Model
Specify Model. A full factorial model contains all factor main effects,
all covariate main effects, and all factor-by-factor interactions. It does
not contain covariate interactions. Select Custom to specify only a
subset of interactions or to specify factor-by-covariate interactions.
You must indicate all of the terms to be included in the model.
Factors and Covariates. The factors and covariates are listed with
(F) for fixed factor and (C) for covariate. In a Univariate analysis, (R)
indicates a random factor.
Model. The model depends on the nature of your data. After selecting
Custom, you can select the main effects and interactions that are of
interest in your analysis.
Sum of squares. The method of calculating the sums of squares. For
balanced or unbalanced models with no missing cells, the Type III
sum-of-squares method is most commonly used.
Include intercept in model. The intercept is usually included in the
model. If you can assume that the data pass through the origin, you
can exclude the intercept.
▪ Specifying Models for GLM
From the menus choose:
Analyze
General Linear Model
Choose Univariate or Multivariate.
In the dialog box, click Model.
In the Model dialog box, select Custom.
Select one or more factors or covariates or a combination of
factors and covariates.
Select a method for building the terms and click the move button.
Repeat until you have all of the terms that you want in the model.
Do not use the same term more than once in the model.
Select a type of sums of squares and whether or not you want the
intercept.
GLM Multivariate is available only if you have the Advanced Models
option installed.
▪ Build Terms
For the selected factors and covariates:
Interaction. Creates the highest-level interaction term of all selected
variables. This is the default.
Main effects. Creates a main-effects term for each variable selected.
All 2-way. Creates all possible two-way interactions of the selected
variables.
All 3-way. Creates all possible three-way interactions of the selected
variables.
All 4-way. Creates all possible four-way interactions of the selected
variables.
All 5-way. Creates all possible five-way interactions of the selected
variables.
▪ Sum of Squares
For the model, you can choose a type of sums of squares. Type III is
the most commonly used and is the default.
Type I. This method is also known as the hierarchical decomposition
of the sum-of-squares method. Each term is adjusted for only the term
that precedes it in the model. Type I sums of squares are commonly
used for:
• A balanced ANOVA model in which any main effects are
specified before any first-order interaction effects, any first-
order interaction effects are specified before any second-order
interaction effects, and so on.
• A polynomial regression model in which any lower-order terms
are specified before any higher-order terms.
• A purely nested model in which the first-specified effect is
nested within the second-specified effect, the second-specified
effect is nested within the third, and so on. (This form of nesting
can be specified only by using syntax.)
Type II. This method calculates the sums of squares of an effect in
the model adjusted for all other "appropriate" effects. An appropriate
effect is one that corresponds to all effects that do not contain the
effect being examined. The Type II sum-of-squares method is
commonly used for:
• A balanced ANOVA model.
• Any model that has main factor effects only.
• Any regression model.
• A purely nested design. (This form of nesting can be specified
by using syntax.)
Type III. The default. This method calculates the sums of squares of
an effect in the design as the sums of squares adjusted for any other
effects that do not contain it and orthogonal to any effects (if any) that
contain it. The Type III sums of squares have one major advantage in
that they are invariant with respect to the cell frequencies as long as
the general form of estimability remains constant. Hence, this type of
sums of squares is often considered useful for an unbalanced model
with no missing cells. In a factorial design with no missing cells, this
method is equivalent to the Yates' weighted-squares-of-means
technique. The Type III sum-of-squares method is commonly used
for:
• Any models listed in Type I and Type II.
• Any balanced or unbalanced model with no empty cells.
Type IV. This method is designed for a situation in which there are
missing cells. For any effect F in the design, if F is not contained in
any other effect, then Type IV = Type III = Type II. When F is
contained in other effects, Type IV distributes the contrasts being
made among the parameters in F to all higher-level effects equitably.
The Type IV sum-of-squares method is commonly used for:
• Any models listed in Type I and Type II.
• Any balanced model or unbalanced model with empty cells.
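To make the distinction between the sum-of-squares types concrete, the sketch below fits a small unbalanced two-factor model in Python with statsmodels and prints the Type I, II and III ANOVA tables (Type IV is not offered there). The data are made up, and for Type III to reproduce SPSS exactly, sum-to-zero contrasts rather than the default treatment coding would be needed.

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Small unbalanced two-factor layout with a made-up response
df = pd.DataFrame({
    "branch": ["A"] * 6 + ["B"] * 4 + ["C"] * 5,
    "gender": ["Male", "Female"] * 7 + ["Male"],
    "sales":  rng.normal(1000, 200, 15),
})

model = smf.ols("sales ~ C(branch) * C(gender)", data=df).fit()

# The same fitted model, decomposed three ways; with unbalanced data the
# factor sums of squares generally differ between the types
print(anova_lm(model, typ=1))   # sequential (hierarchical) decomposition
print(anova_lm(model, typ=2))   # adjusted for effects not containing the term
print(anova_lm(model, typ=3))   # adjusted for every other effect (SPSS default)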
▪ GLM Contrasts
Contrasts are used to test for differences among the levels of a factor.
You can specify a contrast for each factor in the model (in a repeated
measures model, for each between-subjects factor). Contrasts
represent linear combinations of the parameters.
GLM Univariate
Hypothesis testing is based on the null hypothesis LB = 0, where L is
the contrast coefficients matrix and B is the parameter vector. When a
contrast is specified, SPSS creates an L matrix in which the columns
corresponding to the factor match the contrast. The remaining
columns are adjusted so that the L matrix is estimable.
The output includes an F statistic for each set of contrasts. Also
displayed for the contrast differences are Bonferroni-type
simultaneous confidence intervals based on Student's t distribution.
▪ Design of Experiment :
One-Way Analysis of Variance (ANOVA) examines the differences
among the means of two or more samples. One-way ANOVA is
used when you have a categorical explanatory variable (with two or
more categories) and a normally distributed interval or ratio
dependent variable.
 Analysis of Variance can be done using one of the following
experimental designs:
(1) Completely Randomized Design:
In this experimental design, there is only one explanatory variable
with two or more levels (also called treatments or classifications),
and the difference in mean scores across these two or more
populations is examined.
(2) Randomized Block Design:
In this experimental design, there is one explanatory variable with two
or more levels, and a second variable, called the blocking variable,
which the researcher wants to control.
(3) Factorial Design:
There are two or more explanatory variables in this experimental
design. Every level of an explanatory variable is studied for each level
of the remaining (other) explanatory variables. In a factorial design, the
impact of two (or more) explanatory variables is examined
simultaneously. If there are two explanatory variables, we use Two-way
ANOVA.
(1.3.4) Time Series Analysis:
➢ Introduction:
A time series is a set of observations obtained by measuring a single
variable regularly over a period of time. In a series of inventory data,
for example, the observations might represent daily inventory levels
for several months. A series showing the market share of a product
might consist of weekly market share taken over a few years. A series
of total sales figures might consist of one observation per month for
many years. What each of these examples has in common is that some
variable was observed at regular, known intervals over a certain
length of time. Thus, the form of the data for a typical time series is a
single sequence or list of observations representing measurements
taken at regular intervals. A small example of such a record is shown in Table 1.

Table 1. Daily inventory time series
Time   Week   Day         Inventory level
t1     1      Monday      160
t2     1      Tuesday     135
t3     1      Wednesday   129
t4     1      Thursday    122
t5     1      Friday      108
t6     2      Monday      150
...
t60    12     Friday      120
One of the most important reasons for doing time series analysis is to
try to forecast future values of the series. A model of the series that
explained the past values may also predict whether and how much the
next few values will increase or decrease. The ability to make such
predictions successfully is obviously important to any business or
scientific field.
➢ Utility of Time Series Analysis:
The analysis of Time series is useful in many areas, such as
econometrics, commerce, business, meteorology, demography, for the
reasons given below:
1. It gives a general description of the past behavior of the
series: By recording data over a period of time one can easily
understand the changes that have taken place in the past. In
other words, a time series enables us to study the past behavior of
a phenomenon under consideration.
2. It helps in forecasting future behavior on the basis of
past behavior: A very important use of time series analysis is to
make forecasts about the likely value of a variable in the future if the
past behavior continues. The importance of forecasting in
business and econometrics lies in its role in planning and
administration.
3. It facilitates comparison: Once the time series data are
recorded, comparison between the values of the variable at
different time points becomes handy. It helps to compare
variations in the values of a variable over time and to analyze the
causes of such variations.
4. It helps in the evaluation of current accomplishments: The
analysis of time series data greatly helps us in the review and
evaluation of progress made in various economic, business and
social activities; for example, the progress of five-year plans
may be judged by studying the variations in the yearly rates of
growth in gross national product (GNP). Similarly, variations
in the general price level indicate changes in the value of
money over a period of time.
❖ Components of Time Series:
Empirical studies of a number of time series have revealed
the presence of certain characteristic movements or fluctuations.
These characteristic movements of a time series may be
classified into four different categories, called the components of a
time series. In a long time series, we generally have the following four
components:
1. Secular trend or long-term movements
2. Seasonal variations
3. Cyclic variations
4. Random or irregular movements
1.) Secular Trend:
The word trend means 'tendency'. The secular trend is that component
of the time series which gives the general tendency of the data over a
long period. It is the smooth, regular and long-term movement of a series.
The steady growth in the sales of a particular commodity of a
company, or the fall in demand for a certain article over long years, can
be studied through the secular trend. Note that rapid fluctuations
cannot give the trend. The growth of population in a locality over decades
is a good example of a secular trend.
2.) Seasonal Variations:
If we observe the sales structure of clothes in the market, we will find
that the sales curve is not uniform throughout the year. It shows
different trends in different seasons. It depends entirely on the locality
and the people who reside there. It can also be seen that every
year the sales structure is more or less the same as in the previous year
during those periods. So, this component occurs uniformly and regularly.
This variation is periodic in nature and regular in character.
3.) Cyclic Variations:
Apart from seasonal variations, there is another type of fluctuation
which usually lasts for more than a year. This fluctuation is the effect
of business cycles. In every business there are four important phases:
i) prosperity, ii) decline, iii) depression, and iv) improvement or
regain. The time from prosperity to regain is a complete cycle. Such a
cycle will never show regular periodicity. The period of a cycle
may differ but, importantly, the sequence of changes should be more
or less regular, and it is this regularity which enables us to
study cyclical fluctuations.
4.) Random or Irregular Movements:
These are, as the name suggests, totally unpredictable. The effects of
floods, droughts, famines, earthquakes, etc. are known as irregular
variations. All variations excluding trend, seasonal and cyclical
variations are irregular. Sometimes, though, cyclical fluctuations too
can be triggered by natural calamities.
➢ Seasonal Decomposition:
The Seasonal Decomposition procedure decomposes a series into a
seasonal component, a combined trend and cycle component, and an
"error" component. The procedure is an implementation of the Census
Method I, otherwise known as the ratio-to-moving-average method.
Example. A scientist is interested in analyzing monthly
measurements of the ozone level at a particular weather station. The
goal is to determine if there is any trend in the data. In order to
uncover any real trend, the scientist first needs to account for the
variation in readings due to seasonal effects. The Seasonal
Decomposition procedure can be used to remove any systematic
seasonal variations. The trend analysis is then performed on a
seasonally adjusted series.
Statistics. The set of seasonal factors.
Data. The variables should be numeric.
Assumptions. The variables should not contain any embedded
missing data. At least one periodic date component must be defined.
For instructions on handling missing data, see the topic on replacing
missing values.
Here we will discuss the multiplicative and additive models.
The analysis of a time series is the decomposition of a time series into
its different components for their separate study. The process of
analyzing a time series is to isolate and measure its various
components. We try to answer the following questions when we
analyze a time series:
1. What would have been the value of the variable at
different points of time if it were influenced only by long-term
movements?
2. What changes occur in the value of the variable due to
seasonal variations?
3. To what extent and in what direction has the variable been
affected by cyclical fluctuations?
4. What has been the effect of irregular variations?
The study of a time series is mainly required for estimation and
forecasting. An ideal forecast should be based on forecasts of the various
types of fluctuations. Separate forecasts should be made of the trend,
seasonal and cyclical variations; no reliable forecast can be made of the
irregular movements. Therefore, it is necessary to separate and measure
the various types of fluctuations present in a time series.
A value of the time series variable is considered the resultant of the
combined impact of its components. The components of a time series
follow either the multiplicative or the additive model.
Let Y= original observation, T= trend component, S=seasonal
component, C=cyclical component, and I=irregular component.
Multiplicative Model:
It is assumed that the value Y of a composite series is the product of
the four components. That is
Y=T×S×C×I,
where T is given in original units of Y, but S, C, and I are expressed
as percentage unit-less index numbers.
Additive Model:
It is assumed that the value of Y of a composite series is the sum of
the four components. That is
Y=T+S+C+I,
where T, S, C, and I all are given in the original units of Y.
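A minimal sketch of such a decomposition in Python is given below, assuming a made-up monthly series (the report's own data cover only about three months, so a longer artificial series is used here simply to show the additive and multiplicative variants):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Made-up monthly series with a trend and a yearly seasonal pattern
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
trend = np.linspace(100, 160, 48)
season = 10 * np.sin(2 * np.pi * np.arange(48) / 12)
noise = np.random.default_rng(1).normal(0, 3, 48)
y = pd.Series(trend + season + noise, index=idx)

# Additive model: Y = T + S + I (with C folded into the trend-cycle component)
add = seasonal_decompose(y, model="additive", period=12)

# Multiplicative model: Y = T * S * I, with S and I as unit-less indices
mul = seasonal_decompose(y, model="multiplicative", period=12)

print(add.seasonal.head(12))   # seasonal component in the original units
print(mul.seasonal.head(12))   # seasonal factors varying around 1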
Time series analysis is the analysis of a series of data points over
time, allowing one to answer questions such as: what is the causal
effect on a variable Y of a change in variable X over time? An
important difference between time series and cross-section data is that
in time series the ordering of cases does matter.
Rather than dealing with individuals as units, the unit of interest is
time: the value of Y at time t is Yt. The unit of time can be anything
from days to election years. The value of Yt in the previous period is
called the first lag value, Yt-1; the jth lag is denoted Yt-j. Similarly,
Yt+1 is the value of Y in the next period. So, a simple bivariate
regression equation for time series data looks like:
Yt = β0 + β1 Xt + ut
Yt is treated as a random variable. If Yt is generated by some model
(a regression model for time series, i.e., Yt = xt β + εt with E(εt | xt) = 0), then
ordinary least squares (OLS) provides a consistent estimate of β.
➢ Regression:
Regression is a statistical method used in finance, investing, and
other disciplines that attempts to determine the strength and character
of the relationship between one dependent variable (usually denoted
by Y) and a series of other variables (known as explanatory
variables).
Regression Explained
The two basic types of regression are simple linear regression and
multiple linear regression, although there are non-linear regression
methods for more complicated data and analysis. Simple linear
regression uses one explanatory variable to explain or predict the
outcome of the dependent variable Y, while multiple linear regression
uses two or more explanatory variables to predict the outcome.
Regression can help finance and investment professionals as well as
professionals in other businesses. Regression can also help predict
sales for a company based on weather, previous sales, GDP growth,
or other types of conditions.
The general form of each type of regression is:
• Simple linear regression: Y = a + bX + u
• Multiple linear regression: Y = a + b1X1 + b2X2 + ... + btXt + u
Where:
• Y = the variable that you are trying to predict (dependent
variable).
• X = the variable that you are using to predict Y (explanatory
variable).
• a = the intercept.
• b = the slope.
• u = the regression residual.
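The following is a small sketch of both forms in Python using statsmodels on simulated data; the variable names and coefficient values are made up.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Made-up data: Y depends on two explanatory variables X1 and X2
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 5 + 2.0 * X1 - 1.5 * X2 + rng.normal(scale=1.0, size=n)

# Simple linear regression: Y = a + b*X1 + u
simple = sm.OLS(Y, sm.add_constant(X1)).fit()

# Multiple linear regression: Y = a + b1*X1 + b2*X2 + u
X = sm.add_constant(np.column_stack([X1, X2]))
multiple = sm.OLS(Y, X).fit()

print(simple.params)      # estimated intercept a and slope b
print(multiple.summary())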
➢ Dummy Technique:
In general, the explanatory variables in any regression analysis are
assumed to be quantitative in nature. For example, the variables like
temperature, distance, age etc. are quantitative in the sense that they
are recorded on a well-defined scale.
In many applications, the variables cannot be defined on a well-
defined scale, and they are qualitative in nature.
For example, the variables like sex (male or female), color (black,
white), nationality, employment status (employed, unemployed) are
defined on a nominal scale. Such variables do not have any natural
scale of measurement. Such variables usually indicate the presence or
absence of a “quality” or an attribute like employed or unemployed,
graduate or non-graduate, smokers or non- smokers, yes or no,
acceptance or rejection, so they are defined on a nominal scale. Such
variables can be quantified by artificially constructing the variables
that take the values, e.g., 1 and 0 where “1” usually indicates the
presence of attribute and “0” usually indicates the absence of the
attribute. For example, “1” indicator that the person is male and “0”
indicates that the person is female. Similarly, “1” may indicate that
the person is employed and then “0” indicates that the person is
unemployed.
Such variables classify the data into mutually exclusive categories.
These variables are called indicator variables or dummy variables.
Usually, the indicator variables take on the values 0 and 1 to identify
the mutually exclusive classes of the explanatory variables. For
example,
D = 1, if the person is male; 0, if the person is female
D = 1, if the person is employed; 0, if the person is unemployed
Here we use the notation D in place of X to denote the dummy
variable. The choice of 1 and 0 to identify a category is arbitrary. For
example, one can also define the dummy variables in the above
examples the other way around:
D = 0, if the person is male; 1, if the person is female
D = 0, if the person is employed; 1, if the person is unemployed
It is also not necessary to choose only 1 and 0 to denote the categories.
In fact, any two distinct values of D will serve the purpose. The choices
of 1 and 0 are preferred as they make the calculations simple, help in the
easy interpretation of the values and usually turn out to be a
satisfactory choice.
In a given regression model, qualitative and quantitative variables can
also occur together, i.e., some variables are qualitative and others are
quantitative.
When all explanatory variables are
- quantitative, the model is called a regression model;
- qualitative, the model is called an analysis-of-variance model; and
- both quantitative and qualitative, the model is called an analysis-of-covariance model.
Such models can be dealt with within the framework of regression
analysis. The usual tools of regression analysis can be used in the case
of dummy variables.
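As a small illustration (with entirely made-up wage data), the sketch below builds a 0/1 indicator by hand and also lets the statsmodels formula interface construct the same dummy automatically:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical records with one qualitative and one quantitative regressor
df = pd.DataFrame({
    "wage":       [12.0, 15.5, 11.0, 18.2, 14.1, 16.7, 10.5, 19.0],
    "experience": [1, 4, 2, 8, 3, 6, 1, 9],
    "gender":     ["female", "male", "female", "male",
                   "female", "male", "female", "male"],
})

# Explicit 0/1 indicator: D = 1 if male, 0 if female
df["D"] = (df["gender"] == "male").astype(int)
manual = smf.ols("wage ~ experience + D", data=df).fit()

# Equivalent model letting the formula interface build the dummy automatically
auto = smf.ols("wage ~ experience + C(gender)", data=df).fit()

print(manual.params)   # coefficient on D: mean wage shift for males, other things equal
print(auto.params)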
➢ PACF Plot:
In time series analysis, the partial autocorrelation function (PACF)
gives the partial correlation of a stationary time series with its own
lagged values, after regressing out the values of the time series at all
shorter lags. It contrasts with the autocorrelation function, which does
not control for the other lags.
This function plays an important role in analyses aimed at
identifying the extent of the lag in an autoregressive model. Its use
was introduced as part of the Box-Jenkins approach to time series
modelling, whereby by plotting the partial autocorrelation function one
can determine the appropriate lag p in an AR(p) model or in an
extended ARIMA(p, d, q) model.
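A minimal sketch of producing such a plot in Python with statsmodels, using a simulated AR(2) series so that the expected cut-off after lag 2 is visible:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# Simulated AR(2) series: the PACF should cut off after lag 2
rng = np.random.default_rng(3)
n = 300
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

fig, ax = plt.subplots()
plot_pacf(y, lags=20, ax=ax)   # bars outside the confidence band suggest significant partial autocorrelations
plt.show()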
➢ Simple Exponential Smoothing:
Exponential smoothing is the most widely used class of procedures
for smoothing discrete time series in order to forecast the immediate
future. The idea of exponential smoothing is to smooth the original
series the way the moving average does and to use the smoothed
series in forecasting future values of the variable of interest. In
exponential smoothing, however, we want to allow the more recent
values of the series to have greater influence on the forecast of future
values than the more distant observations.
Exponential smoothing is a simple and pragmatic approach to
forecasting, whereby the forecast is constructed from an exponentially
weighted average of past observations. The largest weight is given to
the present observation, less weight to the immediately preceding
observation, even less weight to the observation before that, and so on
(exponential decay of influence of past data).
Non-Seasonal Simple Exponential Smoothing
This forecasting method is the most widely used of all forecasting
techniques. It requires little computation. This method is used when the
data pattern is approximately horizontal (i.e., there is neither cyclic
variation nor a pronounced trend in the historical data).
Let an observed time series be y1, y2, ..., yn. Formally, the simple
exponential smoothing equation takes the form
St+1 = α yt + (1 - α) St
where
St = the smoothed value of the time series at time t
yt = the actual value of the time series at time t
α = the smoothing constant
In the case of simple exponential smoothing, the smoothed statistic is the
forecast value:
Ft+1 = α yt + (1 - α) Ft
where Ft+1 is the forecast value of the time series at time t+1 and Ft is the
forecast value at time t.
This means:
Ft   = α yt-1 + (1 - α) Ft-1
Ft-1 = α yt-2 + (1 - α) Ft-2
Ft-2 = α yt-3 + (1 - α) Ft-3
Ft-3 = α yt-4 + (1 - α) Ft-4
Substituting,
Ft+1 = α yt + (1 - α) Ft
     = α yt + (1 - α)(α yt-1 + (1 - α) Ft-1)
     = α yt + α(1 - α) yt-1 + (1 - α)^2 Ft-1
     = α yt + α(1 - α) yt-1 + α(1 - α)^2 yt-2 + (1 - α)^3 Ft-2
     = α yt + α(1 - α) yt-1 + α(1 - α)^2 yt-2 + α(1 - α)^3 yt-3 + (1 - α)^4 Ft-3
Generalizing, the series of weights used in producing the forecast Ft+1 is
α, α(1 - α), α(1 - α)^2, α(1 - α)^3, ...
These weights decline toward zero in an exponential fashion; thus, as
we go back in the series, each value has a smaller weight in terms of
its effect on the forecast. The exponential decline of the weights
towards zero is evident.
Choosing α:
After the model is specified, its performance characteristics should be
verified or validated by comparison of its forecast with historical data
for the process it was designed to forecast.
We can use the error measures such as MAPE (Mean absolute
percentage error), MSE (Mean square error) or RMSE (Root mean
square error) and α is chosen such that the error is minimum.
Usually the MSE or RMSE can be used as the criterion for selecting
an appropriate smoothing constant. For instance, by assigning a value
from 0.1 to 0.99, we select the value that produces the smallest MSE
or RMSE.
Simple exponential smoothing method is used for a time series data
with no trend or seasonality. In this method, a single smoothing factor
or coefficient alpha (α) is used which decides the influence of past
values on the forecast. If α is closer to ‘1’, the forecast is more
impacted by the most recent values than the older values. The
opposite is true if α is close to ‘0’.
Simple. This model is appropriate for series in which there is no trend
or seasonality. Its only smoothing parameter is level. Simple
exponential smoothing is most similar to an ARIMA model with zero
orders of autoregression, one order of differencing, one order of
moving average, and no constant.
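A minimal sketch of simple exponential smoothing in Python, including the choice of α by the grid search on RMSE described above; the series is made up, and the initialisation F1 = y1 is one common convention rather than the report's (or SPSS's) exact procedure.

import numpy as np

def ses_forecasts(y, alpha):
    """One-step-ahead simple exponential smoothing forecasts F_t."""
    f = np.empty(len(y))
    f[0] = y[0]                      # common initialisation: first forecast = first observation
    for t in range(1, len(y)):
        f[t] = alpha * y[t - 1] + (1 - alpha) * f[t - 1]
    return f

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

# Made-up roughly horizontal series (no trend, no seasonality)
rng = np.random.default_rng(4)
y = 1000 + rng.normal(0, 150, 89)

# Grid search over alpha, keeping the value with the smallest RMSE
alphas = np.arange(0.1, 1.0, 0.01)
errors = [rmse(y, ses_forecasts(y, a)) for a in alphas]
best = alphas[int(np.argmin(errors))]

f = ses_forecasts(y, best)
next_forecast = best * y[-1] + (1 - best) * f[-1]   # F_{t+1} = alpha*y_t + (1-alpha)*F_t
print(f"alpha = {best:.2f}, RMSE = {min(errors):.2f}, forecast = {next_forecast:.2f}")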
❖ Inferential Statistics:
With inferential statistics, we try to reach conclusions that extend
beyond the immediate data alone. This includes different techniques
of estimation and testing of hypotheses.
Run Test:
DEFINITION: A run is defined as a sequence of like events, items
or symbols that is preceded and/or followed by an event, item or
symbol of a different type, or by none at all.
For example:
1) Outcomes of tossing a coin:
HHTTTHHHHHTHHH
Here we have 5 runs.
2) Sex of newly born babies:
MFFMMFFFFMMMMMF
Here we have 6 runs.
Length of a run = number of symbols/events/items in the run.
We can test the randomness of a sequence using runs.
Too many or too few runs indicate a lack of randomness in the
sequence.
For example: HHHHHHHTTTTTT - number of runs is 2
HTHTHTHTHT - number of runs is 10
Hypothesis:
H0: The sequence (sample) is random
H1: The sequence (sample) is not random
Test Procedure:
1. Convert the observations to + or - signs:
   + if the observation > M0 (or any cutoff point)
   - if the observation < M0 (or any cutoff point)
2. Let n1 be the number of symbols of one type and n2 the number of
symbols of the other type; these are used for testing the hypothesis of
randomness of the sample.
Test Statistic:
R = number of runs
Decision Rule:
• We reject H0 if there are too many or too few runs in the sequence.
• For a test of significance level α, reject H0 if R ≤ C1 or R ≥ C2,
where C1 and C2 are critical values such that P(R ≤ C1 or R ≥ C2 | H0) = α;
they can be obtained from the table for given n1, n2 and level of
significance α.
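A small sketch of the run test using the normal approximation for the number of runs (suitable when n1 and n2 are not too small); the sequences are made up, and the exact critical-value tables mentioned above would be used instead for small samples.

import numpy as np
from scipy import stats

def runs_test(x, cutoff=None):
    """Wald-Wolfowitz runs test for randomness using the normal approximation."""
    x = np.asarray(x, dtype=float)
    if cutoff is None:
        cutoff = x.mean()                 # SPSS uses the mean (or median) as the cut point
    signs = x >= cutoff                   # '+' if observation >= cutoff, '-' otherwise
    n1 = int(signs.sum())
    n2 = int(len(x) - n1)
    runs = 1 + int(np.sum(signs[1:] != signs[:-1]))   # count changes of sign
    mean_r = 1 + 2 * n1 * n2 / (n1 + n2)
    var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mean_r) / np.sqrt(var_r)
    p = 2 * stats.norm.sf(abs(z))         # two-sided p-value
    return runs, z, p

# Example: a plainly alternating (non-random) sequence versus a noisy one
alternating = [1, 0] * 20
rng = np.random.default_rng(5)
noisy = rng.normal(size=40)

print(runs_test(alternating, cutoff=0.5))   # many runs -> large |Z|, small p
print(runs_test(noisy))                     # p above 0.05 -> randomness not rejected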
SECTION – 2
STATISTICAL
ANALYSIS
(2.0) What is statistical analysis?
It’s the science of collecting, exploring and presenting large amounts
of data to discover underlying patterns and trends. Statistics are
applied every day – in research, industry and government – to become
more scientific about decisions that need to be made. For example:
• Manufacturers use statistics to weave quality into beautiful
fabrics, to bring lift to the airline industry and to help guitarists
make beautiful music.
• Researchers keep children healthy by using statistics to analyze
data from the production of viral vaccines, which ensures
consistency and safety.
• Communication companies use statistics to optimize network
resources, improve service and reduce customer churn by
gaining greater insight into subscriber requirements.
• Government agencies around the world rely on statistics for a
clear understanding of their countries, their businesses and their
people.
Look around you. From the tube of toothpaste in your bathroom to the
planes flying overhead, you see hundreds of products and processes
every day that have been improved through the use of statistics.
(2.1) Analysis and Interpretation:
(2.1.1) Descriptive Statistics:
Here, we have explanatory variables such as:
1. Branch
2. Customer Type
3. Gender
4. Product Line
5. Payment Type
We will visualize each explanatory variable, and their interactions,
using summary tables and charts.
1.) Branch-wise distribution of sum of COGS:

              A           B           C           Grand Total
Sum of cogs   101143.21   101140.64   105303.53   307587.38

[Pie chart: Sum of COGS by branch - A: 33%, B: 33%, C: 34%]

INTERPRETATION: From the above pie chart we can observe that
Branches A and B have almost the same sales percentage, while Branch C
sells about 1% more than the other two branches. Statistically, this may
not be a significant difference.
2.) Customer-type-wise distribution of sum of COGS:

              Member      Normal      Grand Total
Sum of cogs   156403.28   151184.10   307587.38

[Pie chart: Sum of COGS by customer type - Member: 51%, Normal: 49%]

INTERPRETATION: From the above pie chart we can observe that the
difference between purchases by member and non-member customers is
only about 2%. This may not be a statistically significant difference.

3.) Gender-wise distribution of sum of COGS:

              Female      Male        Grand Total
Sum of cogs   159888.50   147698.88   307587.38

[Pie chart: Sum of COGS by gender - Female: 52%, Male: 48%]

INTERPRETATION: From the above pie chart we can observe that the
difference between purchases by female and male customers is only
about 4%. This may not be a statistically significant difference.
4.) Product-line-wise distribution of sum of COGS:

Product line             Sum of cogs
Electronic accessories   51750.03
Fashion accessories      51719.90
Food and beverages       53471.28
Health and beauty        46851.18
Home and lifestyle       51297.06
Sports and travel        52497.93
Grand Total              307587.38

[Pie chart: Sum of COGS by product line - Electronic accessories: 17%,
Fashion accessories: 17%, Food and beverages: 17%, Health and beauty: 15%,
Home and lifestyle: 17%, Sports and travel: 17%]

INTERPRETATION: From the above pie chart we can observe that only the
Health and beauty product line sold about 2% less than the other product
lines. This may not be a statistically significant difference.

5.) Payment-type-wise distribution of sum of COGS:

              Cash        Credit card   Ewallet     Grand Total
Sum of cogs   106863.40   95968.64      104755.34   307587.38

[Pie chart: Sum of COGS by payment type - Cash: 35%, Credit card: 31%, Ewallet: 34%]

INTERPRETATION: From the above pie chart we can observe that the
differences between the payment modes are only about 2% to 3%. This
may not be a statistically significant difference.
(2.1.2) Dynamic Graphical Representation:
➢ In this section, all the details can be seen via the link (web app)
given in the webliography section below.
(2.1.3) GLM Univariate Analysis:
➢ In this General Linear Model, we shall focus our discussion on the
procedure for undertaking a Randomized Block Design.
Explanatory variables:
1. Branch
2. Customer type
3. Gender
4. Product line
5. Payment mode
Study variable:
Sales_price
General Linear Model:
Tests of Between-Subjects Effects
Dependent Variable: Sales Price
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
Corrected Model 10224064.111(a) 212 48226.718 .852 .922
Intercept 70388095.609 1 70388095.609 1243.172 .000
Branch 68983.289 2 34491.645 .609 .544
Customer type 519.526 1 519.526 .009 .924
Gender 32336.112 1 32336.112 .571 .450
Product line 190911.735 5 38182.347 .674 .643
Payment 15559.998 2 7779.999 .137 .872
Branch * Customer type 26436.190 2 13218.095 .233 .792
Branch * Gender 2850.638 2 1425.319 .025 .975
Branch * Product line 514667.881 10 51466.788 .909 .524
Branch * Payment 74001.010 4 18500.252 .327 .860
Customer type * Gender 2994.486 1 2994.486 .053 .818
Customer type * Product line 118382.864 5 23676.573 .418 .836
Customer type * Payment 53461.651 2 26730.826 .472 .624
Gender * Product line 543239.866 5 108647.973 1.919 .089
Gender * Payment 104577.252 2 52288.626 .924 .398
Product line * Payment 204762.231 10 20476.223 .362 .963
Branch * Customer type * Gender 138284.818 2 69142.409 1.221 .295
Branch * Customer type * Product line 760263.025 10 76026.303 1.343 .203
Branch * Customer type * Payment 155830.722 4 38957.681 .688 .600
Branch * Gender * Product line 521714.184 10 52171.418 .921 .513
Branch * Gender * Payment 142596.265 4 35649.066 .630 .641
Branch * Product line * Payment 816445.905 20 40822.295 .721 .807
Customer type * Gender * Product line 70397.832 5 14079.566 .249 .941
Customer type * Gender * Payment 125557.756 2 62778.878 1.109 .330
Customer type * Product line * Payment 337879.135 10 33787.914 .597 .817
Gender * Product line * Payment 312538.102 10 31253.810 .552 .853
Branch * Customer type * Gender * Product
line
203047.194 10 20304.719 .359 .964
Branch * Customer type * Gender * Payment 80392.971 4 20098.243 .355 .841
Branch * Customer type * Product line *
Payment
1536469.417 20 76823.471 1.357 .136
Branch * Gender * Product line * Payment 1235395.321 20 61769.766 1.091 .353
Customer type * Gender * Product line *
Payment
742647.567 10 74264.757 1.312 .220
Branch * Customer type * Gender * Product
line * Payment
529382.133 17 31140.125 .550 .927
Error 44559734.909 787 56619.739
Total 149393795.355 1000
Corrected Total 54783799.020 999
The standard error of the model is very high and only the intercept is statistically significant, so it is clear that
the model is insignificant, and all the main effects and interaction effects are also statistically
insignificant.
Hypotheses:
(A) H0: None of the main effects has an effect on sales.
    H1: At least one main effect has an effect on sales.
AND
(B) H0: None of the interaction effects has an effect on sales.
    H1: At least one interaction effect has an effect on sales.
Conclusion:
(A)
Here, the p-values of all main effects are greater than α (0.05); therefore,
we do not reject the null hypothesis at the 5% level of significance.
Hence, none of the main effects has a significant effect on sales.
(B)
Here, the p-values of all interaction effects are greater than α (0.05); therefore,
we do not reject the null hypothesis at the 5% level of significance.
Hence, none of the interaction effects has a significant effect on sales.
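For readers who want to reproduce this kind of analysis outside SPSS, the sketch below shows how the main effects and one example interaction could be tested in Python with statsmodels. The file name is hypothetical and the column names are assumed to follow the attribute list in Section 1.2; the full five-factor model with all interactions, as fitted in SPSS above, would be specified analogously.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical file name; column names assumed from the Kaggle attribute list
df = pd.read_csv("supermarket_sales.csv")
df = df.rename(columns={"Customer type": "Customer_type", "Product line": "Product_line"})

# Main effects of the five categorical factors on sales (cogs), Type III sums of squares
formula = ("cogs ~ C(Branch) + C(Customer_type) + C(Gender) "
           "+ C(Product_line) + C(Payment)")
main_effects = smf.ols(formula, data=df).fit()
print(anova_lm(main_effects, typ=3))

# Adding a two-way interaction, e.g. Gender x Product line
with_interaction = smf.ols(formula + " + C(Gender):C(Product_line)", data=df).fit()
print(anova_lm(with_interaction, typ=3))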
(2.1.4) Time Series Analysis:
• Introduction:
There are two main goals of time series analysis: identifying the
nature of the phenomenon represented by the sequence of
observations, and forecasting (predicting future values of the time
series variable).
• Descriptive:
We have supermarket sales for three branches, located in the cities of
Yangon (Branch A), Mandalay (Branch B) and Naypyitaw (Branch C).
We have sales for 89 days (01-01-2019 to 30-03-2019), named "Cogs1"
for Branch A, "Cogs2" for Branch B and "Cogs3" for Branch C, where
Cogs stands for "Cost Of Goods Sold".
Path: SPSS (Analyze → Descriptive Statistics)
Descriptive Statistics
                      cogs1       cogs2       cogs3
N                     89          89          89
Range                 2950.84     3402.15     3459.88
Minimum               148.67      .00         .00
Maximum               3099.51     3402.15     3459.88
Sum                   101143.21   101140.64   105303.53
Mean                  1136.4406   1136.4117   1183.1857
Std. Deviation        703.48064   866.10639   733.52896
Skewness (SE .255)    .780        .862        .781
Kurtosis (SE .506)    -.012       .027        .641
Valid N (listwise)    89
Here, we notice that for Cogs2 and Cogs3 the minimum value is 0 (zero),
because for Cogs2 the sales on 11 Jan, 23 Jan and 1 Feb are zero, and for
Cogs3 the sales on 22 March are zero.
Also, we should notice that the range is very large and the standard
deviation is high.
• Graphical Representation:
Using graphs, we can easily understand the pattern of the data and
decide on the further steps for analyzing our time series data.
Path: SPSS (Analyze → Time Series → Sequence chart)
From the sequence charts we can observe that there is no upward or
downward trend in the data. Also, from the graph patterns, there may
exist seasonal, cyclic or irregular (random) components in the data.
• Seasonal Decomposition:
Note: Here we have only three months (89 days) of data, so we cannot
talk about a seasonal effect. In the seasonal decomposition we check
only the cyclic effect.
→ SAF: Seasonal adjustment factors, representing seasonal variation.
For the multiplicative model, the value 1 represents the absence of
seasonal variation; for the additive model, the value 0 represents the
absence of seasonal variation.
Path: SPSS (Analyze → Time Series → Seasonal Decomposition)
→ SAS: Seasonally adjusted series, representing the original series
with seasonal variations removed. Working with a seasonally adjusted
series, for example, allows a trend component to be isolated and
analyzed independently of any seasonal component.
SAS = original series / SAF of its period
→ STC: Smoothed trend-cycle component, which is a smoothed
version of the seasonally adjusted series that shows both trend and
cyclic components.
STC = SAS / ERR
→ ERR: The residual component of the series for a particular
observation.
✓ Now we examine the decomposition of our data one by one,
checking which type it is.
1) Cogs1:
We observe the STC (smoothed trend-cycle), which displays the trend and cyclic
components of the series.
Conclusion: From the STC graph we observe that there is no cyclic effect.
2) Cogs2:
We observe the STC (smoothed trend-cycle), which displays the trend and cyclic
components of the series.
Conclusion: From the STC graph we observe that there is no cyclic effect.
3) Cogs3:
We observe the STC (smoothed trend-cycle), which displays the trend and cyclic
components of the series.
Conclusion: From the STC graph we observe that there is no cyclic effect.
[Line charts: STC_1, STC_2 and STC_3 (COGS) plotted against days 1-88, each with a fitted linear trend line]
• Regression over Cyclic Dummies:
Now, we try to fit a linear model using cyclic dummies for Cogs. Here
we create dummies for the days of the week. The dummies are as
follows [Path: SPSS (Transform → Compute variable → use "Any" function)]:
D1 = 1 if Sunday, 0 elsewhere
D2 = 1 if Monday, 0 elsewhere
D3 = 1 if Tuesday, 0 elsewhere
D4 = 1 if Wednesday, 0 elsewhere
D5 = 1 if Thursday, 0 elsewhere
D6 = 1 if Friday, 0 elsewhere
1.) Cogs1:

Model Summary (dependent variable: cogs1)
R = .159   R Square = .025   Adjusted R Square = -.046
Std. Error of the Estimate = 719.54104   Durbin-Watson = 2.048
Predictors: (Constant), D7, D6, D5, D4, D2, D1

ANOVA (dependent variable: cogs1)
Source       Sum of Squares   df   Mean Square   F      Sig.
Regression   1095258          6    182542.942    .353   .906
Residual     42454624         82   517739.312
Total        43549881         88

Coefficients (dependent variable: cogs1)
             B          Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   1130.084   207.714              5.441    .000
D1           -34.464    288.047      -.017   -.120    .905   .562        1.779
D2           -158.090   288.047      -.080   -.549    .585   .562        1.779
D3           -.247      288.047      .000    -.001    .999   .562        1.779
D4           -64.600    288.047      -.033   -.224    .823   .562        1.779
D5           98.151     288.047      .050    .341     .734   .562        1.779
D6           219.663    293.751      .107    .748     .457   .578        1.730

Conclusion: We can observe that the model is insignificant, so we can
say that there is no weekly cyclic component in our data.
2.) Cogs2:

Model Summary (dependent variable: cogs2)
R = .278   R Square = .077   Adjusted R Square = .009
Std. Error of the Estimate = 861.98289   Durbin-Watson = 2.083
Predictors: (Constant), D6, D5, D4, D2, D3, D1

ANOVA (dependent variable: cogs2)
Source       Sum of Squares   df   Mean Square   F       Sig.
Regression   5085155          6    847525.902    1.141   .346
Residual     60927190         82   743014.511
Total        66012345         88

Coefficients (dependent variable: cogs2)
             B          Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   1010.691   248.833              4.062    .000
D1           370.938    345.069      .152    1.075    .286   .562        1.779
D2           -79.732    345.069      -.033   -.231    .818   .562        1.779
D3           145.252    345.069      .060    .421     .675   .562        1.779
D4           45.555     345.069      .019    .132     .895   .562        1.779
D5           548.608    345.069      .225    1.590    .116   .562        1.779
D6           -184.078   351.903      -.073   -.523    .602   .578        1.730

Conclusion: We can observe that the model is insignificant, so we can
say that there is no weekly cyclic component in our data.

3.) Cogs3:

Model Summary (dependent variable: cogs3)
R = .231   R Square = .053   Adjusted R Square = -.016
Std. Error of the Estimate = 739.42112   Durbin-Watson = 2.192
Predictors: (Constant), D6, D5, D4, D2, D3, D1

ANOVA (dependent variable: cogs3)
Source       Sum of Squares   df   Mean Square   F      Sig.
Regression   2516722          6    419453.743    .767   .598
Residual     44832975         82   546743.592
Total        47349697         88

Coefficients (dependent variable: cogs3)
             B          Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   867.088    213.452              4.062    .000
D1           427.256    296.005      .207    1.443    .153   .562        1.779
D2           433.704    296.005      .210    1.465    .147   .562        1.779
D3           169.421    296.005      .082    .572     .569   .562        1.779
D4           229.228    296.005      .111    .774     .441   .562        1.779
D5           456.792    296.005      .221    1.543    .127   .562        1.779
D6           484.955    301.867      .227    1.607    .112   .578        1.730

Conclusion: We can observe that the model is insignificant, so we can
say that there is no weekly cyclic component in our data.
• PACF Graph:
Here, we compute the PACF graph. We already checked first-order
autocorrelation in the above regressions using the Durbin-Watson
statistic, and all study variables (Cogs1, Cogs2 and Cogs3) have a DW
statistic near 2, so there is no first-order autocorrelation.
To check higher-order autocorrelation, we simply plot the PACF graph
using SPSS.
Path: SPSS (Analyze → Time Series → Autocorrelations)
Coefficients
a
867.088 213.452 4.062 .000
427.256 296.005 .207 1.443 .153 .562 1.779
433.704 296.005 .210 1.465 .147 .562 1.779
169.421 296.005 .082 .572 .569 .562 1.779
229.228 296.005 .111 .774 .441 .562 1.779
456.792 296.005 .221 1.543 .127 .562 1.779
484.955 301.867 .227 1.607 .112 .578 1.730
(Constant)
D1
D2
D3
D4
D5
D6
Model
1
B Std. Error
Unstandardized
Coefficients
Beta
Standardized
Coefficients
t Sig. Tolerance VIF
Collinearity Statistics
Dependent Variable: cogs3
a.
46
Conclusion: Here, from PACF graph we can observe that our all-
study variable (Cogs1, Cogs2 & Cogs3)has no any order
Autocorrection or we can say 0th
order Autocorrection.
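For illustration, the same PACF check can be drawn outside SPSS. The sketch below assumes the daily series are columns cogs1, cogs2 and cogs3 of the pandas DataFrame df from the earlier sketch; it is not the SPSS output reported above.

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

fig, axes = plt.subplots(3, 1, figsize=(8, 9))
for ax, col in zip(axes, ["cogs1", "cogs2", "cogs3"]):
    # Bars that stay inside the confidence band indicate no significant
    # partial autocorrelation at that lag.
    plot_pacf(df[col], lags=16, ax=ax, title=f"PACF of {col}")
plt.tight_layout()
plt.show()
```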
• Model Fitting:
As analysed above, the study variables show no trend, no seasonality and no cyclic effect. So, we can say that the data are irregular (random). To confirm this, we compute a Run Test on each study variable.
The run test is a statistical test used to determine whether the data obtained from a sample are random, which is why it is called the Run Test for Randomness. Randomness is judged from the number and nature of runs present in the data of interest. (A rough sketch of the same test computed outside SPSS is given below, followed by the SPSS output.)
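The sketch below uses the runs test available in statsmodels and assumes the daily series are columns cogs1, cogs2 and cogs3 of the pandas DataFrame df from the earlier sketch; it is an illustration and will not reproduce the SPSS output exactly.

```python
from statsmodels.sandbox.stats.runs import runstest_1samp

for col in ["cogs1", "cogs2", "cogs3"]:
    # Runs are counted around the mean of the series, as in the SPSS table below.
    z, p = runstest_1samp(df[col], cutoff="mean")
    print(f"{col}: Z = {z:.3f}, two-sided p-value = {p:.3f}")
# Large p-values (> 0.05) mean randomness of the series is not rejected.
```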
Runs Test (SPSS output; test value = mean of the series)
                         cogs1       cogs2       cogs3
Test Value               1136.4406   1136.4117   1183.1857
Cases < Test Value       53          54          43
Cases >= Test Value      36          35          46
Total Cases              89          89          89
Number of Runs           43          45          50
Z                        -.194       .342        .971
Asymp. Sig. (2-tailed)   .846        .733        .331

Conclusion: From the table above, all study variables are insignificant (Asymp. Sig. > 0.05). So, we conclude that the study variables are random.
Therefore, we apply the Simple (Single) Exponential Smoothing model, because the study variables have no trend and no seasonality.
The daily changes in Cogs1, Cogs2 & Cogs3 have no trend, seasonality or cyclic behaviour. There are random fluctuations which do not appear to be very predictable, and no strong patterns that would help with developing a forecasting model.
NON-SEASONAL SIMPLE EXPONENTIAL SMOOTHING
[Path: SPSS (Analyze → Time Series → Create Model → Method: Exponential Smoothing: Simple Non-Seasonal)]
Model Description
Model     Series   Model Type
Model_1   cogs1    Simple
Model_2   cogs2    Simple
Model_3   cogs3    Simple

Model Fit (summary over the three models; P5-P95 are percentiles)
Fit Statistic          Mean       SE         Minimum    Maximum    P5         P10        P25        P50        P75        P90        P95
Stationary R-squared   .529       .015       .516       .546       .516       .516       .516       .526       .546       .546       .546
R-squared              -.015      .004       -.020      -.011      -.020      -.020      -.020      -.014      -.011      -.011      -.011
RMSE                   773.693    88.916     707.504    874.761    707.504    707.504    707.504    738.814    874.761    874.761    874.761
MAPE                   138.842    44.686     94.651     184.007    94.651     94.651     94.651     137.868    184.007    184.007    184.007
MaxAPE                 2641.911   1740.197   679.726    3998.047   679.726    679.726    679.726    3247.960   3998.047   3998.047   3998.047
MAE                    625.655    76.534     580.425    714.021    580.425    580.425    580.425    582.519    714.021    714.021    714.021
MaxAE                  2124.862   180.713    1916.833   2243.021   1916.833   1916.833   1916.833   2214.733   2243.021   2243.021   2243.021
Normalized BIC         13.344     .224       13.174     13.598     13.174     13.174     13.174     13.261     13.598     13.598     13.598
Here, we can observe from the Model Fit table that MAPE reaches 184.007 (its maximum across the three models), i.e. roughly 184% error in the model fitting.
Model Statistics
Model           Number of Predictors   Stationary R-squared   R-squared   Ljung-Box Q(18)   DF   Sig.   Number of Outliers
cogs1-Model_1   0                      .516                   -.011       15.584            17   .553   0
cogs2-Model_2   0                      .526                   -.020       15.208            17   .581   0
cogs3-Model_3   0                      .546                   -.014       14.749            17   .614   0
Forecast
Model                       13 Fri     13 Sat     14 Sun     14 Mon     14 Tue     14 Wed     14 Thu     14 Fri     14 Sat
cogs1-Model_1   Forecast    1165.40    1165.40    1165.40    1165.40    1165.40    1165.40    1165.40    1165.40    1165.40
                UCL         2571.41    2571.46    2571.51    2571.55    2571.60    2571.65    2571.69    2571.74    2571.79
                LCL         -240.62    -240.67    -240.71    -240.76    -240.81    -240.86    -240.90    -240.95    -241.00
cogs2-Model_2   Forecast    1133.98    1133.98    1133.98    1133.98    1133.98    1133.98    1133.98    1133.98    1133.98
                UCL         2872.38    2872.69    2873.01    2873.32    2873.63    2873.94    2874.26    2874.57    2874.88
                LCL         -604.43    -604.74    -605.05    -605.37    -605.68    -605.99    -606.30    -606.62    -606.93
cogs3-Model_3   Forecast    1197.97    1197.97    1197.97    1197.97    1197.97    1197.97    1197.97    1197.97    1197.97
                UCL         2666.20    2666.31    2666.41    2666.51    2666.62    2666.72    2666.82    2666.92    2667.03
                LCL         -270.27    -270.37    -270.48    -270.58    -270.68    -270.78    -270.89    -270.99    -271.09
For each model, forecasts start after the last non-missing value in the range of the requested estimation period, and end at the last period for which non-missing values of all the predictors are available or at the end date of the requested forecast period, whichever is earlier.
That is, all forecasts take the same value, equal to the last estimated level component; this is also called a flat forecast. Simple exponential smoothing has a "flat" forecast function:
ŷ(T+h | T) = ℓ(T),  h = 1, 2, 3, …
where ℓ(T) is the level estimated at the last observation. Here, we forecast up to week 14 using the Simple Exponential Smoothing method.
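As a rough cross-check outside SPSS, simple exponential smoothing can also be fitted with statsmodels. The sketch below assumes the daily series are columns of the pandas DataFrame df from the earlier sketches; the estimated numbers will not match the SPSS output exactly.

```python
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

for col in ["cogs1", "cogs2", "cogs3"]:
    fit = SimpleExpSmoothing(df[col], initialization_method="estimated").fit()
    forecast = fit.forecast(9)   # nine daily steps ahead, as in the table above
    # All forecasted values are identical: the last estimated level (a flat forecast).
    print(col, "alpha =", round(fit.params["smoothing_level"], 3),
          "flat forecast =", round(float(forecast.iloc[0]), 2))
```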
(2.2) Findings:
(2.2.1) Descriptive:
1. We can observe that Branches A and B have almost the same sales percentage, while Branch C sells more than the other two branches.
2. We can observe that Members purchase more than Normal customers.
3. We can observe that Female customers purchase more than Male customers.
4. We can observe that only the Health and Beauty product line sold about 2% less than the other product lines.
5. We can observe that customers prefer to buy products using the Cash payment type.
(2.2.2) Dynamic Graphical Representation:
1. We can observe the results dynamically by hovering on the graph.
2. We can observe that the bar chart race provides information about the different Branches, Cities, Customer types, Genders, Product lines and Payment modes, showing how total sales accumulate as the date increases.
3. We can observe that the pie chart provides information about the different Branches, Cities, Customer types, Genders, Product lines and Payment modes, showing total sales as percentages or proportions.
(2.2.3) Univariate GLM:
1. There is no significant main or interaction effect of the explanatory variables (Branch, Customer type, Gender, Product line and Payment type) on the study variable (Sales_Price).
2. The main aspect that we have considered in our project is the smart application of the General Linear Model with dummy variables. This enables us to decide about the significance of all the main effects, interaction effects and categorical variables at a time. Thus, we can avoid individual testing and save time and energy.
(2.2.4) Time Series Analysis:
1. We can observe that there is no upward or downward trend in the data. Also, from the graph pattern there may exist a seasonal, cyclic or irregular (random) component in the data.
2. From the Seasonal Decomposition we observed that there is no trend or cyclic component in Cogs1, Cogs2 and Cogs3.
3. From the regression over cyclic dummies, we can observe that the model is insignificant. So, we can say that there is no Weekly Cyclic Component in our data.
4. From the PACF graphs we can observe that none of the study variables (Cogs1, Cogs2 & Cogs3) shows significant autocorrelation at any order.
5. The daily changes in Cogs1, Cogs2 & Cogs3 have no trend, seasonality or cyclic behaviour. There are random fluctuations which do not appear to be very predictable, and no strong patterns that would help with developing a forecasting model.
6. In terms of forecasting, simple exponential smoothing
generates a constant set of values. All forecasts equal the
last value of the level component. Consequently, these
forecasts are appropriate only when your time series data
have no trend or seasonality.
(2.3) Limitations:
• This Kaggle supermarket sales data set is not original (natural) sales data, so the statistical analysis may not be justified in real life.
• Also, the data contain product categories rather than actual product names; if we had actual product names, we could also perform a Market Basket Analysis.
• Here, we have only three months of sales (89 days); if we had at least one year of sales, we could also analyze monthly seasonality.
REFERENCES
BOOKS:
• Basic Econometrics by Damodar Gujarati
• Programmed Statistics by B.L. Agarwal
• Statistics in Management Studies by K.K. Sharma
WEBLIOGRAPHY:
• https://epgp.inflibnet.ac.in/Home/ViewSubject?catid=34
• http://www.sussex.ac.uk/its/pdfs/SPSS_Forecasting_22.pdf
• https://gilberttanner.com/blog/turn-your-data-science-script-into-websites-with-streamlit
• https://www.kaggle.com/aungpyaeap/supermarket-sales/ (Data File)
• https://myprojectreport.herokuapp.com/ (Dynamic Graphical Representation)
SOFTWARE:
• SPSS
• SYSTAT
• R SOFTWARE
• MS EXCEL
• PYTHON
THANK YOU
More Related Content

What's hot

BIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxBIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxLSURYAPRAKASHREDDY
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecturepcherukumalla
 
Data Visualization Techniques
Data Visualization TechniquesData Visualization Techniques
Data Visualization TechniquesAllAnalytics
 
Market Basket Analysis
Market Basket AnalysisMarket Basket Analysis
Market Basket AnalysisSandeep Prasad
 
Market Basket Analysis
Market Basket AnalysisMarket Basket Analysis
Market Basket AnalysisNarayan Vyas
 
Analytics & retail analytics
Analytics & retail analyticsAnalytics & retail analytics
Analytics & retail analyticsDale Sternberg
 
Ppt dmart & vishal mega mart
Ppt dmart & vishal mega martPpt dmart & vishal mega mart
Ppt dmart & vishal mega martzarusolanki
 
Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesT.S. Lim
 
Big bazaar and_d-mart
Big bazaar and_d-martBig bazaar and_d-mart
Big bazaar and_d-martSarla Jaiswal
 
Big Data Analytics and its Application in E-Commerce
Big Data Analytics and its Application in E-CommerceBig Data Analytics and its Application in E-Commerce
Big Data Analytics and its Application in E-CommerceUyoyo Edosio
 
Market Segmentation and Market Basket Analysis
Market Segmentation and Market Basket AnalysisMarket Segmentation and Market Basket Analysis
Market Segmentation and Market Basket AnalysisSpotle.ai
 
A comparative study on online and offline shopping
A comparative  study on online and offline shoppingA comparative  study on online and offline shopping
A comparative study on online and offline shoppingSumitKumar801561
 
Swiggy Market Analysis
Swiggy Market AnalysisSwiggy Market Analysis
Swiggy Market AnalysisParvesh Mourya
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerceAlexander Konduforov
 

What's hot (20)

BIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxBIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptx
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Data Visualization Techniques
Data Visualization TechniquesData Visualization Techniques
Data Visualization Techniques
 
Predictive Analytics - An Introduction
Predictive Analytics - An IntroductionPredictive Analytics - An Introduction
Predictive Analytics - An Introduction
 
Market Basket Analysis
Market Basket AnalysisMarket Basket Analysis
Market Basket Analysis
 
Market Basket Analysis
Market Basket AnalysisMarket Basket Analysis
Market Basket Analysis
 
Data Visualization
Data VisualizationData Visualization
Data Visualization
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Analytics & retail analytics
Analytics & retail analyticsAnalytics & retail analytics
Analytics & retail analytics
 
Ppt dmart & vishal mega mart
Ppt dmart & vishal mega martPpt dmart & vishal mega mart
Ppt dmart & vishal mega mart
 
Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in Businesses
 
Big bazaar and_d-mart
Big bazaar and_d-martBig bazaar and_d-mart
Big bazaar and_d-mart
 
Market baasket analysis
Market baasket analysisMarket baasket analysis
Market baasket analysis
 
Big Data Analytics and its Application in E-Commerce
Big Data Analytics and its Application in E-CommerceBig Data Analytics and its Application in E-Commerce
Big Data Analytics and its Application in E-Commerce
 
Data visualization-tools
Data visualization-toolsData visualization-tools
Data visualization-tools
 
7 steps to Predictive Analytics
7 steps to Predictive Analytics 7 steps to Predictive Analytics
7 steps to Predictive Analytics
 
Market Segmentation and Market Basket Analysis
Market Segmentation and Market Basket AnalysisMarket Segmentation and Market Basket Analysis
Market Segmentation and Market Basket Analysis
 
A comparative study on online and offline shopping
A comparative  study on online and offline shoppingA comparative  study on online and offline shopping
A comparative study on online and offline shopping
 
Swiggy Market Analysis
Swiggy Market AnalysisSwiggy Market Analysis
Swiggy Market Analysis
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerce
 

Similar to A Statistical Analysis on Supermarket Sales

Business Plan ICADDY et retail Analytics
Business Plan ICADDY et retail AnalyticsBusiness Plan ICADDY et retail Analytics
Business Plan ICADDY et retail AnalyticsMikaël Monjour
 
Consumer Analytics A Primer
Consumer Analytics A PrimerConsumer Analytics A Primer
Consumer Analytics A Primerijtsrd
 
Shrinking big data for real time marketing strategy - A statistical Report
Shrinking big data for real time marketing strategy - A statistical ReportShrinking big data for real time marketing strategy - A statistical Report
Shrinking big data for real time marketing strategy - A statistical ReportManidipa Banerjee
 
Final Internship Report_Sachin Serigar
Final Internship Report_Sachin SerigarFinal Internship Report_Sachin Serigar
Final Internship Report_Sachin SerigarSachin Serigar
 
24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docx
24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docx24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docx
24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docxdomenicacullison
 
A Study on marketing mix & competitive analysis of “Pure it” (HUL)
A Study on marketing mix & competitive analysis of “Pure it” (HUL)A Study on marketing mix & competitive analysis of “Pure it” (HUL)
A Study on marketing mix & competitive analysis of “Pure it” (HUL)jitu9030394490
 
Customer Experience Improvement: Finding the Right Data Strategy
Customer Experience Improvement: Finding the Right Data StrategyCustomer Experience Improvement: Finding the Right Data Strategy
Customer Experience Improvement: Finding the Right Data Strategysuitecx
 
MicroStrategy BI Solutions for Retail Industry
MicroStrategy BI Solutions for Retail IndustryMicroStrategy BI Solutions for Retail Industry
MicroStrategy BI Solutions for Retail IndustryBiBoard.Org
 
Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...
Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...
Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...MarketResearch.com
 
An impact of knowledge mining on satisfaction of consumers in super bazaars
An impact of knowledge mining on satisfaction of consumers in super bazaarsAn impact of knowledge mining on satisfaction of consumers in super bazaars
An impact of knowledge mining on satisfaction of consumers in super bazaarsIAEME Publication
 
Big Data in Retail (White paper)
Big Data in Retail (White paper)Big Data in Retail (White paper)
Big Data in Retail (White paper)InData Labs
 
Mis and consumer buying behaviour
Mis and consumer buying behaviourMis and consumer buying behaviour
Mis and consumer buying behaviourMayanka Singh
 
Relation of Big Data and E-Commerce
Relation of Big Data and E-CommerceRelation of Big Data and E-Commerce
Relation of Big Data and E-CommerceAnkita Tiwari
 
Big bazaar and_d-mart
Big bazaar and_d-martBig bazaar and_d-mart
Big bazaar and_d-martZameer Mirza
 
Analysis of the Awareness Level of Customers about the different Retailing Te...
Analysis of the Awareness Level of Customers about the different Retailing Te...Analysis of the Awareness Level of Customers about the different Retailing Te...
Analysis of the Awareness Level of Customers about the different Retailing Te...AI Publications
 
Drive your business with predictive analytics
Drive your business with predictive analyticsDrive your business with predictive analytics
Drive your business with predictive analyticsThe Marketing Distillery
 
MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6
MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6
MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6IlonaThornburg83
 
quantitative methods in market research
quantitative methods in market researchquantitative methods in market research
quantitative methods in market researchPaniz Donyadari
 
CONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHY
CONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHYCONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHY
CONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHYArkabrata Bandyapadhyay
 
Hul 101128100726-phpapp01
Hul 101128100726-phpapp01Hul 101128100726-phpapp01
Hul 101128100726-phpapp01Jitender Kumar
 

Similar to A Statistical Analysis on Supermarket Sales (20)

Business Plan ICADDY et retail Analytics
Business Plan ICADDY et retail AnalyticsBusiness Plan ICADDY et retail Analytics
Business Plan ICADDY et retail Analytics
 
Consumer Analytics A Primer
Consumer Analytics A PrimerConsumer Analytics A Primer
Consumer Analytics A Primer
 
Shrinking big data for real time marketing strategy - A statistical Report
Shrinking big data for real time marketing strategy - A statistical ReportShrinking big data for real time marketing strategy - A statistical Report
Shrinking big data for real time marketing strategy - A statistical Report
 
Final Internship Report_Sachin Serigar
Final Internship Report_Sachin SerigarFinal Internship Report_Sachin Serigar
Final Internship Report_Sachin Serigar
 
24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docx
24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docx24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docx
24790 Business Project Final Marketing PROJECT PROPOSALSTEPS .docx
 
A Study on marketing mix & competitive analysis of “Pure it” (HUL)
A Study on marketing mix & competitive analysis of “Pure it” (HUL)A Study on marketing mix & competitive analysis of “Pure it” (HUL)
A Study on marketing mix & competitive analysis of “Pure it” (HUL)
 
Customer Experience Improvement: Finding the Right Data Strategy
Customer Experience Improvement: Finding the Right Data StrategyCustomer Experience Improvement: Finding the Right Data Strategy
Customer Experience Improvement: Finding the Right Data Strategy
 
MicroStrategy BI Solutions for Retail Industry
MicroStrategy BI Solutions for Retail IndustryMicroStrategy BI Solutions for Retail Industry
MicroStrategy BI Solutions for Retail Industry
 
Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...
Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...
Global Open Loop Prepaid Cards Market Intelligence, Innovation, Strategy, and...
 
An impact of knowledge mining on satisfaction of consumers in super bazaars
An impact of knowledge mining on satisfaction of consumers in super bazaarsAn impact of knowledge mining on satisfaction of consumers in super bazaars
An impact of knowledge mining on satisfaction of consumers in super bazaars
 
Big Data in Retail (White paper)
Big Data in Retail (White paper)Big Data in Retail (White paper)
Big Data in Retail (White paper)
 
Mis and consumer buying behaviour
Mis and consumer buying behaviourMis and consumer buying behaviour
Mis and consumer buying behaviour
 
Relation of Big Data and E-Commerce
Relation of Big Data and E-CommerceRelation of Big Data and E-Commerce
Relation of Big Data and E-Commerce
 
Big bazaar and_d-mart
Big bazaar and_d-martBig bazaar and_d-mart
Big bazaar and_d-mart
 
Analysis of the Awareness Level of Customers about the different Retailing Te...
Analysis of the Awareness Level of Customers about the different Retailing Te...Analysis of the Awareness Level of Customers about the different Retailing Te...
Analysis of the Awareness Level of Customers about the different Retailing Te...
 
Drive your business with predictive analytics
Drive your business with predictive analyticsDrive your business with predictive analytics
Drive your business with predictive analytics
 
MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6
MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6
MKT574 v1Strategic Marketing PlanMKT574 v1Page 6 of 6
 
quantitative methods in market research
quantitative methods in market researchquantitative methods in market research
quantitative methods in market research
 
CONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHY
CONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHYCONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHY
CONSUMER BUYING BEHAVIOR BASED ON DEMOGRAPHY
 
Hul 101128100726-phpapp01
Hul 101128100726-phpapp01Hul 101128100726-phpapp01
Hul 101128100726-phpapp01
 

Recently uploaded

DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 

Recently uploaded (17)

DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 

A Statistical Analysis on Supermarket Sales

  • 1. A PROJECT REPORT ON “A STATISTICAL ANALYSIS OF SALES FROM SUPERMARKET” SUBMITTED BY CHAUDHARI SURAJKUMAR D. HIRAPARA HIREN M. MISTRY RADHESH S. NADKARNI SAHIL K. PASI VIPULKUMAR K. PATEL JINAL D. IN PARTIAL FULFILLMENT OF THE DEGREE OF MASTER OF SCIENCE IN STATISTICS GUIDED BY Dr. ARTI RAJYAGURU DEPARTMENT OF STATISTICS VEER NARMAD SOUTH GUJARAT UNIVERSITY SURAT 2021
  • 2. Certificate DEPARTMENT OF STATISTICS VEER NARMAD SOUTH GUJARAT UNIVERSITY, SURAT (Re-Accredited with 'Grade- A' by NAAC) This is to certify that project on "A Statistical Analysis of Sales from Supermarket" submitted by “Chaudhari Suraj D. (Roll No.03), Hirapara Hiren M. (Roll No,05), Mistry Radhesh S. (Roll No.06), Nadkarni Sahil K. (Roll No.07), Pasi Vipul k. (Roll No.12), and Patel Jinal D. (Roll No. 13) students of M.Sc. Statistics (Semester- IV)" for the academic year 2020-21, to the department of Statistics V. N. S. G. University, Surat as a partial fulfillrnent for the degree of M.Sc. (Statistics). PROF. & HEAD Department of Statistics Veer Narmad South Gujarat University Surat-395007
  • 3. THIS PROJECT IS DEDICATED TO THE DEPARTMENT OF STATISTICS ALL OUR PROFESSORS OUR GUIDE AND OUR GROUP
  • 4. ACKNOWLEDGEMENT We are highly grateful to the honorable Dr. A. J. Rajyaguru, The Head of The Department of Statistics, V.N.S.G.U. - Surat, for her ever-helping attitude and encouraging us to achieve excel in studies.She has not only made us to work but also guided us to orient towards research. We are also thankful to the entire staff of the department and to all those who have helped us or supported us directly or indirectly. This acknowledgment will not be complete until we pay our gratitude to our family, whose enthusiasm to see this work complete was as infectious as their inspiration. (CHAUDHARI SURAJKUMAR D.) (HIRAPARA HIREN M.) (MISTRY RADHESH S.) (NADKARNI SAHIL K.) (PASI VIPULKUMAR K.) (PATEL JINAL D.)
  • 5. DECLARATION We, the students of M.Sc. (Statistics) [Semester IV] of the Department of Statistics at VEER NARMAD SOUTH GUJARAT UNIVERSITY, Surat, hereby declare that we have completed our project entitled “A STATISTICAL ANALYSIS OFSALES FROM SUPERMARKET”in the Academic year 2020-21. The information submitted here in is true and original to the best of our knowledge. (CHAUDHARI SURAJKUMAR D.) (HIRAPARA HIREN M.) (MISTRY RADHESH S.) (NADKARNI SAHIL K.) (PASI VIPULKUMAR K.) (PATEL JINAL D.)
  • 6. INDEX SECTIONS TITLE PAGE NO 1 INTRODUCTION 1 1.0 INTRODUCTION OF SUPERMARKET 2 1.1 OBJECTIVES OF THE STUDY 4 1.2DATA COLLECTION 5 1.3STATISTICAL TECHNIQUES 8 2 STATISTICAL ANALYSIS 31 2.0 WHAT IS STATISTICAL ANALYSIS? 32 2.1 ANALYSIS AND INTERPRETATION 33 2.2FINDINGS 50 2.3LIMITATIONS 53 • REFERENCES 54
  • 8. 2 Statistics plays a vital role in our day-to-day life. Statistics has been defined by different authors for a variety of definitions. Also, the varied and outstanding contribution of Prof. “R.A. Fisher” put the subject of statistics on a very firm footing and earned the status of fully fledged science. According to BOWLEY “Statistics are numerical statement of facts in any department of equity placed in relation to each other.” “Statistics is the grammar of science.” “A research journal serves that narrow borderland which separates the known from the unknown.” - Prasanta C. Mahalanobis (1.0)Introduction of Supermarket : What is Supermarket…? A supermarket is a self-service shop offering a wide variety of food, beverages and household products, organized into sections. It is larger and has a wider selection than earlier grocery stores, but is smaller and more limited in the range of merchandise than a hypermarket or big-box market. In everyday INDIA usage, however, "grocery store" is a synonym for supermarket, and is not used to refer to other types of stores that sell groceries. Supermarkets typically are chain stores, supplied by the distribution centers of their parent companies, thus increasing opportunities for economies of scale. Supermarkets usually offer products at relatively low prices by using their buying power to buy goods from manufacturers at lower prices than smaller stores can. They also minimize financing costs by paying for goods at least 30 days after
  • 9. 3 receipt and some extract credit terms of 90 days or more from vendors. Certain products (typically staple foods such as bread, milk and sugar) are very occasionally sold as loss leaders so as to attract shoppers to their store. Supermarkets make up for their low margins by a high volume of sales, and with of higher-margin items bought by the attracted shoppers. Self-service with shopping carts (trolleys) or baskets reduces labor costs, and many supermarket chains are attempting further reduction by shifting to self-service check-out. Real-time data allows grocery stores and supermarkets to forecast the potential sales and demand of their items through predictive analytics, highlighting which items are in demand and those to discard. Essentially, they use so-called Recency, Frequency, Value (or RFV) analysis to look at the transactional behavior of their customers and to score customers using a combination of how often they shop, how many items they purchase and how much they spend. Some famous Supermarket in India like 7-Eleven, BigBazaar, D- Mart, Easyday, Foodworld, HyperCity, Lulu Hypermarket and Maveli Stores. Here, we include some picture of Supermarket…
  • 10. 4 (1.1)Objective of the study : 1) To visualize how explanatory variablei.e., Branch, Customer type, Gender, Product line and Payment type affect to study variable sales. 2) To check Main and Interaction effect of explanatory variable on sales. 3) Fitting appropriate Time Series Model and analyzing study variable sales.
  • 11. 5 (1.2)Data Collection : Data collection: We have used secondary data. The Data of supermarket sales were collected from Kaggle website. The data were collected from 1 January 2019 to 30 march 2019. The analysis of supermarket is done on the base of these secondary data. Information about Kaggle website: Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and artificial intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 march 2017, Google announced that they were acquiring Kaggle. Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learningpractitioners. Kaggle allows users to find and publish data sets, explore and build models in a wed-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. Kaggle services edit Machine Learning Competitions:This was Kaggle’s first product. Companies post problem and machine learners compete to build the best algorithm, typically with cash prizes. Kaggle kernels: a cloud-based workbench for data science and machine learning. Allows data scientists to share code and analysis in python, R
  • 12. 6 and R markdown. Over 150k “kernels”(code snippets) have been shared on Kaggle covering everything from sentiment analysis to object detection. Public datasets platform: community members share datasets with each other. Has datasets on everything from bone x-rays to results from boxing bouts. Kaggle learn: a platform for all education in manageable chunks. Type: subsidiary Industry: data science Founded: April 2010 Founder: Anthony Goldbloom, Ben Hamner Headquarters:San Francisco, United states Key people: Anthony Goldblood (CEO), Ben Hamner (CTO), Jeff Moser (chief architect) Products: Competitions, Kaggle kernels, Kaggle datasets, Kaggle learn Owner: Alphabet inc. (2017 present) Parent: Google (2017 present) Data Attribute information: Invoice id: computer generated sales slip invoice identification number Branch: branch of supercenters (3 branches are available identified by A, B, C) City: location of supercenters Customer type: Type of customers, recorded by member for customers using member card and normal for without member card Gender: gender type of customer
  • 13. 7 Product line: general item Categorization groups- electronic accessories, fashion accessories, food and beverages, health and beauty, home and lifestyle, sports and travel Unit price: price of each product in $ Quantity: number of products purchased by customer Tax: 5% tax fee for customer buying Total: total price including tax Date: date of purchase (record available from 1 January 2019 to 30 march 2019) Time: purchase time (10am to 9am) Payment: payment used by customer for purchase (3 method are available – cash, credit card and e-wallet) Cogs: cost of goods sold Gross margin Percentage: gross margin percentage Gross Income: gross income Rating: customer stratification rating on their overall shopping experience(on a scale of 1 to 10) Data Formation Categorical Data Branch, City, Gender, Customer type, Product line, Payment type Numerical Data COGS (Cost of Goods Sold), Rating Date Date, Time
  • 14. 8 (1.3)Statistical Techniques : “Statistics maybe defined as the science of collection, presentation, analysis and interpretation of numerical data.” Thus, we use statistical concepts to analyze the data. The brief introduction of the several methods used to analyze the data is as follows: (1.3.1)Descriptive Statistics: Descriptive statistics are used to describe the basic features of the data in the study. They provide simple summaries about the sample and the measures with simple graphical analysis. Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics we are simply describing what is or what the data shows. We have used some descriptive statistics like frequency distributions, Crosstabs and Charts, Measure of Central Tendency, Measure of Dispersion,Skewness, Kurtosis, Range etc. Cross Tabulation: A cross tabulation displays the joint distribution of two or more variables. They are usually presented as a contingency table in a matrix format which describes the distribution of two or more variables simultaneously. (1.3.2) Graphical Representation: Charts: Once the data have been collected, the crucial problem becomes learning, whatever we can, from the data. Graph is a powerful tool of describing the dataset. A large dataset is required to be presented in
  • 15. 9 graphical form that can capture the structure of underlying data. A quick glance at the picture elucidates the point easily then does a page filled with words and numbers. The term charts as a visual representation of data have multiple meaning. 1. Pie Charts: Pie chart is also called “angular charts”. A circle divided into portions that represent the relative frequencies or percentages of different categories or classes. This chart represents the value of the variable in the relative form of 360o . The area of 360o is divided into slices. 2. Line Charts: Line Charts are used to show the pattern of changes over a period of time, called trend. For Example, to observe the pattern of, change in height of child, closing of stocks, GDP of country. Line charts are used to observe these patterns. (1.3.3) GLM Univariate Analysis: The GLM Univariate procedure provides regression analysis and analysis of variance for one dependent variable by one or more factors and/or variables. The factor variables divide the population into groups. Using this General Linear Model procedure, you can test null hypotheses about the effects of other variables on the means of various groupings of a single dependent variable. You can investigate interactions between factors as well as the effects of individual factors, some of which may be random. In addition, the effects of covariates and covariate interactions with factors can be included. For regression analysis, the explanatory (predictor) variables are specified as covariates. Both balanced and unbalanced models can be tested. A design is balanced if each cell in the model contains the same number of cases. In addition to testing hypotheses, GLM Univariate produces estimates of parameters.
  • 16. 10 Commonly used a priori contrasts are available to perform hypothesis testing. Additionally, after an overall F test has shown significance, you can use post hoc tests to evaluate differences among specific means. Estimated marginal means give estimates of predicted mean values for the cells in the model, and profile plots (interaction plots) of these means allow you to easily visualize some of the relationships. Residuals, predicted values, Cook's distance, and leverage values can be saved as new variables in your data file for checking assumptions. WLS Weight allows you to specify a variable used to give observations different weights for a weighted least-squares (WLS) analysis, perhaps to compensate for a different precision of measurement. Example.Data are gathered for individual runners in the Chicago marathon for several years. The time in which each runner finishes is the dependent variable. Other factors include weather (cold, pleasant, or hot), number of months of training, number of previous marathons, and gender. Age is considered a covariate. You might find that gender is a significant effect and that the interaction of gender with weather is significant. Methods.Type I, Type II, Type III, and Type IV sums of squares can be used to evaluate different hypotheses. Type III is the default. Statistics.Post hoc range tests and multiple comparisons: least significant difference, Bonferroni, Sidak, Scheffé, Ryan-Einot- Gabriel-Welsch multiple F, Ryan-Einot-Gabriel-Welsch multiple range, Student-Newman-Keuls, Tukey's honestly significant difference, Tukey's b, Duncan, Hochberg's GT2, Gabriel, Waller- Duncan t test, Dunnett (one-sided and two-sided), Tamhane's T2, Dunnett's T3, Games-Howell, and Dunnett's C. Descriptive statistics: observed means, standard deviations, and counts for all of the dependent variables in all cells. The Levene test for homogeneity of variance. Plots.Spread-versus-level, residual, and profile (interaction).
  • 17. 11 ➢ GLM Univariate Data Considerations Data. The dependent variable is quantitative. Factors are categorical. They can have numeric values or string values of up to eight characters. Covariates are quantitative variables that are related to the dependent variable. Assumptions.The data are a random sample from a normal population; in the population, all cell variances are the same. Analysis of variance is robust to departures from normality, although the data should be symmetric. To check assumptions, you can use homogeneity of variances tests and spread-versus-level plots. You can also examine residuals and residual plots. ➢ To Obtain GLM Univariate Tables From the menus choose: Analyze General Linear Model Univariate... Select a dependent variable. Select variables for Fixed Factor(s), Random Factor(s), and Covariate(s), as appropriate for your data. Optionally, you can use WLS Weight to specify a weight variable for weighted least-squares analysis. If the value of the weighting variable is zero, negative, or missing, the case is excluded from the analysis. A variable already used in the model cannot be used as a weighting variable.
  • 18. 12 ▪ GLM Model Specify Model. A full factorial model contains all factor main effects, all covariate main effects, and all factor-by-factor interactions. It does not contain covariate interactions. Select Custom to specify only a subset of interactions or to specify factor-by-covariate interactions. You must indicate all of the terms to be included in the model. Factors and Covariates. The factors and covariates are listed with (F) for fixed factor and (C) for covariate. In a Univariate analysis, (R) indicates a random factor. Model. The model depends on the nature of your data. After selecting Custom, you can select the main effects and interactions that are of interest in your analysis. Sum of squares. The method of calculating the sums of squares. For balanced or unbalanced models with no missing cells, the Type III sum-of-squares method is most commonly used. Include intercept in model. The intercept is usually included in the model. If you can assume that the data pass through the origin, you can exclude the intercept. ▪ Specifying Models for GLM From the menus choose: Analyze General Linear Model Choose Univariate or Multivariate. In the dialog box, click Model. In the Model dialog box, select Custom.
  • 19. 13 Select one or more factors or covariates or a combination of factors and covariates. Select a method for building the terms and click the move button. Repeat until you have all of the terms that you want in the model. Do not use the same term more than once in the model. Select a type of sums of squares and whether or not you want the intercept. GLM Multivariate is available only if you have the Advanced Models option installed. ▪ Build Terms For the selected factors and covariates: Interaction. Creates the highest-level interaction term of all selected variables. This is the default. Main effects. Creates a main-effects term for each variable selected. All 2-way. Creates all possible two-way interactions of the selected variables. All 3-way. Creates all possible three-way interactions of the selected variables. All 4-way. Creates all possible four-way interactions of the selected variables. All 5-way. Creates all possible five-way interactions of the selected variables.
  • 20. 14 ▪ Sum of Squares For the model, you can choose a type of sums of squares. Type III is the most commonly used and is the default. Type I. This method is also known as the hierarchical decomposition of the sum-of-squares method. Each term is adjusted for only the term that precedes it in the model. Type I sums of squares are commonly used for: • A balanced ANOVA model in which any main effects are specified before any first-order interaction effects, any first- order interaction effects are specified before any second-order interaction effects, and so on. • A polynomial regression model in which any lower-order terms are specified before any higher-order terms. • A purely nested model in which the first-specified effect is nested within the second-specified effect, the second-specified effect is nested within the third, and so on. (This form of nesting can be specified only by using syntax.) Type II. This method calculates the sums of squares of an effect in the model adjusted for all other "appropriate" effects. An appropriate effect is one that corresponds to all effects that do not contain the effect being examined. The Type II sum-of-squares method is commonly used for: • A balanced ANOVA model. • Any model that has main factor effects only. • Any regression model. • A purely nested design. (This form of nesting can be specified by using syntax.) Type III. The default. This method calculates the sums of squares of an effect in the design as the sums of squares adjusted for any other effects that do not contain it and orthogonal to any effects (if any) that contain it. The Type III sums of squares have one major advantage in
  • 21. 15 that they are invariant with respect to the cell frequencies as long as the general form of estimability remains constant. Hence, this type of sums of squares is often considered useful for an unbalanced model with no missing cells. In a factorial design with no missing cells, this method is equivalent to the Yates' weighted-squares-of-means technique. The Type III sum-of-squares method is commonly used for: • Any models listed in Type I and Type II. • Any balanced or unbalanced model with no empty cells. Type IV. This method is designed for a situation in which there are missing cells. For any effect F in the design, if F is not contained in any other effect, then Type IV = Type III = Type II. When F is contained in other effects, Type IV distributes the contrasts being made among the parameters in F to all higher-level effects equitably. The Type IV sum-of-squares method is commonly used for: • Any models listed in Type I and Type II. • Any balanced model or unbalanced model with empty cells. ▪ GLM Contrasts Contrasts are used to test for differences among the levels of a factor. You can specify a contrast for each factor in the model (in a repeated measures model, for each between-subjects factor). Contrasts represent linear combinations of the parameters. GLM Univariate Hypothesis testing is based on the null hypothesis LB = 0, where L is the contrast coefficients matrix and B is the parameter vector. When a contrast is specified, SPSS creates an L matrix in which the columns corresponding to the factor match the contrast. The remaining columns are adjusted so that the L matrix is estimable.
  • 22. 16 The output includes an F statistic for each set of contrasts. Also displayed for the contrast differences are Bonferroni-type simultaneous confidence intervals based on Student's t distribution. ▪ Design of Experiment : One-Way Analysis of Variance (ANOVA) examines the differences between more than two explanatory samples. One-way ANOVA is used when you have a categorical explanatory variable (with two or more categories) and a normally distributed interval or ratio dependent variable.  Analysis of Variance can be done using one of the following experimental designs:- (1)Completely randomized Design:- In this experimental design, there is only one explanatory variable with two or more levels (also called treatments or classifications) and the difference in mean scores of these two or more explanatory populations is examined. (2) Randomized Block Design:- In this experimental design, there is one explanatory variable with two or more levels, and a second level variable, called blocking variable, which the researcher wants to control. (3)Factorial Design: - There are two or more explanatory level variables in this experimental design. Every level of an explanatory variable is studied for each level of remaining (other) explanatory variables. In factorial design, the impact of two (or more) explanatory variables is examined simultaneously. If there are two explanatoryvariables, we use Two- way ANOVA.
  • 23. 17 (1.3.4) Time Series Analysis: ➢ Introduction: A time series is a set of observations obtained by measuring a single variable regularly over a period of time. In a series of inventory data, for example, the observations might represent daily inventory levels for several months. A series showing the market share of a product might consist of weekly market share taken over a few years. A series of total sales figures might consist of one observation per month for many years. What each of these examples has in common is that some variable was observed at regular, known intervals over a certain length of time. Thus, the form of the data for a typical time series is a single sequence or list of observations representing measurements taken at regular intervals. One of the most important reasons for doing time series analysis is to try to forecast future values of the series. A model of the series that explained the past values may also predict whether and how much the next few values will increase or decrease. The ability to make such predictions successfully is obviously important to any business or scientific field. ➢ Utility of Time Series Analysis: The analysis of Time series is useful in many areas, such as econometrics, commerce, business, meteorology, demography, for the reasons given below: 1. It gives a general description of the past behavior of the series: By recording data over a period of time one can easily understand the changes that have a taken place in the past. In Table 1. Daily inventory time series Time Week Day Inventory level t1 1 Monday 160 t2 1 Tuesday 135 t3 1 Wednesday 129 t4 1 Thursday 122 t5 1 Friday 108 t6 2 Monday 150 ... t60 12 Friday 120
  • 24. 18 other words, time series enables us to study the past behavior of a phenomenon under consideration. 2. It helps in forecasting the future behavior on the basis of past behavior: A very important use of time series analysis is to make forecast about the likely value of a variable in future if the past behavior continues. The important of forecasting in business and econometrics fields lies on account of its role in planning and administration. 3. It facilitates comparison: Once the time series data are recorded the comparison between the values of the variable at different time points becomes handy. It helps to compare variations in the values of a variable over time and analyses the causes of such variations. 4. It helps in the evaluation of current accomplishments: The analysis of time series data greatly helps us in the review and evaluation of progress made in various economies, business and social activities, for example, the progress of five-year plans may be judged by studying the variations in the yearly rates of growth in gross national product (GNP), similarly, the variations in the general price level indicate the changes in the value of money over a period of time. ❖ Components of Time Series: Empirical studies of a number of time series have revealed thepresence of certain characteristic movements or fluctuations in atime series these characteristic movements of fluctuations in a time series. These characteristic movements of a time series may be classified in four different categories called components of time series. In a long time series, generally we have the following four components: 1. Secular Trend or long-term movements 2. Seasonal variations 3. Cyclic variations 4. Random or Irregular movements
  • 25. 19 1.)Secular Trend: The word trend means ‘tendency’. So, secular trend is that component of the time series which gives the general tendency of the data for a long period. It is smooth, regular and long-term movement of a series. The steady growth of the same status for a particular commodity of a company or the fall of demand for a certain article for long years can be studied through secular trend. Do note that rapid fluctuations cannot give the trend. Growth of population in a locality over decades is a good example of secular trend. 2.)Seasonal Variations: If we observe the sale structure of clothes in the market, we will find that the sale curve is not uniform throughout the year. It shows different trend in different seasons. It depends entirely on the locality and the people who reside there. It can also be seen that each and every year, sale structure is more or less same as the previous year in those periods. So, this component occurs uniformly and regularly. This variation is periodic in nature and regular in character. 3.)Cyclic variations: Apart from seasonal variations, there is another type of fluctuation which usually lasts for more than a year. This fluctuation is the effect of business cycles. In every business there are four important phases- I) prosperity, ii) decline, iii) depression, and v) improvement or regain. The time from prosperity to regain is a complete cycle. So, this cycle will never show regular periodicity. A period of a cycle may differ but, importantly, the sequence of changes should be more or less regular and it is this fact of regularity which enables us to study cyclical fluctuations.
• 26. 20 4.) Random or Irregular movements: These are, as the name suggests, totally unpredictable. The effects of floods, droughts, famines, earthquakes, etc. are known as irregular variations. All variations excluding trend, seasonal and cyclical variations are irregular. Sometimes, though, cyclical fluctuations too can be triggered by natural calamities.
➢ Seasonal Decomposition:
The Seasonal Decomposition procedure decomposes a series into a seasonal component, a combined trend-and-cycle component, and an "error" component. The procedure is an implementation of Census Method I, otherwise known as the ratio-to-moving-average method.
Example. A scientist is interested in analyzing monthly measurements of the ozone level at a particular weather station. The goal is to determine whether there is any trend in the data. In order to uncover any real trend, the scientist first needs to account for the variation in readings due to seasonal effects. The Seasonal Decomposition procedure can be used to remove any systematic seasonal variations. The trend analysis is then performed on the seasonally adjusted series.
Statistics. The set of seasonal factors.
Data. The variables should be numeric.
Assumptions. The variables should not contain any embedded missing data. At least one periodic date component must be defined. For instructions on handling missing data, see the topic on replacing missing values.
Here we will discuss the multiplicative and additive models. The analysis of a time series is the decomposition of a time series into its different components for their separate study. The process of analyzing a time series is to isolate and measure its various
• 27. 21 components. We try to answer the following questions when we analyze a time series.
1. What would have been the value of the variable at different points of time if it were influenced only by long-term movements?
2. What changes occur in the value of the variable due to seasonal variations?
3. To what extent and in what direction has the variable been affected by cyclical fluctuations?
4. What has been the effect of irregular variations?
The study of a time series is mainly required for estimation and forecasting. An ideal forecast should be based on forecasts of the various types of fluctuations: separate forecasts should be made for the trend, seasonal and cyclical variations, while no reliable forecast can be made for the irregular movements. Therefore, it is necessary to separate and measure the various types of fluctuations present in a time series. A value of the time series variable is considered as the resultant of the combined impact of its components. The components of a time series follow either the multiplicative or the additive model. Let Y = original observation, T = trend component, S = seasonal component, C = cyclical component, and I = irregular component.
Multiplicative Model: It is assumed that the value Y of a composite series is the product of the four components, that is
Y = T × S × C × I,
where T is given in the original units of Y, but S, C, and I are expressed as percentage (unit-less) index numbers.
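The lines below are a minimal Python sketch, not part of the original report, of decomposing a series under the multiplicative model Y = T × S × C × I just described; statsmodels' moving-average based seasonal_decompose is assumed to stand in for the Census Method I procedure, and the additive variant (discussed next) differs only in the model argument. The monthly figures are hypothetical.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# hypothetical monthly series indexed by date
y = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
     115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140],
    index=pd.date_range("2019-01-01", periods=24, freq="MS"),
)

result = seasonal_decompose(y, model="multiplicative", period=12)
print(result.trend)     # T (combined trend-cycle), in the original units of Y
print(result.seasonal)  # S, an index around 1 for the multiplicative model
print(result.resid)     # what remains (C and I combined) after removing T and S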
• 28. 22 Additive Model: It is assumed that the value Y of a composite series is the sum of the four components, that is
Y = T + S + C + I,
where T, S, C, and I are all given in the original units of Y.
Time series analysis is the analysis of a series of data points over time, allowing one to answer questions such as: what is the causal effect on a variable Y of a change in variable X over time? An important difference between time series and cross-section data is that in time series the ordering of cases does matter. Rather than dealing with individuals as units, the unit of interest is time: the value of Y at time t is Yt. The unit of time can be anything from days to election years. The value of Yt in the previous period is called the first lag value, Yt-1; the jth lag is denoted Yt-j. Similarly, Yt+1 is the value of Yt in the next period. So, a simple bivariate regression equation for time series data looks like:
Yt = β0 + β1Xt + ut
Yt is treated as a random variable. If Yt is generated by a regression model for time series, i.e., Yt = xtβ + εt with E(εt | xt) = 0, then ordinary least squares (OLS) provides a consistent estimate of β.
➢ Regression:
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as explanatory variables).
Regression Explained
The two basic types of regression are simple linear regression and multiple linear regression, although there are non-linear regression
• 29. 23 methods for more complicated data and analyses. Simple linear regression uses one explanatory variable to explain or predict the outcome of the dependent variable Y, while multiple linear regression uses two or more explanatory variables to predict the outcome. Regression can help finance and investment professionals as well as professionals in other businesses. Regression can also help predict sales for a company based on weather, previous sales, GDP growth, or other types of conditions. The general form of each type of regression is:
• Simple linear regression: Y = a + bX + u
• Multiple linear regression: Y = a + b1X1 + b2X2 + ... + bkXk + u
Where:
• Y = the variable that you are trying to predict (dependent variable).
• X = the variable that you are using to predict Y (explanatory variable).
• a = the intercept.
• b = the slope.
• u = the regression residual.
➢ Dummy Technique:
In general, the explanatory variables in any regression analysis are assumed to be quantitative in nature. For example, variables like temperature, distance and age are quantitative in the sense that they are recorded on a well-defined scale. In many applications, however, the variables cannot be measured on a well-defined scale; they are qualitative in nature. For example, variables like sex (male or female), colour (black, white), nationality and employment status (employed, unemployed) are defined on a nominal scale. Such variables do not have any natural
• 30. 24 scale of measurement. Such variables usually indicate the presence or absence of a "quality" or an attribute, like employed or unemployed, graduate or non-graduate, smoker or non-smoker, yes or no, acceptance or rejection, so they are defined on a nominal scale. Such variables can be quantified by artificially constructing variables that take the values 1 and 0, where "1" usually indicates the presence of the attribute and "0" usually indicates its absence. For example, "1" may indicate that the person is male and "0" that the person is female; similarly, "1" may indicate that the person is employed and "0" that the person is unemployed. Such variables classify the data into mutually exclusive categories. These variables are called indicator variables or dummy variables. Usually, the indicator variables take on the values 0 and 1 to identify the mutually exclusive classes of the explanatory variables. For example,
D = { 1, if person is male; 0, if person is female }
D = { 1, if person is employed; 0, if person is unemployed }
Here we use the notation D in place of X to denote the dummy variable. The choice of 1 and 0 to identify a category is arbitrary. For example, one could equally define the dummy variables in the above examples as
D = { 1, if person is female; 0, if person is male }
D = { 1, if person is unemployed; 0, if person is employed }
It is also not necessary to choose only 1 and 0 to denote the categories; in fact, any two distinct values of D will serve the purpose. The choices of 1 and 0 are preferred as they make the calculations simple, help in the
• 31. 25 easy interpretation of the values and usually turn out to be a satisfactory choice. In a given regression model, qualitative and quantitative variables can also occur together, i.e., some variables may be qualitative and others quantitative. When all the explanatory variables are
- quantitative, the model is called a regression model;
- qualitative, the model is called an analysis of variance model; and
- both quantitative and qualitative, the model is called an analysis of covariance model.
Such models can be dealt with within the framework of regression analysis, and the usual tools of regression analysis can be used in the case of dummy variables.
➢ PACF Plot:
In time series analysis, the partial autocorrelation function (PACF) gives the partial correlation of a stationary time series with its own lagged values, after regressing out the values of the time series at all shorter lags. It contrasts with the autocorrelation function, which does not control for the other lags. This function plays an important role in data analyses aimed at identifying the extent of the lag in an autoregressive model. Its use was introduced as part of the Box-Jenkins approach to time series modelling, whereby, by plotting the partial autocorrelation function, one can determine the appropriate lag p in an AR(p) model or in an extended ARIMA(p, d, q) model.
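The lines below are a minimal Python sketch, not part of the original report, of the kind of PACF plot used later in the analysis to look for autocorrelation; statsmodels and matplotlib are assumed to be available, and the series is simulated AR(1) data purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):              # simulate y_t = 0.6*y_{t-1} + noise
    y[t] = 0.6 * y[t - 1] + rng.normal()

plot_pacf(y, lags=20)                # spikes outside the band suggest AR lags
plt.show()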
• 32. 26 ➢ Simple Exponential Smoothing:
Exponential smoothing is the most widely used class of procedures for smoothing discrete time series in order to forecast the immediate future. The idea of exponential smoothing is to smooth the original series the way the moving average does and to use the smoothed series in forecasting future values of the variable of interest. In exponential smoothing, however, we want to allow the more recent values of the series to have greater influence on the forecast of future values than the more distant observations. Exponential smoothing is a simple and pragmatic approach to forecasting, whereby the forecast is constructed from an exponentially weighted average of past observations. The largest weight is given to the present observation, less weight to the immediately preceding observation, even less weight to the observation before that, and so on (exponential decay of the influence of past data).
Non-Seasonal Simple Exponential Smoothing
This is one of the most widely used of all forecasting techniques and requires little computation. The method is used when the data pattern is approximately horizontal (i.e., there is neither cyclic variation nor a pronounced trend in the historical data). Let an observed time series be y1, y2, …, yn. Formally, the simple exponential smoothing equation takes the form
St+1 = αyt + (1-α)St
where
Si → the smoothed value of the time series at time i
yi → the actual value of the time series at time i
α → the smoothing constant
• 33. 27 In the case of simple exponential smoothing, the smoothed statistic is the forecast value:
Ft+1 = αyt + (1-α)Ft
where
Ft+1 → forecast value of the time series at time t+1
Ft → forecast value of the time series at time t
This means:
Ft = αyt-1 + (1-α)Ft-1
Ft-1 = αyt-2 + (1-α)Ft-2
Ft-2 = αyt-3 + (1-α)Ft-3
Ft-3 = αyt-4 + (1-α)Ft-4
Substituting,
Ft+1 = αyt + (1-α)Ft
     = αyt + (1-α)(αyt-1 + (1-α)Ft-1)
     = αyt + α(1-α)yt-1 + (1-α)²Ft-1
     = αyt + α(1-α)yt-1 + α(1-α)²yt-2 + (1-α)³Ft-2
     = αyt + α(1-α)yt-1 + α(1-α)²yt-2 + α(1-α)³yt-3 + (1-α)⁴Ft-3
Generalizing, the series of weights used in producing the forecast Ft is α, α(1-α), α(1-α)², α(1-α)³, …. These weights decline toward zero in an exponential fashion; thus, as we go back in the series, each value has a smaller weight in terms of its effect on the forecast. The exponential decline of the weights towards zero is evident.
• 34. 28 Choosing α: After the model is specified, its performance characteristics should be verified or validated by comparing its forecasts with historical data for the process it was designed to forecast. We can use error measures such as the MAPE (mean absolute percentage error), MSE (mean square error) or RMSE (root mean square error), and α is chosen such that the error is minimum. Usually the MSE or RMSE is used as the criterion for selecting an appropriate smoothing constant: for instance, by assigning values from 0.1 to 0.99, we select the value that produces the smallest MSE or RMSE.
The simple exponential smoothing method is used for time series data with no trend or seasonality. In this method, a single smoothing factor or coefficient alpha (α) decides the influence of past values on the forecast. If α is closer to 1, the forecast is impacted more by the most recent values than by the older values; the opposite is true if α is close to 0.
Simple. This model is appropriate for series in which there is no trend or seasonality. Its only smoothing parameter is the level. Simple exponential smoothing is most similar to an ARIMA model with zero orders of autoregression, one order of differencing, one order of moving average, and no constant.
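The lines below are a minimal Python sketch, not part of the original report, of the recursion Ft+1 = αyt + (1-α)Ft and of choosing α over a grid by the smallest RMSE, as described above; the sales figures are hypothetical.

import numpy as np

def ses_forecasts(y, alpha):
    """One-step-ahead forecasts F_1..F_n, with the first forecast initialised to y_0."""
    f = np.empty(len(y))
    f[0] = y[0]
    for t in range(1, len(y)):
        f[t] = alpha * y[t - 1] + (1 - alpha) * f[t - 1]
    return f

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

y = np.array([1200.0, 980.0, 1430.0, 1105.0, 890.0, 1310.0, 1240.0, 1005.0])

grid = np.arange(0.10, 1.00, 0.01)                 # candidate smoothing constants
best_alpha = min(grid, key=lambda a: rmse(y, ses_forecasts(y, a)))
f = ses_forecasts(y, best_alpha)
print("alpha =", round(best_alpha, 2), "RMSE =", round(rmse(y, f), 2))
print("next-period (flat) forecast:", round(best_alpha * y[-1] + (1 - best_alpha) * f[-1], 2))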
• 35. 29 ❖ Inferential Statistics:
With inferential statistics, we try to reach conclusions that extend beyond the immediate data alone. This includes different techniques of estimation and testing of hypotheses.
Run Test:
DEFINITION: A run is defined as a sequence of like events, items or symbols that is preceded and/or followed by an event, item or symbol of a different type, or by none at all.
For example:
1) Outcomes of tossing a coin: HHTTTHHHHHTHHH — here we have 5 runs.
2) Sex of newly born babies: MFFMMFFFFMMMMMF — here we have 6 runs.
Length of a run = number of symbols/events/items in the run.
We can test the randomness of a sequence using runs. Too many or too few runs indicate a lack of randomness in the sequence. For example:
HHHHHHHTTTTTT — number of runs is 2
HTHTHTHTHT — number of runs is 10
Hypothesis:
H0: The sequence (sample) is random
H1: The sequence (sample) is not random
• 36. 30 Test Procedure:
1. Convert the observations to + or - signs:
+ if observation > M0 (or any other cut-off point)
- if observation < M0 (or any other cut-off point)
2. Let n1 be the number of symbols of one type and n2 the number of symbols of the other type; these are used for testing the hypothesis of randomness of the sample.
Test Statistic: R = number of runs
Decision Rule:
• We reject H0 if there are too many or too few runs in the sequence.
• For a test of significance level α, reject H0 if r ≤ C1 or r ≥ C2, where C1 and C2 are critical values such that P(R ≤ C1 or R ≥ C2 | H0) = α; they can be obtained from the table for given n1, n2 and level of significance α.
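The lines below are a minimal Python sketch, not part of the original report, of the run test using the large-sample normal approximation (the same Z and two-sided asymptotic p-value that SPSS reports later in this report), with the series dichotomised about its mean; the data are randomly generated stand-ins.

import numpy as np
from scipy import stats

def runs_test(x, cutoff=None):
    x = np.asarray(x, dtype=float)
    cutoff = x.mean() if cutoff is None else cutoff
    signs = x >= cutoff                        # True/False symbols
    n1, n2 = signs.sum(), (~signs).sum()
    r = 1 + np.sum(signs[1:] != signs[:-1])    # number of runs = sign changes + 1
    mean_r = 2 * n1 * n2 / (n1 + n2) + 1
    var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (r - mean_r) / np.sqrt(var_r)
    p = 2 * (1 - stats.norm.cdf(abs(z)))       # two-sided p-value
    return r, z, p

sales = np.random.default_rng(1).normal(1100, 700, size=89)  # stand-in series
print(runs_test(sales))   # a large p-value gives no evidence against randomness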
  • 38. 32 (2.0)What is statistical analysis? It’s the science of collecting, exploring and presenting large amounts of data to discover underlying patterns and trends. Statistics are applied every day – in research, industry and government – to become more scientific about decisions that need to be made. For example: • Manufacturers use statistics to weave quality into beautiful fabrics, to bring lift to the airline industry and to help guitarists make beautiful music. • Researchers keep children healthy by using statistics to analyze data from the production of viral vaccines, which ensures consistency and safety. • Communication companies use statistics to optimize network resources, improve service and reduce customer churn by gaining greater insight into subscriber requirements. • Government agencies around the world rely on statistics for a clear understanding of their countries, their businesses and their people. Look around you. From the tube of toothpaste in your bathroom to the planes flying overhead, you see hundreds of products and processes every day that have been improved through the use of statistics.
• 39. 33 (2.1) Analysis and Interpretation:
(2.1.1) Descriptive Statistics:
Here, we have explanatory variables such as:
1. Branch
2. Customer Type
3. Gender
4. Product Line
5. Payment Type
We visualize every single explanatory variable, and interactions among them, using summary tables and charts; a short sketch of how such shares can be computed is given after the first breakdown below.
1.) Branch wise distribution of sum of cogs:
                A           B           C           Grand Total
Sum of cogs     101143.21   101140.64   105303.53   307587.38
[Pie chart: Sum of cogs by branch: A 33%, B 33%, C 34%]
INTERPRETATION: From the above pie chart we can observe that Branch A and Branch B have almost the same sales percentage, while Branch C sells about 1% more than the other two branches. Statistically, this may not be a significant difference.
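The lines below are a minimal Python sketch, not part of the original report, of how the branch-wise totals and percentage shares shown in these tables and pie charts could be computed; the file and column names ("supermarket_sales.csv", "Branch", "cogs") are assumptions about the Kaggle data, and the same pattern applies to customer type, gender, product line and payment.

import pandas as pd

df = pd.read_csv("supermarket_sales.csv")

totals = df.groupby("Branch")["cogs"].sum()     # sum of cogs per branch
shares = 100 * totals / totals.sum()            # percentage share per branch

print(totals.round(2))
print(shares.round(1))
# shares.plot.pie(autopct="%.0f%%")             # pie chart, as in the report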
• 40. 34 2.) Customer type wise distribution of sum of cogs:
                Member      Normal      Grand Total
Sum of cogs     156403.28   151184.10   307587.38
[Pie chart: Sum of cogs by customer type: Member 51%, Normal 49%]
INTERPRETATION: From the above pie chart we can observe that the difference in purchases between member and non-member customers is only 2%. That may not be a statistically significant difference.
3.) Gender wise distribution of sum of cogs:
                Female      Male        Grand Total
Sum of cogs     159888.50   147698.88   307587.38
[Pie chart: Sum of cogs by gender: Female 52%, Male 48%]
• 41. 35 INTERPRETATION: From the above pie chart we can observe that the difference in purchases between male and female customers is only 4%. That may not be a statistically significant difference.
4.) Product line wise distribution of sum of cogs:
                Electronic    Fashion       Food and     Health and   Home and     Sports and   Grand Total
                accessories   accessories   beverages    beauty       lifestyle    travel
Sum of cogs     51750.03      51719.90      53471.28     46851.18     51297.06     52497.93     307587.38
[Pie chart: Sum of cogs by product line: Electronic accessories 17%, Fashion accessories 17%, Food and beverages 17%, Health and beauty 15%, Home and lifestyle 17%, Sports and travel 17%]
INTERPRETATION: From the above pie chart we can observe that only the Health and beauty product line sold about 2% less than the other product lines. That may not be a statistically significant difference.
5.) Payment type wise distribution of sum of cogs:
                Cash        Credit card   Ewallet     Grand Total
Sum of cogs     106863.40   95968.64      104755.34   307587.38
[Pie chart: Sum of cogs by payment type: Cash 35%, Credit card 31%, Ewallet 34%]
• 42. 36 INTERPRETATION: From the above pie chart we can observe that the differences between the payment modes are only 2-3%. They may not be statistically significant differences.
(2.1.2) Dynamic Graphical Representation:
➢ In this section, all the details can be seen via the link (web app) given in the webliography section below.
(2.1.3) GLM Univariate Analysis:
➢ In this General Linear Model, we focus our discussion on the procedure for undertaking a randomized block design.
Explanatory Variables:
1. Branch
2. Customer Type
3. Gender
4. Product line
5. Payment mode
Study Variable: Sales_price
General Linear Model:
• 43. 37 Tests of Between-Subjects Effects (Dependent Variable: Sales Price)
Source                                                              Type III Sum of Squares   df    Mean Square     F          Sig.
Corrected Model                                                     10224064.111              212   48226.718       .852       .922
Intercept                                                           70388095.609              1     70388095.609    1243.172   .000
Branch                                                              68983.289                 2     34491.645       .609       .544
Customer type                                                       519.526                   1     519.526         .009       .924
Gender                                                              32336.112                 1     32336.112       .571       .450
Product line                                                        190911.735                5     38182.347       .674       .643
Payment                                                             15559.998                 2     7779.999        .137       .872
Branch * Customer type                                              26436.190                 2     13218.095       .233       .792
Branch * Gender                                                     2850.638                  2     1425.319        .025       .975
Branch * Product line                                               514667.881                10    51466.788       .909       .524
Branch * Payment                                                    74001.010                 4     18500.252       .327       .860
Customer type * Gender                                              2994.486                  1     2994.486        .053       .818
Customer type * Product line                                        118382.864                5     23676.573       .418       .836
Customer type * Payment                                             53461.651                 2     26730.826       .472       .624
Gender * Product line                                               543239.866                5     108647.973      1.919      .089
Gender * Payment                                                    104577.252                2     52288.626       .924       .398
Product line * Payment                                              204762.231                10    20476.223       .362       .963
Branch * Customer type * Gender                                     138284.818                2     69142.409       1.221      .295
Branch * Customer type * Product line                               760263.025                10    76026.303       1.343      .203
Branch * Customer type * Payment                                    155830.722                4     38957.681       .688       .600
Branch * Gender * Product line                                      521714.184                10    52171.418       .921       .513
Branch * Gender * Payment                                           142596.265                4     35649.066       .630       .641
Branch * Product line * Payment                                     816445.905                20    40822.295       .721       .807
Customer type * Gender * Product line                               70397.832                 5     14079.566       .249       .941
Customer type * Gender * Payment                                    125557.756                2     62778.878       1.109      .330
Customer type * Product line * Payment                              337879.135                10    33787.914       .597       .817
Gender * Product line * Payment                                     312538.102                10    31253.810       .552       .853
Branch * Customer type * Gender * Product line                      203047.194                10    20304.719       .359       .964
Branch * Customer type * Gender * Payment                           80392.971                 4     20098.243       .355       .841
Branch * Customer type * Product line * Payment                     1536469.417               20    76823.471       1.357      .136
Branch * Gender * Product line * Payment                            1235395.321               20    61769.766       1.091      .353
Customer type * Gender * Product line * Payment                     742647.567                10    74264.757       1.312      .220
Branch * Customer type * Gender * Product line * Payment            529382.133                17    31140.125       .550       .927
Error                                                               44559734.909              787   56619.739
Total                                                               149393795.355             1000
Corrected Total                                                     54783799.020              999
The standard error of the estimate is very high and only the intercept is statistically significant, so it is clear that the overall model is insignificant and that all the main effects and interaction effects are also statistically insignificant.
• 44. 38 Hypotheses:
(A) H0: None of the main effects has a significant effect on sales.
H1: At least one main effect has a significant effect on sales.
AND
(B) H0: None of the interaction effects has a significant effect on sales.
H1: At least one interaction effect has a significant effect on sales.
Conclusion:
(A) Here, the p-values of all the main effects are greater than α (α = 0.05); therefore, the data do not provide enough evidence to reject the null hypothesis at the 5% level of significance. Hence, none of the main effects has a significant effect on sales.
(B) Here, the p-values of all the interaction effects are greater than α (α = 0.05); therefore, the data do not provide enough evidence to reject the null hypothesis at the 5% level of significance. Hence, none of the interaction effects has a significant effect on sales.
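The lines below are a minimal Python sketch, not the report's SPSS run, of a univariate GLM for Sales_price on the five categorical factors using statsmodels; the file and column names ("supermarket_sales.csv", "Sales_price", etc.) are assumptions about the Kaggle data, and Sum contrasts are used to approximate SPSS's Type III tests. Replacing "+" with "*" in the formula would also cross the factors and add every interaction term shown in the table above.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("supermarket_sales.csv")
df = df.rename(columns={"Customer type": "Customer_type",
                        "Product line": "Product_line"})

# main-effects model with effect (Sum) coding for each categorical factor
formula = ("Sales_price ~ C(Branch, Sum) + C(Customer_type, Sum) "
           "+ C(Gender, Sum) + C(Product_line, Sum) + C(Payment, Sum)")
model = smf.ols(formula, data=df).fit()
print(sm.stats.anova_lm(model, typ=3).round(3))   # Type III sums of squares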
• 45. 39 (2.1.4) Time Series Analysis:
• Introduction: A time series is a set of observations obtained by measuring a single variable regularly over a period of time. In a series of inventory data, for example, the observations might represent daily inventory levels for several months. A series showing the market share of a product might consist of weekly market shares taken over a few years. A series of total sales figures might consist of one observation per month for many years. What each of these examples has in common is that some variable was observed at regular, known intervals over a certain length of time. Thus, the form of the data for a typical time series is a single sequence or list of observations representing measurements taken at regular intervals. There are two main goals of time series analysis: identifying the nature of the phenomenon represented by the sequence of observations, and forecasting (predicting future values of the time series variable).
• Descriptive: We have supermarket sales for three branches, located in the cities Yangon (Branch A), Mandalay (Branch B) and Naypyitaw (Branch C). We have sales for 89 days (01-01-2019 to 30-03-2019), named "Cogs1" for Branch A, "Cogs2" for Branch B and "Cogs3" for Branch C, where Cogs stands for "Cost Of Goods Sold".
Path: SPSS (Analyze → Descriptive Statistics)
Descriptive Statistics
         N    Range     Minimum   Maximum   Sum         Mean        Std. Deviation   Skewness (Std. Error)   Kurtosis (Std. Error)
cogs1    89   2950.84   148.67    3099.51   101143.21   1136.4406   703.48064        .780 (.255)             -.012 (.506)
cogs2    89   3402.15   .00       3402.15   101140.64   1136.4117   866.10639        .862 (.255)             .027 (.506)
cogs3    89   3459.88   .00       3459.88   105303.53   1183.1857   733.52896        .781 (.255)             .641 (.506)
Valid N (listwise): 89
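The lines below are a minimal Python sketch, not part of the original report, of how the three daily series Cogs1-Cogs3 analysed here could be built and summarised; the file and column names ("supermarket_sales.csv", "Date", "Branch", "cogs") are assumptions about the Kaggle data, and days with no transactions are filled with zero, matching the zero-sales days noted below.

import pandas as pd

df = pd.read_csv("supermarket_sales.csv", parse_dates=["Date"])

daily = (df.groupby(["Date", "Branch"])["cogs"].sum()
           .unstack("Branch")                    # one column per branch
           .asfreq("D")                          # complete daily index
           .fillna(0.0))                         # zero sales on missing days
daily.columns = ["cogs1", "cogs2", "cogs3"]      # Branch A, B, C

print(daily.describe())                          # count, mean, std, min, max, ...
print(daily.skew(), daily.kurt())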
• 46. 40 Here, we notice that for Cogs2 and Cogs3 the minimum value is 0 (zero), because in Cogs2 the sales on 11-Jan, 23-Jan and 1-Feb are zero, and in Cogs3 the sale on 22-March is zero. We also notice that the range is very large and the standard deviation is very high.
• Graphical Representation:
Using graphs, we can easily understand the pattern of the data and decide on the further steps to analyze our time series data.
Path: SPSS (Analyze → Time Series → Sequence chart)
• 47. 41 From all the above graphs we can observe that there is no upward or downward trend in the data. Also, from the graph pattern there may exist a seasonal, cyclic or irregular (random) component in the data.
• Seasonal Decomposition:
Note: Here we have only three months (89 days) of data, so we cannot talk about a seasonal effect. In the seasonal decomposition we check only for a cyclic effect.
Path: SPSS (Analyze → Time Series → Seasonal Decomposition)
→ SAF: Seasonal adjustment factors, representing seasonal variation. For the multiplicative model, the value 1 represents the absence of seasonal variation; for the additive model, the value 0 represents the absence of seasonal variation.
→ SAS: Seasonally adjusted series, representing the original series with seasonal variations removed. Working with a seasonally adjusted series, for example, allows a trend component to be isolated and analyzed independently of any seasonal component. For the multiplicative model, SAS = Original series / SAF of its period.
→ STC: Smoothed trend-cycle component, which is a smoothed version of the seasonally adjusted series that shows both trend and cyclic components; STC = SAS / ERR.
→ ERR: The residual component of the series for a particular observation.
✓ Now, we observe the decomposition of our data one by one and check which type it is…
• 48. 42 1) Cogs1: Now, we observe the STC (smoothed trend-cycle), which displays the trend and cyclic components of the series.
[Chart: STC_1 against days, with fitted linear trend line]
Conclusion: From the above graph we observe that there is no cyclic effect.
2) Cogs2: Here, we observe the STC (smoothed trend-cycle), which displays the trend and cyclic components of the series.
[Chart: STC_2 against days, with fitted linear trend line]
Conclusion: From the above graph we observe that there is no cyclic effect.
3) Cogs3: Here, we observe the STC (smoothed trend-cycle), which displays the trend and cyclic components of the series.
[Chart: STC_3 against days, with fitted linear trend line]
Conclusion: From the above graph we observe that there is no cyclic effect.
• 49. 43 • Regression over Cyclic Dummy:
Now, we try to fit a linear model over Cogs using cyclic dummies. Here we create dummies over the days of the week, as follows… [Path: SPSS (Transform → Compute variable → use "Any" function)]
D1 = { 1, if Sunday; 0, elsewhere }
D2 = { 1, if Monday; 0, elsewhere }
D3 = { 1, if Tuesday; 0, elsewhere }
D4 = { 1, if Wednesday; 0, elsewhere }
D5 = { 1, if Thursday; 0, elsewhere }
D6 = { 1, if Friday; 0, elsewhere }
1.) Cogs1:
Model Summary (dependent variable: cogs1; predictors: (Constant), D7, D6, D5, D4, D2, D1)
R = .159, R Square = .025, Adjusted R Square = -.046, Std. Error of the Estimate = 719.54104, Durbin-Watson = 2.048
ANOVA (dependent variable: cogs1)
Regression: Sum of Squares 1095258, df 6, Mean Square 182542.942, F = .353, Sig. = .906
Residual: Sum of Squares 42454624, df 82, Mean Square 517739.312
Total: Sum of Squares 43549881, df 88
Coefficients (dependent variable: cogs1)
             B           Std. Error   Beta     t        Sig.    Tolerance   VIF
(Constant)   1130.084    207.714               5.441    .000
D1           -34.464     288.047      -.017    -.120    .905    .562        1.779
D2           -158.090    288.047      -.080    -.549    .585    .562        1.779
D3           -.247       288.047      .000     -.001    .999    .562        1.779
D4           -64.600     288.047      -.033    -.224    .823    .562        1.779
D5           98.151      288.047      .050     .341     .734    .562        1.779
D6           219.663     293.751      .107     .748     .457    .578        1.730
Conclusion: Here, we can observe that the model is insignificant. So, we can say that there is no weekly cyclic component in our data.
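The lines below are a minimal Python sketch, not the report's SPSS run, of the same kind of weekly-dummy regression fitted to the Branch A daily series; the file and column names ("supermarket_sales.csv", "Date", "Branch", "cogs") are assumptions about the Kaggle data, statsmodels' C() builds the day-of-week dummies, and the Durbin-Watson statistic checks first-order autocorrelation as reported above.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("supermarket_sales.csv", parse_dates=["Date"])
daily = df[df["Branch"] == "A"].groupby("Date")["cogs"].sum().reset_index()
daily["weekday"] = daily["Date"].dt.day_name()   # day-of-week factor

model = smf.ols("cogs ~ C(weekday)", data=daily).fit()
print(model.summary())                                   # R², ANOVA F, coefficients
print("Durbin-Watson:", round(durbin_watson(model.resid), 3))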
• 50. 44 2.) Cogs2:
Model Summary (dependent variable: cogs2; predictors: (Constant), D6, D5, D4, D2, D3, D1)
R = .278, R Square = .077, Adjusted R Square = .009, Std. Error of the Estimate = 861.98289, Durbin-Watson = 2.083
ANOVA (dependent variable: cogs2)
Regression: Sum of Squares 5085155, df 6, Mean Square 847525.902, F = 1.141, Sig. = .346
Residual: Sum of Squares 60927190, df 82, Mean Square 743014.511
Total: Sum of Squares 66012345, df 88
Coefficients (dependent variable: cogs2)
             B           Std. Error   Beta     t        Sig.    Tolerance   VIF
(Constant)   1010.691    248.833               4.062    .000
D1           370.938     345.069      .152     1.075    .286    .562        1.779
D2           -79.732     345.069      -.033    -.231    .818    .562        1.779
D3           145.252     345.069      .060     .421     .675    .562        1.779
D4           45.555      345.069      .019     .132     .895    .562        1.779
D5           548.608     345.069      .225     1.590    .116    .562        1.779
D6           -184.078    351.903      -.073    -.523    .602    .578        1.730
Conclusion: Here, we can observe that the model is insignificant. So, we can say that there is no weekly cyclic component in our data.
3.) Cogs3:
Model Summary (dependent variable: cogs3; predictors: (Constant), D6, D5, D4, D2, D3, D1)
R = .231, R Square = .053, Adjusted R Square = -.016, Std. Error of the Estimate = 739.42112, Durbin-Watson = 2.192
ANOVA (dependent variable: cogs3)
Regression: Sum of Squares 2516722, df 6, Mean Square 419453.743, F = .767, Sig. = .598
Residual: Sum of Squares 44832975, df 82, Mean Square 546743.592
Total: Sum of Squares 47349697, df 88
• 51. 45 Coefficients (dependent variable: cogs3)
             B           Std. Error   Beta     t        Sig.    Tolerance   VIF
(Constant)   867.088     213.452               4.062    .000
D1           427.256     296.005      .207     1.443    .153    .562        1.779
D2           433.704     296.005      .210     1.465    .147    .562        1.779
D3           169.421     296.005      .082     .572     .569    .562        1.779
D4           229.228     296.005      .111     .774     .441    .562        1.779
D5           456.792     296.005      .221     1.543    .127    .562        1.779
D6           484.955     301.867      .227     1.607    .112    .578        1.730
Conclusion: Here, we can observe that the model is insignificant. So, we can say that there is no weekly cyclic component in our data.
• PACF Graph:
Here, we compute the PACF graph. We have already checked first-order autocorrelation in the above regressions using the Durbin-Watson statistic, and all the study variables (Cogs1, Cogs2 and Cogs3) have DW statistics near 2, so there is no first-order autocorrelation. To check for higher-order autocorrelation, we simply plot the PACF graph using SPSS.
Path: SPSS (Analyze → Time Series → Autocorrelation)
• 52. 46 Conclusion: Here, from the PACF graphs we can observe that none of our study variables (Cogs1, Cogs2 & Cogs3) shows autocorrelation at any lag; in other words, there is only zero-order autocorrelation.
• Model Fitting:
As we analysed our study variables above, we noticed that they show no trend, no seasonality and no cyclic effect. So, we can say that our data are irregular or random. To confirm this, we compute a run test on our study variables. The run test is a statistical test used to determine whether the data obtained from a sample are random, which is why it is called the run test for randomness. Randomness of the data is determined based on the number and nature of runs present in the data of interest.
• 53. 47 Runs Test
                          cogs1       cogs2       cogs3
Test Value (Mean)         1136.4406   1136.4117   1183.1857
Cases < Test Value        53          54          43
Cases >= Test Value       36          35          46
Total Cases               89          89          89
Number of Runs            43          45          50
Z                         -.194       .342        .971
Asymp. Sig. (2-tailed)    .846        .733        .331
Conclusion: From the table above we can see that the run test is insignificant for all the study variables, so we conclude that our study variables are random. Accordingly, we apply a simple (single) exponential smoothing model, because the study variables show no trend and no seasonality. The daily changes in Cogs1, Cogs2 & Cogs3 have no trend, seasonality or cyclic behaviour; there are random fluctuations which do not appear to be very predictable, and no strong patterns that would help with developing a forecasting model.
NON-SEASONAL SIMPLE EXPONENTIAL SMOOTHING
Path: SPSS (Analyze → Time Series → Create Model → Method: Exponential Smoothing: Simple Non-Seasonal)
Model Description
Model_1   cogs1   Simple
Model_2   cogs2   Simple
Model_3   cogs3   Simple
Model Fit
Fit Statistic          Mean       SE         Minimum    Maximum    P5         P10        P25        P50        P75        P90        P95
Stationary R-squared   .529       .015       .516       .546       .516       .516       .516       .526       .546       .546       .546
R-squared              -.015      .004       -.020      -.011      -.020      -.020      -.020      -.014      -.011      -.011      -.011
RMSE                   773.693    88.916     707.504    874.761    707.504    707.504    707.504    738.814    874.761    874.761    874.761
MAPE                   138.842    44.686     94.651     184.007    94.651     94.651     94.651     137.868    184.007    184.007    184.007
MaxAPE                 2641.911   1740.197   679.726    3998.047   679.726    679.726    679.726    3247.960   3998.047   3998.047   3998.047
MAE                    625.655    76.534     580.425    714.021    580.425    580.425    580.425    582.519    714.021    714.021    714.021
MaxAE                  2124.862   180.713    1916.833   2243.021   1916.833   1916.833   1916.833   2214.733   2243.021   2243.021   2243.021
Normalized BIC         13.344     .224       13.174     13.598     13.174     13.174     13.174     13.261     13.598     13.598     13.598
• 54. 48 Here, we can observe from the Model Fit table that the MAPE goes up to 184.007, which means there is about 184% error in the model fitting.
Model Statistics
Model            Predictors   Stationary R-squared   R-squared   Ljung-Box Q(18)   DF   Sig.   Outliers
cogs1-Model_1    0            .516                   -.011       15.584            17   .553   0
cogs2-Model_2    0            .526                   -.020       15.208            17   .581   0
cogs3-Model_3    0            .546                   -.014       14.749            17   .614   0
Forecast
Model / quantity      13 Fri    13 Sat    14 Sun    14 Mon    14 Tue    14 Wed    14 Thu    14 Fri    14 Sat
cogs1 Forecast        1165.40   1165.40   1165.40   1165.40   1165.40   1165.40   1165.40   1165.40   1165.40
cogs1 UCL             2571.41   2571.46   2571.51   2571.55   2571.60   2571.65   2571.69   2571.74   2571.79
cogs1 LCL             -240.62   -240.67   -240.71   -240.76   -240.81   -240.86   -240.90   -240.95   -241.00
cogs2 Forecast        1133.98   1133.98   1133.98   1133.98   1133.98   1133.98   1133.98   1133.98   1133.98
cogs2 UCL             2872.38   2872.69   2873.01   2873.32   2873.63   2873.94   2874.26   2874.57   2874.88
cogs2 LCL             -604.43   -604.74   -605.05   -605.37   -605.68   -605.99   -606.30   -606.62   -606.93
cogs3 Forecast        1197.97   1197.97   1197.97   1197.97   1197.97   1197.97   1197.97   1197.97   1197.97
cogs3 UCL             2666.20   2666.31   2666.41   2666.51   2666.62   2666.72   2666.82   2666.92   2667.03
cogs3 LCL             -270.27   -270.37   -270.48   -270.58   -270.68   -270.78   -270.89   -270.99   -271.09
For each model, forecasts start after the last non-missing value in the range of the requested estimation period, and end at the last period for which non-missing values of all the predictors are available or at the end date of the requested forecast period, whichever is earlier.
• 55. 49 That is, all forecasts take the same value, equal to the last level component; this is also called a flat forecast. Simple exponential smoothing has a "flat" forecast function:
ŷ(t+h | t) = ŷ(t+1 | t), for h = 2, 3, …
Here, we forecast up to week 14 using the simple exponential smoothing method.
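The lines below are a minimal Python sketch, not the report's SPSS Expert Modeler run, of fitting non-seasonal simple exponential smoothing to each branch's daily cogs and producing the flat forecasts discussed above; the file and column names ("supermarket_sales.csv", "Date", "Branch", "cogs") are assumptions about the Kaggle data.

import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

df = pd.read_csv("supermarket_sales.csv", parse_dates=["Date"])
daily = (df.groupby(["Date", "Branch"])["cogs"].sum()
           .unstack("Branch").asfreq("D").fillna(0.0))

for branch in daily.columns:                          # branches A, B, C
    fit = SimpleExpSmoothing(daily[branch]).fit()     # alpha chosen by minimising SSE
    fc = fit.forecast(9)                              # flat: all nine values are equal
    print(branch, "alpha =", round(fit.params["smoothing_level"], 3))
    print(fc.round(2))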
• 56. 50 (2.2) Findings:
(2.2.1) Descriptive:
1. We can observe that Branch A and Branch B have almost the same sales percentage, while Branch C sells slightly more than the other two branches.
2. We can observe that members purchase more than normal customers.
3. We can observe that female customers purchase more than male customers.
4. We can observe that only the Health and beauty product line sold about 2% less than the other product lines.
5. We can observe that customers prefer to buy products using the cash payment type.
(2.2.2) Dynamic Graphical Representation:
1. We can observe the dynamic results by hovering over the graphs.
2. We can observe that the bar chart race provides information about the different branches, cities, customer types, genders, product lines and payment modes, giving the total sales information as it accumulates by date.
3. We can observe that the pie chart provides information about the different branches, cities, customer types, genders, product lines and payment modes, giving the total sales information as percentages or proportions.
• 57. 51 (2.2.3) Univariate GLM:
1. There is no significant main or interaction effect of the explanatory variables (Branch, Customer type, Gender, Product line and Payment type) on the study variable (Sales_price).
2. The main aspect that we have considered in our project is the smart application of the General Linear Model with dummy variables. This enables us to decide about the significance of all the main effects, interaction effects and categorical variables at a time. Thus, we can avoid individual testing and save our time and energy.
(2.2.4) Time Series Analysis:
1. We can observe that there is no upward or downward trend in the data. Also, from the graph pattern there may exist a seasonal, cyclic or irregular (random) component in the data.
2. From the seasonal decomposition we observed that there is no trend or cyclic component in Cogs1, Cogs2 and Cogs3.
3. From the regression over cyclic dummies, we can observe that the models are insignificant. So, we can say that there is no weekly cyclic component in our data.
4. From the PACF graphs we can observe that none of our study variables (Cogs1, Cogs2 & Cogs3) shows autocorrelation at any lag; in other words, there is only zero-order autocorrelation.
• 58. 52 5. The daily changes in Cogs1, Cogs2 & Cogs3 have no trend, seasonality or cyclic behaviour. There are random fluctuations which do not appear to be very predictable, and no strong patterns that would help with developing a forecasting model.
6. In terms of forecasting, simple exponential smoothing generates a constant set of values: all forecasts equal the last value of the level component. Consequently, these forecasts are appropriate only when the time series data have no trend or seasonality.
• 59. 53 (2.3) Limitations:
• This Kaggle supermarket sales data is not original or natural sales data, so the statistical analysis may not be justified in real life.
• Also, the data contain product categories rather than actual product names; if we had the actual products, we could also perform a market basket analysis.
• Here, we have only three months of sales (89 days); if we had at least one year of sales, we could also analyze monthly seasonality.
  • 60. 54 REFERENCE BOOKS: • Basic Econometrics by Damodar Gujarati • Programme Statistics by B.L. Agrawal • Statistics In Management Studies by K.K. Sharma WEBLIOGRAPHY: • https://epgp.inflibnet.ac.in/Home/ViewSubject?catid=34 • http://www.sussex.ac.uk/its/pdfs/SPSS_Forecasting_22.pdf • https://gilberttanner.com/blog/turn-your-data-science-script- into-websites-with-streamlit • https://www.kaggle.com/aungpyaeap/supermarket-sales/ (Data File) • https://myprojectreport.herokuapp.com/ (Dynamic Graphical Representation) SOFTWARES: • SPSS • SYSTAT • R SOFTWARE • MS EXCEL • PYTHON