2. Data reduction - breaking down large sets
of data into more-manageable groups or
segments that provide better insight.
◦ Data sampling
◦ Data cleaning
◦ Data transformation
◦ Data segmentation
◦ Dimension reduction
4. Data sampling - extract a sample of data that is
relevant to the business problem under
consideration.
◦ A population includes all of the entities of interest in a study.
◦ A sample is a subset of the population, often randomly chosen
and preferably representative of the population as a whole.
Statistical inference focuses on drawing conclusions
about populations from samples.
◦ Estimation of population parameters
◦ Hypothesis testing – involves drawing conclusions about the
value of the parameters of one or more populations based on
sample data.
5. Sampling plan - a description of the approach
that is used to obtain samples from a population
prior to any data collection activity.
A sampling plan states:
its objectives
target population
population frame (the list from which the sample is
selected)
operational procedures for collecting data
statistical tools for data analysis
6. Example: A company wants to understand how golfers
might respond to a membership program that provides
discounts at golf courses.
◦ Objective - estimate the proportion of golfers who
would join the program
◦ Target population - golfers over 25 years old
◦ Population frame - golfers who purchased equipment
at particular stores
◦ Operational procedures - e-mail link to survey or direct-
mail questionnaire
◦ Statistical tools - PivotTables to summarize data by
demographic groups and estimate likelihood of joining
the program
7. Subjective sampling methods
◦ Judgment sampling – expert judgment is used to select the sample
◦ Convenience sampling – samples are selected based on the ease
with which the data can be collected
Probabilistic sampling methods
◦ Simple random sampling involves selecting items from a population
so that every subset of a given size has an equal chance of being
selected
◦ Systematic (periodic) sampling – a sampling plan that selects
every nth item from the population.
◦ Stratified sampling – applies to populations that are divided into
natural subsets (called strata) and allocates the appropriate
proportion of samples to each stratum.
◦ ...
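The slides that follow implement these methods in Excel and XLMiner; as a language-neutral illustration, here is a minimal Python sketch of the three probabilistic methods (the population frame, column names, and sample sizes are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical population frame: 1,000 customers with a region attribute
population = pd.DataFrame({
    "customer_id": range(1000),
    "region": rng.choice(["North", "South", "East", "West"], size=1000),
})

# Simple random sampling: every subset of size n is equally likely
simple = population.sample(n=50, random_state=42)

# Systematic (periodic) sampling: every nth item after a random start
n = 20
start = int(rng.integers(0, n))
systematic = population.iloc[start::n]

# Stratified sampling: sample within each stratum (region) proportionally
stratified = population.groupby("region", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=42)
)
```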
8. We can determine the appropriate sample size
needed to estimate the population parameter
within a specified level of precision (± E).
Sample size for the mean: n ≥ (z_{α/2})^2 σ^2 / E^2
Sample size for the proportion: n ≥ (z_{α/2})^2 π(1 − π) / E^2
where z_{α/2} is the standard normal critical value and π is a planning estimate of the proportion.
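A minimal Python sketch of these two standard formulas (the confidence level, σ, planning proportion, and E values are illustrative):

```python
from math import ceil
from scipy.stats import norm

def sample_size_mean(sigma, E, conf=0.95):
    # n >= (z_{alpha/2} * sigma / E)^2 to estimate a mean within +/- E
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil((z * sigma / E) ** 2)

def sample_size_proportion(p, E, conf=0.95):
    # n >= z^2 * p * (1 - p) / E^2 to estimate a proportion within +/- E
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * p * (1 - p) / E ** 2)

# e.g., estimate a proportion to within +/- 3 points at 95% confidence,
# using p = 0.5 as the most conservative planning value
print(sample_size_proportion(0.5, 0.03))  # 1068
```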
10. Using the Analysis ToolPak add-in
Data > Analysis > Data Analysis > Sampling
10. Sales Transactions database
Data > Data Analysis > Sampling
Periodic selects every nth number
Random selects a simple random sample
Sampling is done with
replacement so
duplicates may occur.
11. XLMiner can sample from an Excel worksheet
XLMiner > Data > Get Data > Worksheet
12. Credit Risk Data
Click inside the database
XLMiner > Get Data >
Worksheet
Select variables and move
to right pane
Choose sampling options
14. Using sample data alone may limit our ability to predict
uncertain events, because potential values outside the
range of the sample data are not represented.
A better approach is to identify the underlying probability
distribution from which sample data come by “fitting” a
theoretical distribution to the data and verifying the
goodness of fit statistically.
◦ Examine a histogram for clues about the distribution’s shape
◦ Look at summary statistics such as the mean, median, standard
deviation, coefficient of variation, and skewness
15. A random variable is a numerical description of
the outcome of an experiment.
◦ A discrete random variable is one for which the number
of possible outcomes can be counted.
◦ A continuous random variable has outcomes over one or
more continuous intervals of real numbers.
A probability distribution is a characterization of
the possible values that a random variable may
assume along with the probability of assuming
these values.
16. We may develop a probability distribution using
any one of the three perspectives of probability:
Classical: probabilities can be deduced from
theoretical arguments
Subjective: probabilities are based on
judgment and experience (This is often done in
creating decision models for phenomena for
which we have no historical data)
Relative frequency (empirical): probabilities
are based on the relative frequencies from a
sample of empirical data
17. Roll 2 dice
36 possible rolls (1,1), (1,2),…(6,5), (6,6)
Probability = number of ways of rolling a number
divided by 36; e.g., probability of rolling a 3 is 2/36
Suppose two consumers try a new product.
Four outcomes:
1. like, like
2. like, dislike
3. dislike, like
4. dislike, dislike
Probability at least one dislikes product = 3/4
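A short Python sketch of the classical (enumeration) perspective, assuming, as the slide does, that all outcomes are equally likely:

```python
from itertools import product
from fractions import Fraction

# Classical perspective: enumerate all 36 equally likely rolls of two dice
rolls = list(product(range(1, 7), repeat=2))
p_three = Fraction(sum(a + b == 3 for a, b in rolls), len(rolls))
print(p_three)  # 2/36 = 1/18

# Two consumers, four equally likely (like/dislike) outcomes
outcomes = list(product(["like", "dislike"], repeat=2))
p_at_least_one_dislike = Fraction(
    sum("dislike" in o for o in outcomes), len(outcomes))
print(p_at_least_one_dislike)  # 3/4
```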
18. Distribution of an expert’s assessment of how the
DJIA (Dow Jones Industrial Average) might change
next year.
19. Airline Passengers
Sample data on passenger demand for 25 flights
◦ The histogram shows a relatively symmetric distribution. The
mean, median, and mode are all similar, although there is
moderate skewness. A normal distribution is not unreasonable.
20. Airport Service Times
Sample data on service times for 812 passengers at an
airport’s ticketing counter
◦ It is not clear what the distribution might be. It does not appear to
be exponential, but it might be lognormal or another distribution.
21. A better approach than simply visually examining a
histogram and summary statistics is to analytically fit
the data to the best type of probability distribution.
Statistics that measure goodness of fit include:
◦ AIC/BIC (Akaike information criterion/Bayesian information
criterion)
◦ Chi-square (need at least 50 data points)
◦ Kolmogorov-Smirnov (works well for small samples)
◦ Anderson-Darling (puts more weight on the differences between
the tails of the distributions)
Analytic Solver Platform has the capability of fitting a
probability distribution to data.
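The slides use Analytic Solver Platform for fitting; as an illustrative alternative, this Python sketch fits a normal distribution and applies the Kolmogorov-Smirnov test (the sample data are simulated stand-ins):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=100, scale=15, size=200)  # stand-in sample data

# Fit a normal distribution, then test the fit with Kolmogorov-Smirnov
mu, sigma = stats.norm.fit(data)
ks = stats.kstest(data, "norm", args=(mu, sigma))
print(f"D = {ks.statistic:.4f}, p = {ks.pvalue:.4f}")
# Caveat: estimating mu and sigma from the same data makes this
# K-S p-value optimistic (the Lilliefors correction addresses this).
```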
22. 1. Highlight the data
Analytic Solver Platform
> Tools > Fit
2. Fit Options dialog
Type: Continuous
Test: Kolmogorov-Smirnov
Click Fit button
24. A random number is one that is uniformly
distributed between 0 and 1.
Excel function: =RAND( )
A value randomly generated from a specified
probability distribution is called a random
variate.
◦ Example: Uniform distribution
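A minimal sketch of generating a uniform random variate, mirroring the Excel idiom =a+(b-a)*RAND() (the interval endpoints are illustrative):

```python
import numpy as np

rng = np.random.default_rng()
u = rng.random()            # analogous to Excel's =RAND(): uniform on [0, 1)

# Uniform random variate on [a, b], the same idea as =a+(b-a)*RAND()
a, b = 10.0, 20.0
variate = a + (b - a) * u
print(variate)
```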
25. Analysis Toolpak Random
Number Generation Tool
◦ Can sample from uniform,
normal, Bernoulli, binomial,
Poisson, patterned, and discrete
distributions.
◦ Can also specify a random
number seed – a value from
which a stream of random
numbers is generated. By
specifying the same seed, you
can produce the same random
numbers at a later time.
26. Generate 100 outcomes
from a Poisson
distribution with a mean
of 12
◦ Number of Variables = 1
◦ Number of Random
Numbers = 100
◦ Distribution = Poisson
◦ Dialog changes and
prompts you to enter
Lambda (mean of Poisson)
= 12
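An equivalent sketch in Python (numpy's seed argument plays the same role as the tool's random number seed; the seed value here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=12345)   # reproducible stream, like the tool's seed option
outcomes = rng.poisson(lam=12, size=100)  # 100 outcomes, lambda (mean) = 12
print(outcomes[:10], outcomes.mean())
```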
29. In finance, one way of evaluating capital budgeting
projects is to compute a profitability index: PI = PV / I, where
PV is the present value of future cash flows and
I is the initial investment.
What is the probability distribution of PI when PV is
estimated to be normally distributed with a mean of $12
million and a standard deviation of $2.5 million, and the
initial investment is also estimated to be normal with a
mean of $3.0 million and standard deviation of $0.8
million?
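One way to answer this is Monte Carlo simulation; a minimal Python sketch under the stated distributional assumptions (the trial count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 10_000
pv = rng.normal(12.0, 2.5, trials)   # PV ~ N(mean $12M, sd $2.5M)
inv = rng.normal(3.0, 0.8, trials)   # I  ~ N(mean $3.0M, sd $0.8M)
pi = pv / inv                        # profitability index per trial

print(pi.mean(), np.percentile(pi, [5, 50, 95]))
# The ratio of two normals is not itself normal, which is why
# simulation (or distribution fitting) is used to characterize PI.
```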
31. Analytic Solver Platform provides Excel functions
to generate random variates for many
distributions
32. An energy company was considering offering a new product and
needed to estimate the growth in PC ownership.
Using the best data and information available, they determined that
the minimum growth rate was 5.0%, the most likely value was 7.7%,
and the maximum value was 10.0% (a triangular distribution).
◦ A portion of 500 samples that were generated using the function
PsiTriangular(5%, 7.7%, 10%):
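A sketch of the same triangular sampling in Python (numpy's rng.triangular is an illustrative stand-in for PsiTriangular):

```python
import numpy as np

rng = np.random.default_rng(7)
# min = 5%, most likely (mode) = 7.7%, max = 10%
growth = rng.triangular(left=0.05, mode=0.077, right=0.10, size=500)
print(growth.min(), growth.mean(), growth.max())
```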
34. Real data sets often have missing values or
errors. Such data sets are called “dirty” and
need to be “cleaned” prior to analyzing
them.
◦ Handling missing data
◦ Handling outliers (observations that are radically
different from the rest)
35. Approaches for handling missing data.
◦ Eliminate the records/variables that contain missing
data
◦ Estimate reasonable values for missing observations,
such as the mean or median value
◦ Use a data mining procedure to deal with them.
XLMiner has the capability to deal with missing
data in the Transform menu in the Data Analysis
group.
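A minimal pandas sketch of the first two approaches (the data frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52.0, np.nan, 61.5, 48.2],
                   "age": [34.0, 41.0, np.nan, 29.0]})

dropped = df.dropna()                    # eliminate records with any missing value
filled = df.fillna({"income": df["income"].median(),   # replace with median
                    "age": df["age"].mean()})          # replace with mean
print(filled)
```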
36. XLMiner's Missing Data Handling utility allows users to
detect missing values in the dataset and handle them in
a specified way. XLMiner considers an observation to be
missing data if the cell is empty or contains an invalid
formula. In addition, it is also possible to treat cells
containing specific data as “missing”.
XLMiner offers several different methods for remedying
missing or invalid values. Each variable can be assigned
a different “treatment”. For example, the entire record
could be deleted if there is a missing value for one
variable, while the missing value could be replaced with
a specific value for another variable.
38. Examining the variables in the data set by means of summary
statistics, histograms, PivotTables, scatter plots, and other
tools can uncover data quality issues and outliers.
Some typical rules of thumb:
◦ z-scores greater than +3 or less than -3
◦ Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3
◦ Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3
Note:
* A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean, which is independent of the units of measurement.
* The interquartile range (IQR), or midspread, is the difference between the third and first quartiles, Q3 – Q1.
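A short pandas sketch of both rules of thumb (the data series is hypothetical):

```python
import pandas as pd

x = pd.Series([98, 101, 103, 97, 99, 100, 175])  # hypothetical observations

z = (x - x.mean()) / x.std()                     # z-scores
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1                                    # interquartile range

extreme = (x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)
mild = ((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)) & ~extreme

print(x[z.abs() > 3])       # z-score rule
print(x[mild], x[extreme])  # IQR fences
```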
39. Home Market Value data
None of the z-scores exceed 3. However, while
individual variables might not exhibit outliers,
combinations of them might.
◦ The last observation has a high market value ($120,700) but
a relatively small house size (1,581 square feet) and may be
an outlier.
40. Closer examination of outliers may reveal an error or a need
for further investigation to determine whether the observation
is relevant to the current analysis.
A conservative approach is to create two data sets, one with
and one without outliers, and then construct a model on both
data sets.
◦ If a model’s implications depend on the inclusion or
exclusion of outliers, then one should spend additional time
to track down the cause of the outliers.
42. Often data sets contain variables that, considered
separately, are not particularly insightful but that,
when combined as ratios, may represent
important relationships.
◦ Example: the price/earnings (PE) ratio
A critical task is determining how to represent the
measurements of the variables and which
variables to consider.
◦ Example: The variable Language with the possible values
of English, German, and Spanish would be replaced with
three binary variables called English, German, and
Spanish.
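A minimal pandas sketch of the Language example (pd.get_dummies is an illustrative stand-in for XLMiner's dummy-variable option described on the next slide):

```python
import pandas as pd

df = pd.DataFrame({"Language": ["English", "German", "Spanish", "English"]})

# One binary (dummy) column per category, as in the Language example above
dummies = pd.get_dummies(df["Language"])
df = pd.concat([df, dummies], axis=1)
print(df)
```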
44. XLMiner provides a Transform Categorical
procedure under Transform in the Data Analysis
group.
This procedure provides options to create dummy
variables, create ordinal category scores, and
reduce categories by combining them into similar
groups.
48. In some cases, it may be
desirable to transform a
continuous variable into
categories. XLMiner provides a
Bin Continuous Data procedure
under Transform in the Data
Analysis group.
Caution: In general, transforming continuous variables into categories
causes a loss of information (a continuous variable’s
category is less informative than a specific numeric value) and
increases the number of variables.
50. XLMiner calculates the interval as (Maximum value of the x3
variable - Minimum value of the x3 variable) / #bins specified by
the user, or in this instance (252 - 96) / 4 = 39.
Bin 12: values 96 - <135
Bin 15: values 135 - <174
Bin 18: values 174 - <213
Bin 21: values 213 - 252
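A sketch of the same equal-width binning in pandas (the x3 values are stand-ins chosen to span 96-252):

```python
import pandas as pd

x3 = pd.Series([96, 110, 140, 170, 200, 213, 252])  # stand-in values for x3

# Four equal-width bins over [96, 252]; width = (252 - 96) / 4 = 39
binned = pd.cut(x3, bins=4)  # pandas computes the same equal-width edges
print(binned.value_counts().sort_index())
```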
52. Cluster analysis, also called data segmentation, is
a collection of techniques that seek to group or
segment a collection of objects (observations or
records) into subsets or clusters, such that those
within each cluster are more closely related to one
another than objects assigned to different
clusters.
◦ The objects within clusters should exhibit a high amount
of similarity, whereas those in different clusters will be
dissimilar.
54. Hierarchical clustering: The data are not partitioned
into a particular cluster in a single step. Instead, a
series of partitions takes place, which may run from
a single cluster containing all objects to n clusters,
each containing a single object.
k-Means clustering: Given a value of k, the k-means
algorithm randomly partitions the observations into
k clusters. After all observations have been
assigned to a cluster, the resulting cluster centroids
are calculated. Using the updated cluster centroids,
all observations are reassigned to the cluster with
the closest centroid. This assign-and-recompute cycle
repeats until the assignments no longer change (see the
sketch below).
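A minimal k-means sketch using scikit-learn as an illustrative stand-in for XLMiner (the data matrix, k = 4, and seed are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 5))  # stand-in numeric data
X_std = StandardScaler().fit_transform(X)           # normalize the inputs

# k-means: random start, then alternate assign/recompute until stable
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_std)
print(km.labels_[:10])           # cluster assignment per observation
print(km.cluster_centers_.shape)
```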
55. • Agglomerative clustering methods proceed by a series of fusions of the
n objects into groups; this is the method implemented in XLMiner
• Divisive clustering methods separate the n objects successively into
finer groupings
56. Euclidean distance is the straight-line distance
between two points
The Euclidean distance measure between two
points (x1, x2, . . . , xn) and (y1, y2, . . . , yn) is
d = sqrt[(x1 − y1)^2 + (x2 − y2)^2 + . . . + (xn − yn)^2]
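A one-line check of the formula in Python (the two points are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
d = np.sqrt(np.sum((x - y) ** 2))  # sqrt((x1-y1)^2 + ... + (xn-yn)^2)
print(d)                           # 5.0; same as np.linalg.norm(x - y)
```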
57. Single linkage clustering (nearest-neighbor)
The distance between groups is defined as the distance between the closest
pair of objects, where only pairs consisting of one object from each group are
considered. At each stage, the closest two clusters are merged.
Complete linkage clustering
The distance between groups is the distance between the most distant pair of
objects, one from each group
Average linkage clustering
The distance between two clusters is defined as the average of distances
between all pairs of objects, where each pair is made up of one object from
each group.
Average group linkage clustering
Uses the mean values for each variable to compute distances between clusters
Ward’s hierarchical clustering
Uses a sum of squares criterion
Different methods generally yield
different results, so it is best to
experiment and compare the results.
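A sketch comparing linkage methods with SciPy as an illustrative stand-in for XLMiner (the data and cluster count are hypothetical); it also notes where a dendrogram would come from:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(1).normal(size=(30, 5))    # stand-in numeric data

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree at 4 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes vary by method

# dendrogram(Z) would plot the fusion tree for the last linkage computed
```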
59. Colleges and
Universities Data
Cluster the
institutions using the
five numeric columns
in the data set.
XLMiner > Data
Analysis > Cluster >
Hierarchical
Clustering
Note: We are clustering the numerical variables, so School and Type are not included.
60. Check the box
Normalize input data to
ensure that the distance
measure accords equal
weight to each variable
Use the Euclidean
distance as the
similarity measure for
numeric data.
Select the clustering
method you wish to
use.
61. Select the number of clusters (The agglomerative
method of hierarchical clustering keeps forming
clusters until only one cluster is left. This option
lets you stop the process at a given number of
clusters.) We selected four clusters.
63. Dendrogram
illustrates the
fusions or
divisions made
at each
successive stage
of analysis
A horizontal line
shows the
cluster partitions
64. Predicted clusters
◦ shows the assignment of
observations to the number
of clusters we specified in
the input dialog, (in this
case four)
Cluster 1: 23 colleges
Cluster 2: 22 colleges
Cluster 3: 3 colleges
Cluster 4: 1 college
66. Dimension reduction - process of removing
variables from the analysis without losing any
crucial information.
◦ One way is to examine pairwise correlations to detect
variables or groups of variables that may supply
similar information. Such variables can be aggregated
or removed to allow more parsimonious model
development.
Dimension reduction in XLMiner
◦ Feature selection
◦ Principal components
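A minimal pandas sketch of screening pairwise correlations (the variables are simulated; E is constructed to nearly duplicate A):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["A", "B", "C", "D"])
df["E"] = 0.95 * df["A"] + rng.normal(scale=0.1, size=200)  # nearly duplicates A

corr = df.corr()
# Flag highly correlated pairs as candidates to aggregate or remove
pairs = [(i, j, round(corr.loc[i, j], 2))
         for i in corr.columns for j in corr.columns
         if i < j and abs(corr.loc[i, j]) > 0.9]
print(pairs)  # e.g., the (A, E) pair
```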
67. Feature Selection
attempts to identify the
best subset of variables
(or features) out of the
available variables (or
features) to be used as
input to a classification or
prediction method.
69. The Principal Components
procedure can be found on
the XLMiner tab under
Transform in the Data
Analysis group.
Principal components analysis creates a
collection of metavariables (components) that are
weighted sums of the original variables. These
components are uncorrelated with each other and
often only a few of them are needed to convey the
same information as the large set of original
variables.
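A minimal PCA sketch using scikit-learn as an illustrative stand-in for XLMiner's procedure (the data matrix is a simulated placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(4).normal(size=(100, 8))  # stand-in original variables
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_.cumsum())  # variance carried by each component

scores = pca.transform(X_std)[:, :3]  # keep the first 3 uncorrelated components
```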
71. Data reduction - breaking down large sets
of data into more-manageable groups or
segments that provide better insight.
◦ Data sampling
◦ Data cleaning
◦ Data transformation
◦ Data segmentation
◦ Dimension reduction