DESCRIPTIVE ANALYTICS
DATA REDUCTION
 Data reduction - breaking down large sets
of data into more-manageable groups or
segments that provide better insight.
◦ Data sampling
◦ Data cleaning
◦ Data transformation
◦ Data segmentation
◦ Dimension reduction
2
DATA SAMPLING
 Data sampling - extract a sample of data that is
relevant to the business problem under
consideration.
◦ A population includes all of the entities of interest in a study.
◦ A sample is a subset of the population, often randomly chosen
and preferably representative of the population as a whole.
 Statistical inference focuses on drawing conclusions
about populations from samples.
◦ Estimation of population parameters
◦ Hypothesis testing – involves drawing conclusions about the
value of the parameters of one or more populations based on
sample data.
4
 Sampling plan - a description of the approach
that is used to obtain samples from a population
prior to any data collection activity.
 A sampling plan states:
 its objectives
 target population
 population frame (the list from which the sample is
selected)
 operational procedures for collecting data
 statistical tools for data analysis
5
 Example: A company wants to understand how golfers
might respond to a membership program that provides
discounts at golf courses.
◦ Objective - estimate the proportion of golfers who
would join the program
◦ Target population - golfers over 25 years old
◦ Population frame - golfers who purchased equipment
at particular stores
◦ Operational procedures - e-mail link to survey or direct-
mail questionnaire
◦ Statistical tools - PivotTables to summarize data by
demographic groups and estimate likelihood of joining
the program
6
 Subjective sampling methods
◦ Judgment sampling – expert judgment is used to select the sample
◦ Convenience sampling – samples are selected based on the ease
with which the data can be collected
 Probabilistic sampling methods
◦ Simple random sampling involves selecting items from a population
so that every subset of a given size has an equal chance of being
selected
◦ Systematic (periodic) sampling – a sampling plan that selects
every nth item from the population.
◦ Stratified sampling – applies to populations that are divided into
natural subsets (called strata) and allocates the appropriate
proportion of samples to each stratum.
◦ ...
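A minimal sketch of the three probabilistic methods above, using only the Python standard library. The golfer population, the age-group labels, and the function names are hypothetical and chosen only for illustration; stratified sampling here assumes proportional allocation.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, step, start=0):
    """Periodic sampling: take every step-th item, beginning at index start."""
    return population[start::step]

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Sample the same fraction from each stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 100 golfer IDs, each tagged with an age group.
golfers = [(i, "under_40" if i % 4 else "40_plus") for i in range(100)]

srs = simple_random_sample(golfers, 10, seed=1)
periodic = systematic_sample(golfers, step=10)
stratified = stratified_sample(golfers, lambda g: g[1], fraction=0.2, seed=1)
```

With 75 golfers in one stratum and 25 in the other, a 20% stratified sample takes 15 and 5 respectively, so the sample keeps the population's 3:1 mix.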
7
 We can determine the appropriate sample size
needed to estimate the population parameter
within a specified level of precision (± E).
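 Sample size for the mean: n ≥ (z_α/2)^2 σ^2 / E^2
 Sample size for the proportion: n ≥ (z_α/2)^2 π(1 - π) / E^2
(z_α/2 is the standard normal critical value for confidence level 1 - α,
σ the population standard deviation, π the population proportion)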
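A sketch of the sample-size calculation, assuming the usual normal-approximation formulas: n = (z σ / E)^2 for a mean and n = z^2 π(1 - π) / E^2 for a proportion, rounded up. The function names and the default 95% confidence level are illustrative choices, not part of the slides.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided standard normal critical value for the given confidence level."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def sample_size_mean(sigma, margin, confidence=0.95):
    """Smallest n such that z * sigma / sqrt(n) <= margin."""
    z = z_value(confidence)
    return math.ceil((z * sigma / margin) ** 2)

def sample_size_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n such that z * sqrt(p * (1 - p) / n) <= margin.

    p = 0.5 is the conservative choice when no prior estimate is available.
    """
    z = z_value(confidence)
    return math.ceil(z ** 2 * p * (1.0 - p) / margin ** 2)

n_mean = sample_size_mean(sigma=15, margin=3)   # 97
n_prop = sample_size_proportion(margin=0.05)    # 385
```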
8
 Using Analysis ToolPak Add-in
 Data > Analysis
> Data Analysis > Sampling
9
 Sales Transactions database
 Data > Data Analysis > Sampling
 Periodic selects every nth number
 Random selects a simple random sample
Sampling is done with
replacement so
duplicates may occur.
10
 XLMiner can sample from an Excel worksheet
XLMiner > Data > Get Data > Worksheet
11
 Credit Risk Data
 Click inside the database
 XLMiner > Get Data >
Worksheet
 Select variables and move
to right pane
 Choose sampling options
12
 Results
13
 Using sample data may limit our ability to predict
uncertain events that may occur because potential
values outside the range of the sample data are not
included.
 A better approach is to identify the underlying probability
distribution from which sample data come by “fitting” a
theoretical distribution to the data and verifying the
goodness of fit statistically.
◦ Examine a histogram for clues about the distribution’s shape
◦ Look at summary statistics such as the mean, median, standard
deviation, coefficient of variation, and skewness
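The summary statistics listed above can be computed as follows. This is a sketch with made-up data; the skewness uses the adjusted sample formula n / ((n - 1)(n - 2)) * Σ((x - mean) / s)^3, which is an assumption about which skewness definition is intended.

```python
import statistics

def describe(data):
    """Summary statistics that hint at the shape of the underlying distribution."""
    n = len(data)
    mean = statistics.fmean(data)
    median = statistics.median(data)
    stdev = statistics.stdev(data)   # sample standard deviation
    cv = stdev / mean                # coefficient of variation (unit-free spread)
    # Adjusted sample skewness: 0 for symmetric data, > 0 for a long right tail
    skew = n / ((n - 1) * (n - 2)) * sum(((x - mean) / stdev) ** 3 for x in data)
    return {"mean": mean, "median": median, "stdev": stdev, "cv": cv, "skewness": skew}

# Hypothetical sample with one large value on the right
sample = [1, 1, 2, 2, 3, 10]
summary = describe(sample)
```

A mean well above the median together with positive skewness suggests a right-skewed candidate (for example exponential or lognormal) rather than a normal distribution.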
14
 A random variable is a numerical description of
the outcome of an experiment.
◦ A discrete random variable is one for which the number
of possible outcomes can be counted.
◦ A continuous random variable has outcomes over one or
more continuous intervals of real numbers.
 A probability distribution is a characterization of
the possible values that a random variable may
assume along with the probability of assuming
these values.
15
 We may develop a probability distribution using
any one of the three perspectives of probability:
 Classical: probabilities can be deduced from
theoretical arguments
 Subjective: probabilities are based on
judgment and experience (This is often done in
creating decision models for phenomena for
which we have no historical data)
 Relative frequency (empirical): probabilities
are based on the relative frequencies from a
sample of empirical data
16
Roll 2 dice
 36 possible rolls (1,1), (1,2),…(6,5), (6,6)
 Probability = number of ways of rolling a number
divided by 36; e.g., probability of a 3 is 2/36
Suppose two consumers try a new product.
 Four outcomes:
1. like, like
2. like, dislike
3. dislike, like
4. dislike, dislike
 Probability at least one dislikes product = 3/4
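Both classical examples can be checked by enumerating the outcomes. The sketch below assumes every outcome is equally likely, which is natural for fair dice but is an assumption for the two consumers.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Two dice: enumerate all 36 ordered rolls and count each total
rolls = list(product(range(1, 7), repeat=2))
counts = Counter(a + b for a, b in rolls)
dist = {total: Fraction(c, len(rolls)) for total, c in sorted(counts.items())}

# Two consumers: four (like/dislike) outcomes, assumed equally likely
outcomes = list(product(["like", "dislike"], repeat=2))
p_dislike = Fraction(sum("dislike" in o for o in outcomes), len(outcomes))
```

Here `dist[3]` is 2/36 (stored in lowest terms as 1/18) and `p_dislike` is 3/4.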
17
 Distribution of an expert’s assessment of how the
DJIA (Dow Jones Industrial Average) might change
next year.
18
 Airline Passengers
 Sample data on passenger demand for 25 flights
◦ The histogram shows a relatively symmetric distribution. The
mean, median, and mode are all similar, although there is
moderate skewness. A normal distribution is not unreasonable.
19
 Airport Service Times
 Sample data on service times for 812 passengers at an
airport’s ticketing counter
◦ It is not clear what the distribution might be. It does not appear to
be exponential, but it might be lognormal or another distribution.
20
 A better approach than simply visually examining a
histogram and summary statistics is to analytically fit
the data to the best type of probability distribution.
 Several statistics measure goodness of fit:
◦ AIC/BIC (Akaike information criterion/Bayesian information
criterion)
◦ Chi-square (need at least 50 data points)
◦ Kolmogorov-Smirnov (works well for small samples)
◦ Anderson-Darling (puts more weight on the differences between
the tails of the distributions)
 Analytic Solver Platform has the capability of fitting a
probability distribution to data.
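To make the idea of a goodness-of-fit statistic concrete, here is a sketch of the Kolmogorov-Smirnov statistic computed by hand. It is not what Analytic Solver Platform does internally; the data are synthetic, and the two candidate distributions are fitted crudely by matching the sample mean and standard deviation.

```python
import math
import random
import statistics

def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF and a candidate CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

rng = random.Random(42)
service_times = [rng.expovariate(1 / 3.0) for _ in range(500)]  # synthetic, mean 3

mean = statistics.fmean(service_times)
stdev = statistics.stdev(service_times)

def expon_cdf(x):
    return 1.0 - math.exp(-x / mean)

normal_cdf = statistics.NormalDist(mean, stdev).cdf

d_expon = ks_statistic(service_times, expon_cdf)
d_normal = ks_statistic(service_times, normal_cdf)
```

The smaller statistic indicates the closer fit; for these synthetic service times the exponential candidate should beat the normal one.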
21
1. Highlight the data
Analytic Solver Platform
> Tools > Fit
2. Fit Options dialog
Type: Continuous
Test: Kolmogorov-Smirnov
Click Fit button
22
 The best-fitting distribution is called an Erlang
distribution.
23
 A random number is one that is uniformly
distributed between 0 and 1.
 Excel function: =RAND( )
 A value randomly generated from a specified
probability distribution is called a random
variate.
◦ Example: Uniform distribution
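For the uniform case, a random variate on the interval from a to b can be obtained from a random number R as a + (b - a) * R (in Excel, =a + (b - a)*RAND()). A sketch, with the interval 10 to 20 chosen arbitrarily:

```python
import random

def uniform_variate(a, b, rng):
    """Stretch and shift a random number from [0, 1) onto the interval from a to b."""
    return a + (b - a) * rng.random()

rng = random.Random(7)   # seeded so the stream can be reproduced
draws = [uniform_variate(10, 20, rng) for _ in range(1000)]
```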
24
 Analysis Toolpak Random
Number Generation Tool
◦ Can sample from uniform,
normal, Bernoulli, binomial,
Poisson, patterned, and discrete
distributions.
◦ Can also specify a random
number seed – a value from
which a stream of random
numbers is generated. By
specifying the same seed, you
can produce the same random
numbers at a later time.
25
 Generate 100 outcomes
from a Poisson
distribution with a mean
of 12
◦ Number of Variables = 1
◦ Number of Random
Numbers = 100
◦ Distribution = Poisson
◦ Dialog changes and
prompts you to enter
Lambda (mean of Poisson)
= 12
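The same experiment can be sketched outside Excel. Python's standard library has no Poisson generator, so this example swaps in Knuth's multiplication method; the seed value is arbitrary and only demonstrates that a fixed seed reproduces the same stream.

```python
import math
import random

def poisson_variate(lam, rng):
    """Knuth's method: count how many uniform draws can be multiplied
    together before the running product falls to e^(-lam) or below."""
    threshold = math.exp(-lam)
    k = 0
    running = rng.random()
    while running > threshold:
        k += 1
        running *= rng.random()
    return k

def generate(n, lam, seed):
    rng = random.Random(seed)   # same seed -> same random numbers later
    return [poisson_variate(lam, rng) for _ in range(n)]

outcomes = generate(100, 12, seed=2024)
sample_mean = sum(outcomes) / len(outcomes)
```

The sample mean of the 100 outcomes should land near 12, and calling `generate(100, 12, seed=2024)` again returns the identical list.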
26
 Results
(Histogram created manually)
27
 Normal: =NORM.INV(RAND( ), mean, stdev)
 Standard normal: =NORM.S.INV(RAND( ))
28
 In finance, one way of evaluating capital budgeting
projects is to compute a profitability index: PI = PV / I, where
 PV is the present value of future cash flows
 I is the initial investment
 What is the probability distribution of PI when PV is
estimated to be normally distributed with a mean of $12
million and a standard deviation of $2.5 million, and the
initial investment is also estimated to be normal with a
mean of $3.0 million and standard deviation of $0.8
million?
29
 Column F:
=NORM.INV(RAND(), 12, 2.5)
 Column G:
=NORM.INV(RAND(), 3, 0.8)
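The two spreadsheet columns can be mirrored with `statistics.NormalDist.inv_cdf`, which plays the role of NORM.INV. This is a sketch: the number of trials (5000) and the seed are arbitrary choices, and the helper name `draw` is hypothetical.

```python
import random
from statistics import NormalDist, median

rng = random.Random(1)
pv_dist = NormalDist(mu=12.0, sigma=2.5)    # present value, $ millions
inv_dist = NormalDist(mu=3.0, sigma=0.8)    # initial investment, $ millions

def draw(dist):
    """Equivalent of NORM.INV(RAND(), mean, stdev).

    inv_cdf is undefined at exactly 0, so redraw in that (very rare) case.
    """
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return dist.inv_cdf(u)

# One simulated profitability index per trial: PI = PV / I
pi_values = [draw(pv_dist) / draw(inv_dist) for _ in range(5000)]
pi_median = median(pi_values)
```

The median is reported rather than the mean because the simulated investment can occasionally fall close to zero, which gives the ratio very long tails. The resulting distribution of PI is right-skewed, not normal.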
30
 Analytic Solver Platform provides Excel functions
to generate random variates for many
distributions
31
 An energy company was considering offering a new product and
needed to estimate the growth in PC ownership.
 Using the best data and information available, they determined that
the minimum growth rate was 5.0%, the most likely value was 7.7%,
and the maximum value was 10.0% (a triangular distribution).
◦ A portion of 500 samples that were generated using the function
PsiTriangular(5%, 7.7%, 10%):
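A comparable sample can be drawn with Python's `random.triangular`. Note the different argument order: PsiTriangular takes (min, most likely, max), while `random.triangular` takes (low, high, mode). The seed is arbitrary.

```python
import random

rng = random.Random(5)
low, mode, high = 0.05, 0.077, 0.10

# random.triangular expects (low, high, mode), not (low, mode, high)
growth = [rng.triangular(low, high, mode) for _ in range(500)]

sample_mean = sum(growth) / len(growth)
theoretical_mean = (low + mode + high) / 3   # mean of a triangular distribution
```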
32
DATA CLEANING
 Real data sets often have missing values or
errors. Such data sets are called “dirty” and
need to be “cleaned” prior to analyzing
them.
◦ Handling missing data
◦ Handling outliers (observations that are radically
different from the rest)
34
 Approaches for handling missing data.
◦ Eliminate the records/variables that contain missing
data
◦ Estimate reasonable values for missing observations,
such as the mean or median value
◦ Use a data mining procedure to deal with them.
 XLMiner has the capability to deal with missing
data in the Transform menu in the Data Analysis
group.
35
 XLMiner's Missing Data Handling utility allows users to
detect missing values in the dataset and handle them in
a specified way. XLMiner considers an observation to be
missing data if the cell is empty or contains an invalid
formula. In addition, it is also possible to treat cells
containing specific data as “missing”.
 XLMiner offers several different methods for remedying
missing or invalid values. Each variable can be assigned
a different “treatment”. For example, the entire record
could be deleted if there is a missing value for one
variable, while the missing value could be replaced with
a specific value for another variable.
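The idea of assigning a different treatment to each variable can be sketched as follows. This is not XLMiner's implementation; the records, the treatment names, and the extra "missing" codes ("", "NA", -999) are hypothetical.

```python
from statistics import median

records = [
    {"age": 34, "income": 52000, "region": "N"},
    {"age": None, "income": 61000, "region": "S"},
    {"age": 45, "income": None, "region": "S"},
    {"age": 29, "income": 48000, "region": None},
]

def is_missing(value, extra_codes=("", "NA", -999)):
    """Empty cells plus any user-specified codes count as missing."""
    return value is None or value in extra_codes

def clean(records, treatments):
    # Pre-compute medians for variables that use the "median" treatment
    medians = {}
    for var, how in treatments.items():
        if how == "median":
            observed = [r[var] for r in records if not is_missing(r[var])]
            medians[var] = median(observed)
    cleaned = []
    for r in records:
        row = dict(r)
        keep = True
        for var, how in treatments.items():
            if not is_missing(row[var]):
                continue
            if how == "delete_record":
                keep = False
                break
            elif how == "median":
                row[var] = medians[var]
            else:                      # ("constant", value)
                row[var] = how[1]
        if keep:
            cleaned.append(row)
    return cleaned

result = clean(records, {
    "age": "median",
    "income": "delete_record",
    "region": ("constant", "unknown"),
})
```

The record with a missing income is dropped, the missing age is replaced by the median of the observed ages (34), and the missing region becomes "unknown".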
36
37
 Examining the variables in the data set by means of summary
statistics, histograms, PivotTables, scatter plots, and other
tools can uncover data quality issues and outliers.
 Some typical rules of thumb for flagging outliers:
 Observations with z-scores greater than +3 or less than -3
 Extreme outliers are more than 3*IQR to the left of Q1 or right
of Q3
 Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or
right of Q3
Note:
* A standardized value, commonly called a z-score, provides a relative
measure of the distance an observation is from the mean, which is
independent of the units of measurement.
* The interquartile range (IQR), or midspread, is the difference between
the third and first quartiles, Q3 – Q1.
38
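The rules of thumb can be sketched with the standard library. The data are made up, and the quartiles use the "inclusive" interpolation method, which is an assumption; other quartile conventions shift the fences slightly.

```python
import statistics

def z_scores(data):
    mean = statistics.fmean(data)
    stdev = statistics.stdev(data)
    return [(x - mean) / stdev for x in data]

def iqr_outliers(data):
    """Classify points as mild or extreme outliers using IQR fences."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in data:
        # Distance outside the box; zero or negative means inside [Q1, Q3]
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return mild, extreme

# Hypothetical observations
data = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]
mild, extreme = iqr_outliers(data)
largest_z = max(abs(z) for z in z_scores(data))
```

Here Q1 = 13.25 and Q3 = 17.75, so 30 is a mild outlier and 60 an extreme one. The z-score of 60 is still below 3, because the outliers inflate the standard deviation, which shows that the two rules do not always agree.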
 Home Market Value data
 None of the z-scores exceed 3. However, while
individual variables might not exhibit outliers,
combinations of them might.
◦ The last observation has a high market value ($120,700) but
a relatively small house size (1,581 square feet) and may be
an outlier.
39
 Closer examination of outliers may reveal an error or a need
for further investigation to determine whether the observation
is relevant to the current analysis.
 A conservative approach is to create two data sets, one with
and one without outliers, and then construct a model on both
data sets.
◦ If a model’s implications depend on the inclusion or
exclusion of outliers, then one should spend additional time
to track down the cause of the outliers.
40
DATA TRANSFORMATION
 Often data sets contain variables that, considered
separately, are not particularly insightful but that,
when combined as ratios, may represent
important relationships.
◦ Example: the price/earnings (PE) ratio
 A critical task is determining how to represent the
measurements of the variables and which
variables to consider.
◦ Example: The variable Language with the possible values
of English, German, and Spanish would be replaced with
three binary variables called English, German, and
Spanish.
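Replacing a categorical variable with binary (dummy) variables can be sketched as below; the sample values are illustrative.

```python
def one_hot(values, categories=None):
    """Replace each categorical value with one 0/1 variable per category."""
    if categories is None:
        categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

languages = ["English", "German", "Spanish", "German"]
encoded = one_hot(languages)
# encoded[1] == {"English": 0, "German": 1, "Spanish": 0}
```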
42
v ∈ [min, max]
v’ = [(v - min) / (max - min)] * (max_new - min_new) + min_new
 v’ ∈ [min_new, max_new]
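The rescaling formula above translates directly into code. A sketch, with a default target range of 0 to 1 chosen for illustration:

```python
def rescale(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Map v from [v_min, v_max] linearly onto [new_min, new_max]."""
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

values = [96, 135, 174, 252]
lo, hi = min(values), max(values)
scaled = [rescale(v, lo, hi) for v in values]   # [0.0, 0.25, 0.5, 1.0]
```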
43
 XLMiner provides a Transform Categorical
procedure under Transform in the Data Analysis
group.
 This procedure provides options to create dummy
variables, create ordinal category scores, and
reduce categories by combining them into similar
groups.
44
45
46
47
 In some cases, it may be
desirable to transform a
continuous variable into
categories. XLMiner provides a
Bin Continuous Data procedure
under Transform in the Data
Analysis group.
 Caution: In general, transforming continuous variables into categories
causes a loss of information (a continuous variable’s
category is less informative than a specific numeric value) and
increases the number of variables.
48
49
XLMiner calculates the bin width as (maximum value of the x3 variable -
minimum value of the x3 variable) / number of bins specified by the user;
in this instance, (252 - 96) / 4 = 39.
Bin 12: Values 96 - < 135
Bin 15: Values 135 - < 174
Bin 18: Values 174 - < 213
Bin 21: Values 213 - 252
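The equal-width calculation can be sketched as follows. Bins are numbered 0 to 3 here for simplicity, rather than with the bin labels shown above; placing the maximum value in the last bin is an assumption that matches the final interval 213 - 252.

```python
def bin_edges(lo, hi, n_bins):
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def equal_width_bin(value, lo, hi, n_bins):
    """Index of the equal-width bin holding value; the maximum goes in the last bin."""
    width = (hi - lo) / n_bins
    index = int((value - lo) // width)
    return min(index, n_bins - 1)

edges = bin_edges(96, 252, 4)   # [96.0, 135.0, 174.0, 213.0, 252.0]
bins = [equal_width_bin(v, 96, 252, 4) for v in (96, 134.9, 135, 174, 213, 252)]
```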
50
CLUSTER ANALYSIS
 Cluster analysis, also called data segmentation, is
a collection of techniques that seek to group or
segment a collection of objects (observations or
records) into subsets or clusters, such that those
within each cluster are more closely related to one
another than objects assigned to different
clusters.
◦ The objects within clusters should exhibit a high amount
of similarity, whereas those in different clusters will be
dissimilar.
52
53
 Hierarchical clustering: The data are not partitioned
into a particular cluster in a single step. Instead, a
series of partitions takes place, which may run from
a single cluster containing all objects to n clusters,
each containing a single object.
 k-Means clustering: Given a value of k, the k-means
algorithm randomly partitions the observations into
k clusters. After all observations have been
assigned to a cluster, the resulting cluster centroids
are calculated. Using the updated cluster centroids,
all observations are reassigned to the cluster with
the closest centroid, and the two steps repeat until the
assignments no longer change.
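A compact sketch of k-means with a random initial partition. The points are hypothetical, and dealing shuffled points round-robin into clusters is an implementation choice made here so that no cluster starts empty.

```python
import math
import random

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Random initial partition: shuffle, then deal points round-robin
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, idx in enumerate(order):
        labels[idx] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Update step: recompute each cluster's centroid
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:   # keep the previous centroid if a cluster empties out
                centroids[c] = centroid(members)
        # Assignment step: move each point to its closest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:   # converged: no assignment changed
            break
        labels = new_labels
    return labels, centroids

# Two hypothetical, well-separated groups
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
```

The result depends on the random start in general, so in practice the algorithm is often run several times and the best partition kept.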
54
• Agglomerative clustering methods proceed by a series of fusions of the
n objects into groups (this is the method implemented in XLMiner)
• Divisive clustering methods separate n objects successively into
finer groupings
55
 Euclidean distance is the straight-line distance
between two points
 The Euclidean distance measure between two
points (x1, x2, . . . , xn) and (y1, y2, . . . , yn) is
sqrt((x1 - y1)^2 + (x2 - y2)^2 + . . . + (xn - yn)^2)
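In code, the Euclidean distance is the square root of the sum of squared coordinate differences. A short sketch; Python 3.8+ also offers the built-in `math.dist`, which gives the same result.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d = euclidean((1, 2), (4, 6))   # 5.0, the 3-4-5 right triangle
```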
56
 Single linkage clustering (nearest-neighbor)
The distance between groups is defined as the distance between the closest
pair of objects, where only pairs consisting of one object from each group are
considered. At each stage, the closest 2 clusters are merged
 Complete linkage clustering
The distance between groups is the distance between the most distant pair of
objects, one from each group
 Average linkage clustering
The distance between two clusters is defined as the average of distances
between all pairs of objects, where each pair is made up of one object from
each group.
 Average group linkage clustering
Uses the mean values for each variable to compute distances between clusters
 Ward’s hierarchical clustering
Uses a sum of squares criterion
Different methods generally yield
different results, so it is best to
experiment and compare the results.
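The first four linkage rules differ only in how they summarize the distances between two clusters, as this sketch shows. The two small clusters are hypothetical; average group linkage is interpreted here as the distance between cluster centroids, and Ward's method is omitted.

```python
import math
from itertools import product

def pair_distances(a, b):
    """All distances between one object from cluster a and one from cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pair_distances(a, b))      # closest pair

def complete_linkage(a, b):
    return max(pair_distances(a, b))      # most distant pair

def average_linkage(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)                # mean over all pairs

def centroid(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def average_group_linkage(a, b):
    return math.dist(centroid(a), centroid(b))   # distance between mean points

a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
```

For these clusters the single-linkage distance is 3.0, while complete linkage gives sqrt(17), about 4.12, which illustrates why different methods can merge clusters in a different order.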
57
58
 Colleges and
Universities Data
 Cluster the
institutions using the
five numeric columns
in the data set.
 XLMiner > Data
Analysis > Cluster >
Hierarchical
Clustering
Note: We are clustering the
numerical variables, so
School and Type are not
included.
59
 Check the box
Normalize input data to
ensure that the distance
measure accords equal
weight to each variable
 Use the Euclidean
distance as the
similarity measure for
numeric data.
 Select the clustering
method you wish to
use.
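Why normalization matters: without it, a variable measured in large units dominates the Euclidean distance. The sketch below assumes normalization means converting each column to z-scores, which may differ from what XLMiner does; the two columns and their values are hypothetical.

```python
import statistics

def normalize_columns(rows):
    """Standardize every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical rows: (test score, spending per student in dollars)
schools = [(1200, 20000), (1300, 30000), (1400, 40000)]
normalized = normalize_columns(schools)
```

After standardizing, both columns run from -1 to 1, so each contributes equally to the distance between two schools.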
60
 Select the number of clusters (The agglomerative
method of hierarchical clustering keeps forming
clusters until only one cluster is left. This option
lets you stop the process at a given number of
clusters.) We selected four clusters.
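Stopping the agglomerative process at a chosen number of clusters can be sketched like this. The example uses single linkage and a brute-force search for the closest pair, purely for brevity; the points are hypothetical.

```python
import math

def agglomerate(points, n_clusters):
    """Fuse the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]    # start: every object is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # fuse the closest pair
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = agglomerate(points, 3)
```

With `n_clusters=1` the loop would keep fusing until a single cluster is left; stopping at 3 leaves the two tight pairs and the isolated point (20, 20) as separate clusters.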
61
 Results
62
 Dendrogram
illustrates the
fusions or
divisions made
at each
successive stage
of analysis
 A horizontal line
shows the
cluster partitions
63
 Predicted clusters
◦ shows the assignment of
observations to the number
of clusters we specified in
the input dialog (in this
case, four)
Cluster   # Colleges
1         23
2         22
3         3
4         1
64
DIMENSION REDUCTION
 Dimension reduction - process of removing
variables from the analysis without losing any
crucial information.
◦ One way is to examine pairwise correlations to detect
variables or groups of variables that may supply
similar information. Such variables can be aggregated
or removed to allow more parsimonious model
development.
 Dimension reduction in XLMiner
◦ Feature selection
◦ Principal components
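Screening pairwise correlations, as described above, can be sketched as follows. The variables, their values, and the 0.9 threshold are hypothetical choices.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def redundant_pairs(columns, threshold=0.9):
    """List variable pairs whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Hypothetical data set
data = {
    "size_sqft": [1500, 1800, 2100, 2400, 2700],
    "rooms": [5, 6, 7, 8, 9],          # moves in lockstep with size
    "age_years": [30, 5, 22, 14, 9],
}
pairs = redundant_pairs(data)
```

Only the size/rooms pair is flagged, so one of those two variables is a candidate for removal or aggregation.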
66
 Feature Selection
attempts to identify the
best subset of variables
(or features) out of the
available variables (or
features) to be used as
input to a classification or
prediction method.
67
68
 The Principal Components
procedure can be found on
the XLMiner tab under
Transform in the Data
Analysis group.
 Principal components analysis creates a
collection of metavariables (components) that are
weighted sums of the original variables. These
components are uncorrelated with each other and
often only a few of them are needed to convey the
same information as the large set of original
variables.
69
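To show what "weighted sums of the original variables" means, here is a sketch that finds only the first principal component by power iteration on the covariance matrix. This is an illustrative shortcut, not XLMiner's procedure; full PCA would use an eigen-decomposition and usually standardized variables. The data are made up.

```python
import math

def first_principal_component(rows, iterations=200):
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the centered variables
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Power iteration: repeated multiplication converges to the direction
    # of largest variance (assumes the start vector is not orthogonal to it)
    w = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # Component scores: the weighted sum of centered variables for each row
    scores = [sum(c[j] * w[j] for j in range(m)) for c in centered]
    return w, scores

# Hypothetical data in which the second variable is exactly twice the first
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
weights, scores = first_principal_component(rows)
```

Because the two variables carry the same information, a single component with weights proportional to (1, 2) captures all of the variation.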
70
 Data reduction - breaking down large sets
of data into more-manageable groups or
segments that provide better insight.
◦ Data sampling
◦ Data cleaning
◦ Data transformation
◦ Data segmentation
◦ Dimension reduction





71
 
A DAY IN THE LIFE OF A SALESMAN / WOMAN
A DAY IN THE LIFE OF A  SALESMAN / WOMANA DAY IN THE LIFE OF A  SALESMAN / WOMAN
A DAY IN THE LIFE OF A SALESMAN / WOMANIlamathiKannappan
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsP&CO
 
Understanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key InsightsUnderstanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key Insightsseri bangash
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Dave Litwiller
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communicationskarancommunications
 
Cracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxCracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxWorkforce Group
 

Último (20)

VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
 
Forklift Operations: Safety through Cartoons
Forklift Operations: Safety through CartoonsForklift Operations: Safety through Cartoons
Forklift Operations: Safety through Cartoons
 
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
 
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
 
Event mailer assignment progress report .pdf
Event mailer assignment progress report .pdfEvent mailer assignment progress report .pdf
Event mailer assignment progress report .pdf
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
 
Monthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptxMonthly Social Media Update April 2024 pptx.pptx
Monthly Social Media Update April 2024 pptx.pptx
 
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
Best VIP Call Girls Noida Sector 40 Call Me: 8448380779
 
Progress Report - Oracle Database Analyst Summit
Progress  Report - Oracle Database Analyst SummitProgress  Report - Oracle Database Analyst Summit
Progress Report - Oracle Database Analyst Summit
 
HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael Hawkins
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...
 
Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Service
 
A DAY IN THE LIFE OF A SALESMAN / WOMAN
A DAY IN THE LIFE OF A  SALESMAN / WOMANA DAY IN THE LIFE OF A  SALESMAN / WOMAN
A DAY IN THE LIFE OF A SALESMAN / WOMAN
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and pains
 
Understanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key InsightsUnderstanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key Insights
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 
Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communications
 
Cracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxCracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptx
 

Descriptive Analytics: Data Reduction

  • 2. Data reduction - breaking down large sets of data into more-manageable groups or segments that provide better insight. ◦ Data sampling ◦ Data cleaning ◦ Data transformation ◦ Data segmentation ◦ Dimension reduction
  • 4. Data sampling - extract a sample of data that is relevant to the business problem under consideration. ◦ A population includes all of the entities of interest in a study. ◦ A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole. Statistical inference focuses on drawing conclusions about populations from samples. ◦ Estimation of population parameters ◦ Hypothesis testing - involves drawing conclusions about the value of the parameters of one or more populations based on sample data.
  • 5. Sampling plan - a description of the approach that is used to obtain samples from a population prior to any data collection activity. A sampling plan states: ◦ its objectives ◦ target population ◦ population frame (the list from which the sample is selected) ◦ operational procedures for collecting data ◦ statistical tools for data analysis
  • 6. Example: A company wants to understand how golfers might respond to a membership program that provides discounts at golf courses. ◦ Objective - estimate the proportion of golfers who would join the program ◦ Target population - golfers over 25 years old ◦ Population frame - golfers who purchased equipment at particular stores ◦ Operational procedures - e-mail link to survey or direct-mail questionnaire ◦ Statistical tools - PivotTables to summarize data by demographic groups and estimate likelihood of joining the program
  • 7. Subjective sampling methods ◦ Judgment sampling - expert judgment is used to select the sample ◦ Convenience sampling - samples are selected based on the ease with which the data can be collected. Probabilistic sampling methods ◦ Simple random sampling - selecting items from a population so that every subset of a given size has an equal chance of being selected ◦ Systematic (periodic) sampling - a sampling plan that selects every nth item from the population ◦ Stratified sampling - applies to populations that are divided into natural subsets (called strata) and allocates the appropriate proportion of samples to each stratum ◦ ...
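The three probabilistic methods on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the deck: the function names, the toy population of 100 numbered items, the two strata, and the 10% sampling fraction are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n items without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` items, then take every step-th item."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, fraction, rng):
    """Sample the same fraction from every stratum (proportional allocation)."""
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 100 numbered items split into two strata.
rng = random.Random(42)
population = list(range(1, 101))
strata = {"low": list(range(1, 61)), "high": list(range(61, 101))}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.10, rng)
```

With proportional allocation, the 60-item stratum contributes 6 items and the 40-item stratum contributes 4, so the sample mirrors the population's composition.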
  • 8. We can determine the appropriate sample size needed to estimate the population parameter within a specified level of precision (± E). Sample size for the mean: $n \ge (z_{\alpha/2})^2 \sigma^2 / E^2$. Sample size for the proportion: $n \ge (z_{\alpha/2})^2 \pi (1 - \pi) / E^2$.
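A short sketch of the sample-size calculation follows. It assumes the standard textbook formulas, n = (z·σ/E)² for a mean and n = z²·p(1−p)/E² for a proportion, with z the two-sided standard normal critical value. The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided critical value, e.g. about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_mean(sigma, margin, confidence=0.95):
    """Smallest n that estimates a mean within +/- margin, given a stdev guess sigma."""
    z = z_value(confidence)
    return math.ceil((z * sigma / margin) ** 2)

def sample_size_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n that estimates a proportion within +/- margin.
    p = 0.5 is the conservative choice when no prior estimate exists."""
    z = z_value(confidence)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_mean = sample_size_mean(sigma=15, margin=3)   # assumed inputs
n_prop = sample_size_proportion(margin=0.05)    # assumed inputs
```

Rounding up with `math.ceil` keeps the achieved precision at least as good as the requested margin.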
  • 9. Using the Analysis ToolPak add-in: Data > Analysis > Data Analysis > Sampling
  • 10. Sales Transactions database: Data > Data Analysis > Sampling. Periodic selects every nth number. Random selects a simple random sample. Sampling is done with replacement, so duplicates may occur.
  • 11. XLMiner can sample from an Excel worksheet: XLMiner > Data > Get Data > Worksheet
  • 12. Credit Risk Data ◦ Click inside the database ◦ XLMiner > Get Data > Worksheet ◦ Select variables and move to right pane ◦ Choose sampling options
  • 14. Using sample data may limit our ability to predict uncertain events that may occur because potential values outside the range of the sample data are not included. A better approach is to identify the underlying probability distribution from which sample data come by “fitting” a theoretical distribution to the data and verifying the goodness of fit statistically. ◦ Examine a histogram for clues about the distribution’s shape ◦ Look at summary statistics such as the mean, median, standard deviation, coefficient of variation, and skewness
  • 15. A random variable is a numerical description of the outcome of an experiment. ◦ A discrete random variable is one for which the number of possible outcomes can be counted. ◦ A continuous random variable has outcomes over one or more continuous intervals of real numbers. A probability distribution is a characterization of the possible values that a random variable may assume along with the probability of assuming these values.
  • 16. We may develop a probability distribution using any one of the three perspectives of probability: ◦ Classical: probabilities can be deduced from theoretical arguments ◦ Subjective: probabilities are based on judgment and experience (this is often done in creating decision models for phenomena for which we have no historical data) ◦ Relative frequency (empirical): probabilities are based on the relative frequencies from a sample of empirical data
  • 17. Roll 2 dice: 36 possible rolls (1,1), (1,2), … (6,5), (6,6). Probability = number of ways of rolling a number divided by 36; e.g., probability of a 3 is 2/36. Suppose two consumers try a new product. Four outcomes: 1. like, like 2. like, dislike 3. dislike, like 4. dislike, dislike. Probability at least one dislikes product = 3/4
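Both classical-probability examples on this slide can be checked by enumerating the equally likely outcomes. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely rolls of two dice.
rolls = list(product(range(1, 7), repeat=2))
ways = Counter(a + b for a, b in rolls)

# Probability of each total = number of ways / 36.
dice_dist = {total: Fraction(count, len(rolls)) for total, count in sorted(ways.items())}

# Two consumers, each either likes or dislikes the product: 4 outcomes.
outcomes = list(product(["like", "dislike"], repeat=2))
at_least_one_dislike = Fraction(
    sum("dislike" in outcome for outcome in outcomes), len(outcomes)
)
```

Note that `Fraction` reduces automatically, so the probability of rolling a 3 is stored as 1/18, which equals 2/36. The 3/4 figure also assumes the four like/dislike outcomes are equally likely.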
  • 18. Distribution of an expert’s assessment of how the DJIA (Dow Jones Industrial Average) might change next year.
  • 19. Airline Passengers: sample data on passenger demand for 25 flights ◦ The histogram shows a relatively symmetric distribution. The mean, median, and mode are all similar, although there is moderate skewness. A normal distribution is not unreasonable.
  • 20. Airport Service Times: sample data on service times for 812 passengers at an airport’s ticketing counter ◦ It is not clear what the distribution might be. It does not appear to be exponential, but it might be lognormal or another distribution.
  • 21. A better approach than simply visually examining a histogram and summary statistics is to analytically fit the data to the best type of probability distribution. Several statistics measure goodness of fit: ◦ AIC/BIC (Akaike information criterion/Bayesian information criterion) ◦ Chi-square (need at least 50 data points) ◦ Kolmogorov-Smirnov (works well for small samples) ◦ Anderson-Darling (puts more weight on the differences between the tails of the distributions). Analytic Solver Platform has the capability of fitting a probability distribution to data.
  • 22. 1. Highlight the data, then choose Analytic Solver Platform > Tools > Fit 2. In the Fit Options dialog, set Type: Continuous and Test: Kolmogorov-Smirnov, then click the Fit button
  • 23. The best-fitting distribution is an Erlang distribution.
  • 24. A random number is one that is uniformly distributed between 0 and 1. Excel function: =RAND( ). A value randomly generated from a specified probability distribution is called a random variate. ◦ Example: Uniform distribution
  • 25. Analysis ToolPak Random Number Generation tool ◦ Can sample from uniform, normal, Bernoulli, binomial, Poisson, patterned, and discrete distributions. ◦ Can also specify a random number seed - a value from which a stream of random numbers is generated. By specifying the same seed, you can produce the same random numbers at a later time.
  • 26. Generate 100 outcomes from a Poisson distribution with a mean of 12 ◦ Number of Variables = 1 ◦ Number of Random Numbers = 100 ◦ Distribution = Poisson ◦ Dialog changes and prompts you to enter Lambda (mean of Poisson) = 12
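The seeded Poisson example can be approximated outside Excel. The sketch below is an assumption-laden stand-in, not the Analysis ToolPak's generator: it uses Knuth's multiplication method for Poisson variates, and the seed value 12345 is arbitrary. It shows the reproducibility property the slide describes, where the same seed gives the same stream.

```python
import math
import random

def poisson_variate(lam, rng):
    """Knuth's method: count how many uniforms can be multiplied
    together before the product drops to exp(-lam) or below."""
    threshold = math.exp(-lam)
    count = 0
    running_product = rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count

def poisson_sample(lam, n, seed):
    """Generate n Poisson(lam) outcomes from a seeded generator."""
    rng = random.Random(seed)
    return [poisson_variate(lam, rng) for _ in range(n)]

outcomes = poisson_sample(lam=12, n=100, seed=12345)
repeat = poisson_sample(lam=12, n=100, seed=12345)   # same seed, same numbers
sample_mean = sum(outcomes) / len(outcomes)
```

The sample mean should land near 12, though any single run of 100 values will deviate somewhat.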
  • 28. Normal: =NORM.INV(RAND( ), mean, stdev). Standard normal: =NORM.S.INV(RAND( ))
  • 29. In finance, one way of evaluating capital budgeting projects is to compute a profitability index: PI = PV / I, where PV is the present value of future cash flows and I is the initial investment. What is the probability distribution of PI when PV is estimated to be normally distributed with a mean of $12 million and a standard deviation of $2.5 million, and the initial investment is also estimated to be normal with a mean of $3.0 million and standard deviation of $0.8 million?
  • 30. Column F: =NORM.INV(RAND(), 12, 2.5). Column G: =NORM.INV(RAND(), 3, 0.8)
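The spreadsheet simulation of the profitability index can be mirrored in Python. This is a sketch under stated assumptions: `NormalDist.inv_cdf` applied to a uniform random number plays the role of =NORM.INV(RAND(), mean, stdev), and the trial count (10,000) and seed are arbitrary choices, not values from the deck.

```python
import random
import statistics
from statistics import NormalDist

def normal_variate(mean, stdev, rng):
    """Inverse-transform sampling: push a uniform number through the inverse normal CDF."""
    u = rng.random()
    while u == 0.0:          # inv_cdf requires 0 < u < 1; random() can return 0.0
        u = rng.random()
    return NormalDist(mean, stdev).inv_cdf(u)

def simulate_pi(trials, rng):
    """Simulate PI = PV / I with PV ~ N(12, 2.5) and I ~ N(3, 0.8), in $ millions."""
    results = []
    for _ in range(trials):
        pv = normal_variate(12.0, 2.5, rng)
        investment = normal_variate(3.0, 0.8, rng)
        results.append(pv / investment)
    return results

rng = random.Random(2024)
pi_values = simulate_pi(10_000, rng)
median_pi = statistics.median(pi_values)
share_below_3 = sum(v < 3 for v in pi_values) / len(pi_values)
```

The median is used as the summary here because a ratio of two normal variables has heavy tails: an occasional investment draw close to zero produces an extreme PI value, which makes the sample mean unstable.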
  • 31. Analytic Solver Platform provides Excel functions to generate random variates for many distributions
  • 32. An energy company was considering offering a new product and needed to estimate the growth in PC ownership. Using the best data and information available, they determined that the minimum growth rate was 5.0%, the most likely value was 7.7%, and the maximum value was 10.0% (a triangular distribution). ◦ A portion of 500 samples that were generated using the function PsiTriangular(5%, 7.7%, 10%):
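A comparable set of 500 triangular variates can be drawn with Python's standard library. This is an illustrative stand-in for PsiTriangular, not its implementation. Note that `random.triangular` takes its arguments in the order (low, high, mode), and the seed is an arbitrary assumption.

```python
import random

rng = random.Random(7)
low, mode, high = 0.05, 0.077, 0.10   # minimum, most likely, maximum growth rate

# random.triangular's signature is (low, high, mode).
growth_rates = [rng.triangular(low, high, mode) for _ in range(500)]
mean_growth = sum(growth_rates) / len(growth_rates)
```

The theoretical mean of a triangular distribution is (low + mode + high) / 3, about 7.57% here, so the sample mean should fall close to that value.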
  • 34. Real data sets often have missing values or errors. Such data sets are called “dirty” and need to be “cleaned” prior to analyzing them. ◦ Handling missing data ◦ Handling outliers (observations that are radically different from the rest)
  • 35. Approaches for handling missing data: ◦ Eliminate the records/variables that contain missing data ◦ Estimate reasonable values for missing observations, such as the mean or median value ◦ Use a data mining procedure to deal with them. XLMiner has the capability to deal with missing data in the Transform menu in the Data Analysis group.
  • 36. XLMiner's Missing Data Handling utility allows users to detect missing values in the dataset and handle them in a specified way. XLMiner considers an observation to be missing data if the cell is empty or contains an invalid formula. In addition, it is also possible to treat cells containing specific data as “missing”. XLMiner offers several different methods for remedying missing or invalid values. Each variable can be assigned a different “treatment”. For example, the entire record could be deleted if there is a missing value for one variable, while the missing value could be replaced with a specific value for another variable.
  • 38. Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data quality issues and outliers. Some typical rules of thumb: ◦ z-scores greater than +3 or less than -3 ◦ Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3 ◦ Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3. Notes: A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean, which is independent of the units of measurement. The interquartile range (IQR), or midspread, is the difference between the third and first quartiles, Q3 - Q1.
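The rules of thumb above translate directly into code. The sketch below is illustrative only: the data values are invented, the function names are hypothetical, and quartiles come from `statistics.quantiles` with its default method, which may differ slightly from Excel's quartile conventions.

```python
import statistics

def z_scores(values):
    """Standardize each value: distance from the mean in sample standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

def iqr_outliers(values):
    """Split outliers into mild (1.5 to 3 IQRs beyond Q1/Q3) and extreme (more than 3 IQRs)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)
    return mild, extreme

# Made-up observations with two suspicious values at the end.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 18, 40]
scores = z_scores(data)
mild, extreme = iqr_outliers(data)
flagged_by_z = [v for v, z in zip(data, scores) if abs(z) > 3]
```

In this toy data the IQR rule flags 18 as mild and 40 as extreme, while the z-score rule flags only 40. Because the outlier itself inflates the standard deviation, the z-score rule tends to be the less sensitive of the two on small samples.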
  • 39. Home Market Value data: none of the z-scores exceed 3. However, while individual variables might not exhibit outliers, combinations of them might. ◦ The last observation has a high market value ($120,700) but a relatively small house size (1,581 square feet) and may be an outlier.
  • 40. Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis. A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets. ◦ If a model’s implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers.
  • 42. Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships. ◦ Example: the price/earnings (PE) ratio. A critical task is determining how to represent the measurements of the variables and which variables to consider. ◦ Example: The variable Language with the possible values of English, German, and Spanish would be replaced with three binary variables called English, German, and Spanish.
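The Language example can be sketched as a small one-hot (dummy variable) encoder. The function name and the sample records are hypothetical; only the three category names come from the slide.

```python
def one_hot(values, categories):
    """Replace each categorical value with one 0/1 indicator per category."""
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unexpected category: {value!r}")
        rows.append({category: int(value == category) for category in categories})
    return rows

languages = ["English", "German", "Spanish"]
records = ["German", "English", "Spanish", "German"]   # made-up observations
encoded = one_hot(records, languages)
```

Each encoded row contains exactly one 1, so the three binary columns carry the same information as the original Language column.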
  • 43. For $v \in [\min, \max]$: $v' = \frac{v - \min}{\max - \min} \, (\max_{new} - \min_{new}) + \min_{new}$, so that $v' \in [\min_{new}, \max_{new}]$
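The rescaling formula on this slide maps a value from its observed range onto a new range. A minimal sketch follows; the function name, the default target range of [0, 1], and the sample values are assumptions for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Map each v in [min, max] to v' in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant column (max equals min)")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

scaled = min_max_scale([96, 135, 252])
rescaled = min_max_scale([96, 135, 252], new_min=-1.0, new_max=1.0)
```

The guard against a constant column matters in practice, because the formula divides by max - min.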
  • 44. XLMiner provides a Transform Categorical procedure under Transform in the Data Analysis group. This procedure provides options to create dummy variables, create ordinal category scores, and reduce categories by combining them into similar groups.
  • 48. In some cases, it may be desirable to transform a continuous variable into categories. XLMiner provides a Bin Continuous Data procedure under Transform in the Data Analysis group. Caution: In general, transforming continuous variables into categories causes a loss of information (a continuous variable’s category is less informative than a specific numeric value) and increases the number of variables.
  • 50. XLMiner calculates the interval as (maximum value for the x3 variable - minimum value for the x3 variable) / number of bins specified by the user, or in this instance (252 - 96) / 4, which equals 39. Bin 12: Values 96 - < 135; Bin 15: Values 135 - < 174; Bin 18: Values 174 - < 213; Bin 21: Values 213 - 252
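The equal-width binning arithmetic can be reproduced as follows. This is a sketch of the general technique, not XLMiner's code: the function names are hypothetical, bins are numbered 0 to 3 rather than with XLMiner's bin labels, and the maximum value is placed in the last bin.

```python
def equal_width_edges(lo, hi, bins):
    """Bin boundaries for `bins` intervals of equal width between lo and hi."""
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]

def bin_index(value, lo, hi, bins):
    """Zero-based bin for a value; each bin includes its left edge,
    and the last bin also includes the maximum."""
    if not lo <= value <= hi:
        raise ValueError("value outside the binning range")
    width = (hi - lo) / bins
    return min(int((value - lo) // width), bins - 1)

edges = equal_width_edges(96, 252, 4)                        # width = (252 - 96) / 4 = 39
indices = [bin_index(v, 96, 252, 4) for v in (96, 134.9, 135, 213, 252)]
```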
  • 52. Cluster analysis, also called data segmentation, is a collection of techniques that seek to group or segment a collection of objects (observations or records) into subsets or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters. ◦ The objects within clusters should exhibit a high amount of similarity, whereas those in different clusters will be dissimilar.
  • 54. Hierarchical clustering: The data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object. k-Means clustering: Given a value of k, the k-means algorithm randomly partitions the observations into k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated. Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.
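The k-means loop described on this slide (random partition, compute centroids, reassign to the closest centroid) can be sketched as below. Several details are assumptions rather than anything stated in the deck: the loop repeats until assignments stop changing, an empty cluster is re-seeded with a randomly chosen observation, the `initial_labels` parameter exists only to make runs reproducible, and the six points are made up.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, rng, initial_labels=None, max_iter=100):
    # Step 1: random partition of the observations into k clusters.
    if initial_labels is not None:
        labels = list(initial_labels)
    else:
        labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster.
        centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster is re-seeded with a random observation.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 3: reassign every observation to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:      # assignments are stable, so stop
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
labels, centroids = k_means(points, k=2, rng=random.Random(0))
```

Because the starting partition is random, different seeds can give different cluster numbering and, on less cleanly separated data, different final clusters.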
  • 55. Agglomerative clustering methods proceed by a series of fusions of the n objects into groups (this is the method implemented in XLMiner). Divisive clustering methods separate n objects successively into finer groupings.
  • 56. Euclidean distance is the straight-line distance between two points. The Euclidean distance measure between two points (x1, x2, . . . , xn) and (y1, y2, . . . , yn) is $d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$
  • 57. Single linkage clustering (nearest-neighbor): The distance between groups is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered. At each stage, the closest 2 clusters are merged. Complete linkage clustering: The distance between groups is the distance between the most distant pair of objects, one from each group. Average linkage clustering: The distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group. Average group linkage clustering: Uses the mean values for each variable to compute distances between clusters. Ward’s hierarchical clustering: Uses a sum of squares criterion. Different methods generally yield different results, so it is best to experiment and compare the results.
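The linkage definitions differ only in how they summarize the Euclidean distances between two groups. The sketch below computes four of them for two tiny made-up groups. The function names are hypothetical, Ward's method is omitted, and average group linkage is interpreted here as the distance between the groups' mean points.

```python
import math
from itertools import product

def pair_distances(group_a, group_b):
    """Euclidean distance for every pair with one object from each group."""
    return [math.dist(p, q) for p, q in product(group_a, group_b)]

def single_linkage(group_a, group_b):
    return min(pair_distances(group_a, group_b))      # closest pair

def complete_linkage(group_a, group_b):
    return max(pair_distances(group_a, group_b))      # most distant pair

def average_linkage(group_a, group_b):
    distances = pair_distances(group_a, group_b)
    return sum(distances) / len(distances)            # mean over all pairs

def average_group_linkage(group_a, group_b):
    """Distance between the two groups' mean points."""
    mean_a = tuple(sum(c) / len(group_a) for c in zip(*group_a))
    mean_b = tuple(sum(c) / len(group_b) for c in zip(*group_b))
    return math.dist(mean_a, mean_b)

group_a = [(0, 0), (0, 1)]
group_b = [(3, 0), (4, 0)]
linkages = {
    "single": single_linkage(group_a, group_b),
    "complete": complete_linkage(group_a, group_b),
    "average": average_linkage(group_a, group_b),
    "average group": average_group_linkage(group_a, group_b),
}
```

Even on four points the measures disagree (single gives 3, complete gives about 4.12), which is why different linkage choices can merge clusters in a different order.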
  • 59. Colleges and Universities Data: Cluster the institutions using the five numeric columns in the data set. XLMiner > Data Analysis > Cluster > Hierarchical Clustering. Note: We are clustering the numerical variables, so School and Type are not included.
  • 60. Check the box Normalize input data to ensure that the distance measure accords equal weight to each variable. Use the Euclidean distance as the similarity measure for numeric data. Select the clustering method you wish to use.
  • 61. Select the number of clusters. (The agglomerative method of hierarchical clustering keeps forming clusters until only one cluster is left. This option lets you stop the process at a given number of clusters.) We selected four clusters.
  • 63. A dendrogram illustrates the fusions or divisions made at each successive stage of analysis. A horizontal line shows the cluster partitions.
  • 64. Predicted clusters ◦ shows the assignment of observations to the number of clusters we specified in the input dialog (in this case four). Number of colleges per cluster: Cluster 1: 23; Cluster 2: 22; Cluster 3: 3; Cluster 4: 1
  • 66. Dimension reduction - process of removing variables from the analysis without losing any crucial information. ◦ One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development. Dimension reduction in XLMiner: ◦ Feature selection ◦ Principal components
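The pairwise-correlation screen mentioned above can be sketched with a plain Pearson correlation. Everything specific here is an assumption for illustration: the column names and values are invented, and the 0.9 cutoff is an arbitrary threshold rather than a rule from the deck.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair whose |r| meets the threshold."""
    pairs = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs

# Hypothetical data: rooms is an exact linear function of sqft.
columns = {
    "sqft": [1000, 1500, 2000, 2500],
    "rooms": [4, 6, 8, 10],
    "age": [30, 5, 20, 10],
}
redundant = highly_correlated_pairs(columns)
```

A flagged pair such as sqft and rooms is a candidate for dropping or combining one of the two variables, since they supply nearly the same information.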
  • 67. Feature Selection attempts to identify the best subset of variables (or features) out of the available variables (or features) to be used as input to a classification or prediction method.
  • 69. The Principal Components procedure can be found on the XLMiner tab under Transform in the Data Analysis group. Principal components analysis creates a collection of metavariables (components) that are weighted sums of the original variables. These components are uncorrelated with each other and often only a few of them are needed to convey the same information as the large set of original variables.
  • 71. Data reduction - breaking down large sets of data into more-manageable groups or segments that provide better insight. ◦ Data sampling ◦ Data cleaning ◦ Data transformation ◦ Data segmentation ◦ Dimension reduction