Ingo Bentrott
      School of Marketing
University of Technology, Sydney
   “Vinod Shetty, of Mumbai, secretary of the newly formed
    Young Professionals Collective, said staff were subject to so
    much abuse that thousands of its workers were quitting in
    despair. The problem has become so bad that remaining
    workers are being forced to extend their shifts to 12 to 13
    hours a day to fill the gaps.

Although a call centre worker in India earns about $70 a week,
  twice as much as most professionals in a nation suffering
  chronic underemployment, up to 60 per cent leave their jobs
  each year.”
   (insert graph and logistic regression)

   If you run a logistic regression for BUY using
    the data on the left, you will get a response
    like the graphic on the right

   This is due to the well known issue of
    Listwise Deletion (LD)
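
A minimal sketch of what listwise deletion does to the data before a model such as the BUY logistic regression is estimated. The records and field names below are hypothetical stand-ins, not the dataset shown on the slide.

```python
# Hypothetical toy records; None marks a missing answer.
rows = [
    {"age": 34, "income": 52000, "user_rating": 4, "buy": 1},
    {"age": 51, "income": None, "user_rating": 3, "buy": 0},
    {"age": None, "income": 61000, "user_rating": 5, "buy": 1},
    {"age": 29, "income": 48000, "user_rating": None, "buy": 0},
    {"age": 45, "income": 75000, "user_rating": 2, "buy": 0},
]


def listwise_delete(records):
    """Keep only the records in which every field is observed."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = listwise_delete(rows)
print(len(rows), "rows before,", len(complete), "rows after listwise deletion")
```

Each dropped row has only one missing field, yet all of its observed answers are discarded along with it.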
   There are two types of non-response: complete non-
    response, where the person does not participate at all in
    the survey and item non-response where a survey is only
    partially completed

   Coleman (1991) mentioned that the rates of non-
    responses have remained constant but Jarvis (2002) says
    the rates are increasing when you control for answering
    machines.

   Respondents have grown to “strongly dislike” phone
    surveys.
    ◦ The primary concern is privacy, which has been made worse by
      well-publicized breaches in security (Jarvis 2002)

   In essence, whenever you have missing values in your data,
    you are forced to address them somehow
    ◦ Delete or Impute
   Missing data can be of three types
    ◦ Missing Completely at Random (MCAR)
      Missings are unrelated to the value of x or any other
       variable

    ◦ Missing at Random (MAR)
      Missingness is not a function of x when “controlled for other
       variable effects.”

    ◦ Non-ignorable missing
      Missing caused by an unmeasured variable
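
The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population, the 1000 * age income rule and the missingness probabilities are all made up, and the non-ignorable case is modelled as missingness driven by the unobserved income value itself.

```python
import random

random.seed(0)

# Hypothetical population in which income depends on age.
people = []
for _ in range(20000):
    age = random.uniform(20, 70)
    income = 1000 * age + random.gauss(0, 5000)
    people.append((age, income))


def mean(xs):
    return sum(xs) / len(xs)


def observed_income_mean(missing_prob):
    """Mean income among people whose income answer was not dropped."""
    kept = [inc for age, inc in people if random.random() >= missing_prob(age, inc)]
    return mean(kept)


true_mean = mean([inc for _, inc in people])
# MCAR: the chance of a missing answer is unrelated to any variable.
mcar = observed_income_mean(lambda age, inc: 0.3)
# MAR: the chance depends only on an observed variable (age).
mar = observed_income_mean(lambda age, inc: 0.6 if age > 45 else 0.1)
# Non-ignorable: the chance depends on the unobserved income itself.
nonignorable = observed_income_mean(lambda age, inc: 0.6 if inc > 45000 else 0.1)

print("true:", round(true_mean), "MCAR:", round(mcar),
      "MAR:", round(mar), "non-ignorable:", round(nonignorable))
```

Under MCAR the complete-case mean stays close to the true mean. Under MAR it is pulled down, but because age is observed the distortion can in principle be modelled; in the non-ignorable case the observed data alone do not reveal it.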
   Most current discrete choice studies are using stated preference
    designs
    ◦ Creates orthogonal Xs

   This is a way to reduce the number of respondents by getting as
    much data as possible out of fewer respondents

   Discrete choice studies based on Random Utility Theory (RUT) can
    give you excellent willingness to pay (WTP) estimates
    ◦ Complete cases are necessary for low variance estimation

   If data is collected by the same survey instrument, it is likely to have the
    same missing pattern across the Xs (Howell, 1998).

   Revealed Preference (RP) data usually has multicollinearity issues
    and the use of missing data indicators will exacerbate this issue.
   (insert graph)

   In our earlier example, most multiple
    imputation techniques would still have
    problems imputing a value for USER
    RATING above.

   If the only variables that can be used are AGE,
    INCOME and POST CODE, missings would be
    a linear combination of these
   Many statistics packages use Listwise Deletion (LD) by default when
    estimating a discrete choice model.
    ◦ In SEM models, VAR-COV matrix only uses valid data for
      estimation

   Leads to selection bias and estimates with reduced efficiency

   If data is MCAR, the only penalty is loss of power

   Multiple Imputation (MI) makes multiple imputes for the same data point and
    averages the results
    ◦ MI is a main-effects only model; CART/MARS use interactions, so
      we may not need multiple imputes

   “Hot Deck” imputation (Little and Rubin, 1987) is a technique in which
    you fill in values taken from similar cases (similar to surrogates in
    CART)
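
A minimal hot deck sketch. It assumes a nearest-neighbour donor rule on a single matching variable, which is only one of several hot deck variants; the records and the function name `hot_deck` are hypothetical.

```python
# Hypothetical records; None marks a missing INCOME value.
records = [
    {"age": 25, "income": 40000},
    {"age": 27, "income": None},
    {"age": 52, "income": 90000},
    {"age": 50, "income": None},
    {"age": 38, "income": 60000},
]


def hot_deck(rows, target, match_on):
    """Fill each missing `target` with the value of the closest donor on `match_on`."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        r = dict(r)  # copy so the original rows are left untouched
        if r[target] is None:
            donor = min(donors, key=lambda d: abs(d[match_on] - r[match_on]))
            r[target] = donor[target]
        filled.append(r)
    return filled


completed = hot_deck(records, target="income", match_on="age")
for row in completed:
    print(row)
```

The 27-year-old borrows the income of the 25-year-old donor and the 50-year-old borrows from the 52-year-old donor, so every imputed value is one that actually occurs in the data.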
   Expectation Maximization (EM) has been successfully
    applied to missing data but standard errors must be
    obtained using auxiliary methods.
    ◦ Missings are imputed during EM

   FIML and ML methods assume multivariate normality
    ◦ These techniques are best when there are a few, distinct
      patterns of missing data (Little, Schnabel, Baumert, 2000).

   If the data is MAR and not MCAR, techniques that assume
    MCAR (listwise deletion in particular) will be biased
    ◦ Since MAR implies another “observed” explanatory variable
      is affecting the missing, interactions in CART/MARS can
      pick this up.
   Most missing data tends to act in combination (Borgoni
    and Berrington, 2004)

   We should not try to “break” the multivariate nature of the
    data.
    ◦ CART uses surrogates, so even though we impute data one
      variable at a time, the structure will be preserved.

   Most imputation techniques assume multivariate normal.

   Imputation sometimes assumes data is MCAR, but if the
    data is non-monotonic and has a high degree of interactions,
    CART, by its nature, will perform better on data that is MAR

   The EM algorithm has been proven to be good but imputes
    missings only during estimation
    ◦ CART technique can fill the dataset for later analysis.
   If data has high dimensionality and data sparseness, the univariate
    nature of CART will be better able to handle this than Multiple
    Imputation using regression.

   Trees are also less prone to outliers and misspecified models

   Although a multiple iteration tree, using multiple draws from CART's
    conditional distribution, is shown to be better in Monte Carlo
    studies (Borgoni and Berrington, 2004), the results are within a
    standard error of the “one shot” variable at a time CART imputation
    technique.
    ◦ One shot has some added variability (like other techniques) but
      standard errors may be underestimated.
    ◦ Extra information gathered from imputation may offset extra
      variability

   If the data is MCAR, using a simple Pearson Chi Square test of
    Observed versus Expected values validates the imputed values.
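
One way such a check could be computed is sketched below. It assumes a categorical variable and takes the expected counts from the category proportions among the non-missing cases; the tenure categories and counts are invented for illustration.

```python
from collections import Counter


def pearson_chi_square(observed_values, imputed_values):
    """Pearson statistic comparing imputed category counts with the counts
    expected from the category proportions of the non-missing cases."""
    obs_counts = Counter(observed_values)
    imp_counts = Counter(imputed_values)
    n_obs, n_imp = len(observed_values), len(imputed_values)
    stat = 0.0
    for category, count in obs_counts.items():
        expected = n_imp * count / n_obs
        stat += (imp_counts.get(category, 0) - expected) ** 2 / expected
    return stat, len(obs_counts) - 1  # statistic, degrees of freedom


# Hypothetical example: 100 observed answers and 20 imputed ones.
observed = ["own"] * 60 + ["rent"] * 40
imputed = ["own"] * 14 + ["rent"] * 6

stat, df = pearson_chi_square(observed, imputed)
print("chi-square =", round(stat, 4), "with", df, "degree(s) of freedom")
```

Here the expected imputed counts are 12 and 8, giving a statistic of about 0.83, well below the 5% critical value of 3.84 for one degree of freedom, so the imputed distribution is not distinguishable from the observed one.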
   (insert table of Descriptive Statistics)

   The diagnostic, binary-valued variable investigated is whether
    the patient shows signs of diabetes according to World Health
    Organization criteria (i.e., if the 2 hour post-load plasma
    glucose was at least 200 mg/dl at any survey examination or
    if found during routine medical care). The population lives
    near Phoenix, Arizona, USA.
   (insert table of Descriptive Statistics)

   This is a dataset with information about renters
    and homeowners. The dataset is a good mixture of
    categorical and continuous variables with a lot of
    missing data.
   This survey is aimed at gathering some
    information about your preferences for athletic
    shoes. More specifically, the product in question
    is an athletic shoe that is to be used primarily for
    playing a sport (or several sports). For example,
    the shoes could be used for playing basketball,
    tennis, running, hiking, and so on.

   Since the questions asked are from a balanced
    stated preference (SP) design, there are only
    missing values in the demographic questions
   (insert table on Descriptive Statistics)
   This presentation looks at 5 different modeling techniques on
    the 3 datasets mentioned previously.

   Model 1. The first model was a simple logistic regression using
    all variables
    ◦ No transformations
    ◦ Listwise deletion was used for missing values

   Model 2. A MARS model was then run with main effects only and
    all model defaults
    ◦ Since the data is binary, this is a Linear Probability Model (LPM)

   Model 3. Mean imputation was used in a logit model

   Model 4. MARS basis functions were then put into logistic
    regression to recover standard errors and eliminate the need for
    weighted least squares in LPM
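
A rough sketch of the idea behind Model 4: build MARS-style hinge basis functions, then use them as regressors in a logistic regression. The knot at 5, the toy data and the plain gradient-ascent fitter are assumptions for illustration; a real MARS run selects its own knots and a statistics package would supply the standard errors.

```python
import math

KNOT = 5.0  # hypothetical knot; MARS would choose this from the data


def hinge_pair(x, knot):
    """MARS-style mirrored hinge basis functions for one knot."""
    return max(0.0, x - knot), max(0.0, knot - x)


def fit_logit(X, y, lr=0.1, epochs=5000):
    """Plain gradient-ascent logistic regression; returns [intercept, slopes...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            grad[0] += yi - p
            for j, xj in enumerate(xi):
                grad[j + 1] += (yi - p) * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w


# Toy non-monotonic response: the event occurs only for mid-range x.
xs = list(range(11))
ys = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
w = fit_logit([hinge_pair(x, KNOT) for x in xs], ys)


def predict(x):
    b1, b2 = hinge_pair(x, KNOT)
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * b1 + w[2] * b2)))


for x in (0, 5, 10):
    print(x, round(predict(x), 3))
```

Because the hinge pair lets the logit rise and then fall around the knot, the fitted probability peaks mid-range while staying between 0 and 1.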
   Step 1. Sort the variables with missing values from least to
    most missing

   Step 2. Starting with the least missing variable, partition
    the data into one data set with that variable's missing
    values and one data set with complete cases

   Step 3. Estimate a tree with the least missing variable as a
    target

   Step 4. Score the data set with missing values from the
    results in step 3

   Step 5. Repeat for the next affected variable until all data
    is filled
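
The five steps can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: the tree is a tiny hand-rolled regression tree rather than CART proper, a missing split value stops the descent at the current node instead of using surrogates, at least a few complete cases must exist, and the records are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(rows, target, features, depth=2, min_leaf=2):
    """Tiny regression tree as nested dicts; every node stores its mean."""
    node = {"mean": sum(r[target] for r in rows) / len(rows)}
    if depth == 0 or len(rows) < 2 * min_leaf:
        return node
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows})[1:]:
            left = [r[target] for r in rows if r[f] < t]
            right = [r[target] for r in rows if r[f] >= t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is not None:
        _, f, t = best
        node["feature"], node["threshold"] = f, t
        node["left"] = fit_tree([r for r in rows if r[f] < t],
                                target, features, depth - 1, min_leaf)
        node["right"] = fit_tree([r for r in rows if r[f] >= t],
                                 target, features, depth - 1, min_leaf)
    return node


def predict(node, row):
    """Descend the tree; stop at the node mean if a split value is missing."""
    while "feature" in node and row[node["feature"]] is not None:
        go_left = row[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["mean"]


def cart_impute(rows, variables):
    rows = [dict(r) for r in rows]  # work on a copy

    def n_missing(v):
        return sum(r[v] is None for r in rows)

    # Step 1: order the affected variables from least to most missing.
    affected = sorted((v for v in variables if n_missing(v) > 0), key=n_missing)
    for target in affected:  # Step 5: repeat for each affected variable
        features = [v for v in variables if v != target]
        # Step 2: rows missing the target versus complete cases.
        missing = [r for r in rows if r[target] is None]
        complete = [r for r in rows if all(r[v] is not None for v in variables)]
        # Step 3: grow a tree with the target variable as the response.
        tree = fit_tree(complete, target, features)
        # Step 4: score the rows whose target is missing.
        for r in missing:
            r[target] = predict(tree, r)
    return rows


data = [
    {"age": 22, "income": 30, "rating": 2},
    {"age": 25, "income": 32, "rating": 2},
    {"age": 28, "income": 35, "rating": 3},
    {"age": 45, "income": 70, "rating": 4},
    {"age": 48, "income": 72, "rating": 5},
    {"age": 52, "income": 75, "rating": 5},
    {"age": 24, "income": None, "rating": 2},
    {"age": 50, "income": None, "rating": None},
    {"age": 47, "income": 71, "rating": None},
]
filled = cart_impute(data, ["age", "income", "rating"])
for row in filled[6:]:
    print(row)
```

Rows completed in an earlier pass become complete cases for the next variable, which is how the data set fills up one variable at a time. A regression tree returns node means, so a categorical target would need a classification tree instead.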
   (insert graph)

   Regression by logit will yield a different shape
    than a linear probability model

   Some cases will be classified differently using
    the same basis functions from MARS
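
The difference in shape can be seen on a toy example. The sketch below fits a linear probability model by ordinary least squares and a logit by plain gradient ascent to the same hypothetical data; the data and the fitting routines are assumptions for illustration only.

```python
import math

# Hypothetical binary outcome that becomes more likely as x grows.
xs = list(range(11))
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]


def fit_lpm(xs, ys):
    """Linear probability model: ordinary least squares of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def fit_logit(xs, ys, lr=0.05, epochs=20000):
    """One-regressor logistic regression fitted by gradient ascent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p
            gb += (y - p) * x
        a += lr * ga / n
        b += lr * gb / n
    return a, b


a_lpm, b_lpm = fit_lpm(xs, ys)
a_log, b_log = fit_logit(xs, ys)

lpm_preds = [a_lpm + b_lpm * x for x in xs]
logit_preds = [1.0 / (1.0 + math.exp(-(a_log + b_log * x))) for x in xs]

for x, p_lin, p_log in zip(xs, lpm_preds, logit_preds):
    print(x, round(p_lin, 3), round(p_log, 3))
```

The straight line leaves the unit interval at both ends of the range, while the logit curve flattens towards 0 and 1, so cases near the cut-off can end up on different sides under the two models.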
   (insert table)
   (insert table)
   The data on Shoe buyers is “real” in that it was an
    SP study that was deployed

   The nature of orthogonal design forced trade offs
    and controls for interactions

   The Pima Indian and Home Owner datasets are
    well known and have well defined patterns
    amongst the Xs

   If the buyers are the class of interest, a
    CART/MARS imputation is clearly preferred
   CART and MARS will perform better on mixed data types and
    should be the preferred imputation modeling technique
    ◦ Possible CART → MARS → Logit technique to capture all possible non-
      monotonics

   Web based surveys allow us to see when people quit the survey

   Can investigate if the person looked at all questions and refused
    some
    ◦ In mail surveys, this is impossible
    ◦ The web will expand our missing data categories, as a complete survey
      then means someone who viewed and answered all the questions (Bosnjak and
      Tuten, 2001)

   Paying survey respondents still works best for reducing
    non-response
    ◦ CART can be used with ROC/lift charts to see what the optimal amount of
      payment per completed survey is
    ◦ Many companies would be willing to pay for this completeness (Coleman,
      1991)
Athletic Shoe Preferences Survey Data Analysis

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Último (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Athletic Shoe Preferences Survey Data Analysis

  • 1. Ingo Bentrott, School of Marketing, University of Technology, Sydney
  • 2. “Vinod Shetty, of Mumbai, secretary of the newly formed Young Professionals Collective, said staff were subject to so much abuse that thousands of its workers were quitting in despair. The problem has become so bad that remaining workers are being forced to extend their shifts to 12 to 13 hours a day to fill the gaps. Although a call centre worker in India earns about $70 a week- twice as much as most professionals in a nation suffering chronic underemployment- up to 60 per cent leave their jobs each year.”
  • 3.
      • (insert graph and logistic regression)
      • If you run a logistic regression for BUY using the data on the left, you will get a response like the graphic on the right
      • This is due to the well-known issue of Listwise Deletion (LD)
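A minimal Python sketch of what listwise deletion does to a dataset before a model such as the logistic regression for BUY is fitted. The records and their values are hypothetical; only the variable names (BUY, AGE, INCOME, USER RATING) come from the slides.

```python
# Hypothetical survey records; None marks a missing answer.
rows = [
    {"buy": 1, "age": 34, "income": 52000, "user_rating": 4},
    {"buy": 0, "age": 51, "income": None, "user_rating": 2},
    {"buy": 1, "age": None, "income": 61000, "user_rating": None},
    {"buy": 0, "age": 29, "income": 38000, "user_rating": 3},
    {"buy": 1, "age": 45, "income": 70000, "user_rating": None},
]


def listwise_delete(records):
    """Keep only the records that have no missing value in any field."""
    return [r for r in records if all(v is not None for v in r.values())]


def buy_rate(records):
    """Share of records with buy == 1."""
    return sum(r["buy"] for r in records) / len(records)


complete = listwise_delete(rows)
print(f"{len(complete)} of {len(rows)} cases survive listwise deletion")
print(f"buy rate, all cases:      {buy_rate(rows):.2f}")
print(f"buy rate, complete cases: {buy_rate(complete):.2f}")
```

In this toy data the buyers are the ones who skipped questions, so the model would be estimated on fewer cases and on a different mix of buyers and non-buyers than the survey actually collected.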
  • 4.
      • There are two types of non-response: complete non-response, where the person does not participate in the survey at all, and item non-response, where a survey is only partially completed
      • Coleman (1991) reported that non-response rates have remained constant, but Jarvis (2002) says the rates are increasing when you control for answering machines
      • Respondents have grown to “strongly dislike” phone surveys
          ◦ The primary concern is privacy, which has been made worse by well-publicized breaches in security (Jarvis 2002)
      • In essence, whenever you have missing values in your data, you are forced to address them somehow
          ◦ Delete or impute
  • 5.
      • Missing data can be of three types
          ◦ Missing Completely at Random (MCAR)
              ▪ Missingness is unrelated to the value of x or any other variable
          ◦ Missing at Random (MAR)
              ▪ Missingness is not a function of x when “controlled for other variable effects.”
          ◦ Non-ignorable missing
              ▪ Missingness is caused by an unmeasured variable
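The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population, the coefficients, the missingness probabilities and the "privacy concern" variable are all hypothetical and chosen to make the contrast visible.

```python
import random

random.seed(0)

# Hypothetical population: income depends on age (recorded in the data) and
# on privacy concern (never recorded, so it plays the unmeasured variable).
people = []
for _ in range(20000):
    age = random.uniform(20, 70)
    concern = random.gauss(0, 1)
    income = 1000 * age + 8000 * concern + random.gauss(0, 5000)
    people.append((age, concern, income))


def observed_mean(prob_missing):
    """Mean income over the cases whose income survives the missingness rule."""
    kept = [inc for age, concern, inc in people
            if random.random() >= prob_missing(age, concern)]
    return sum(kept) / len(kept)


true_mean = sum(inc for _, _, inc in people) / len(people)

# MCAR: every income has the same chance of being missing.
mcar_mean = observed_mean(lambda age, concern: 0.3)
# MAR: the chance depends only on age, which is observed.
mar_mean = observed_mean(lambda age, concern: 0.6 if age > 45 else 0.1)
# Non-ignorable: the chance depends on a variable that is not in the data.
nonignorable_mean = observed_mean(lambda age, concern: 0.7 if concern > 0 else 0.1)

for label, value in [("true", true_mean), ("MCAR", mcar_mean),
                     ("MAR", mar_mean), ("non-ignorable", nonignorable_mean)]:
    print(f"{label:>14}: {value:10.0f}")
```

Under MCAR the mean of the observed incomes stays close to the true mean. Under MAR it drifts, but the driver (age) is in the data and can be modelled. In the non-ignorable case it drifts and the driver is not available to control for.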
  • 6.
      • Most current discrete choice studies use stated preference designs
          ◦ Creates orthogonal Xs
      • This is a way to reduce the number of respondents by getting as much data as possible out of fewer respondents
      • Discrete choice studies based on Random Utility Theory (RUT) can give excellent willingness-to-pay (WTP) estimates
          ◦ Complete cases are necessary for low-variance estimation
      • If data is collected with the same survey instrument, it is likely to have the same missing pattern across the Xs (Howell, 1998)
      • Revealed Preference (RP) data usually has multicollinearity issues, and the use of missing data indicators will exacerbate them
  • 7.
      • (insert graph)
      • In our earlier example, most multiple imputation techniques would still have problems imputing a value for USER RATING above
      • If the only variables that can be used are AGE, INCOME and POST CODE, the imputed missings would be a linear combination of these
  • 8.
      • Many statistics packages use Listwise Deletion (LD) by default when estimating a discrete choice model
          ◦ In SEM models, the variance-covariance matrix uses only valid data for estimation
      • LD leads to selection bias and estimates with reduced efficiency
      • If data is MCAR, the only penalty is loss of power
      • Multiple Imputation (MI) makes multiple imputes for the same data point and averages the results
          ◦ MI is a main-effects-only model; CART/MARS use interactions, so we may not need multiple imputes
      • “Hot Deck” imputation (Little and Rubin, 1987) is a technique that fills in values taken from similar cases (similar to surrogates in CART)
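A short Python sketch contrasting mean imputation with one simple variant of hot deck imputation, in which the donor is the nearest complete case. The data, the `income_k` field (income in thousands) and the squared-distance matching rule are assumptions made for illustration, not the procedure described by Little and Rubin.

```python
# Hypothetical records; None marks a missing USER RATING.
rows = [
    {"age": 25, "income_k": 40, "user_rating": 2},
    {"age": 27, "income_k": 42, "user_rating": None},
    {"age": 60, "income_k": 90, "user_rating": 5},
    {"age": 58, "income_k": 85, "user_rating": None},
]


def mean_impute(records, target):
    """Fill every missing target with the mean of the observed targets."""
    observed = [r[target] for r in records if r[target] is not None]
    fill = sum(observed) / len(observed)
    return [{**r, target: fill if r[target] is None else r[target]}
            for r in records]


def hot_deck(records, target, match_on):
    """Fill each missing target with the value of the most similar donor.

    Similarity is squared distance on the match_on fields, which are
    assumed to be fully observed and on comparable scales.
    """
    donors = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        r = dict(r)  # copy so the input is left untouched
        if r[target] is None:
            nearest = min(
                donors,
                key=lambda d: sum((d[k] - r[k]) ** 2 for k in match_on),
            )
            r[target] = nearest[target]
        filled.append(r)
    return filled


mean_filled = mean_impute(rows, "user_rating")
hot_filled = hot_deck(rows, "user_rating", ["age", "income_k"])
print([r["user_rating"] for r in mean_filled])
print([r["user_rating"] for r in hot_filled])
```

Mean imputation gives both incomplete cases the same value, while the hot deck version lets the young, lower-income respondent borrow from a similar donor and the older, higher-income respondent from another.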
  • 9.
      • Expectation Maximization (EM) has been successfully applied to missing data, but standard errors must be obtained using auxiliary methods
          ◦ Missing values are imputed during EM
      • FIML and ML methods assume multivariate normality
          ◦ These techniques are best when there are a few distinct patterns of missing data (Little, Schnabel, Baumert, 2000)
      • If the data is MAR and not MCAR, all the above techniques will be biased
          ◦ Since MAR implies another “observed” explanatory variable is affecting the missingness, interactions in CART/MARS can pick this up
  • 10.
      • Most missing data tends to act in combination (Borgoni and Berrington, 2004)
      • We should not try to “break” the multivariate nature of the data
          ◦ CART uses surrogates, so even though we impute data one variable at a time, the structure will be preserved
      • Most imputation techniques assume multivariate normality
      • Imputation sometimes assumes the data is MCAR, but if the data is non-monotonic and has a high degree of interaction, CART will by its nature perform better on data that is MAR
      • The EM algorithm has been proven to be good, but it deals with missings only during estimation
          ◦ The CART technique can fill the dataset for later analysis
  • 11.
      • If the data has high dimensionality and sparseness, the univariate nature of CART will handle this better than Multiple Imputation using regression
      • Trees are also less sensitive to outliers and misspecified models
      • Although a multiple-iteration tree, using multiple draws from CART's conditional distribution, is shown to be better in Monte Carlo studies (Borgoni and Berrington, 2004), the results are within a standard error of the “one shot” variable-at-a-time CART imputation technique
          ◦ One shot has some added variability (like other techniques), but standard errors may be underestimated
          ◦ Extra information gathered from imputation may offset the extra variability
      • If the data is MCAR, a simple Pearson chi-square test of observed versus expected values validates the imputed values
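The slide does not spell out how the chi-square check is set up. One possible reading, sketched below in Python, compares the category counts among the imputed values with the counts expected if the imputed cases followed the same distribution as the observed cases, which is what MCAR would imply. The counts and category labels are hypothetical.

```python
def pearson_chi_square(actual_counts, expected_counts):
    """Pearson statistic: sum over categories of (actual - expected)^2 / expected."""
    return sum(
        (actual_counts[k] - expected_counts[k]) ** 2 / expected_counts[k]
        for k in expected_counts
    )


# Hypothetical category counts for one variable.
observed_cases = {"low": 50, "mid": 30, "high": 20}  # cases with a recorded value
imputed_cases = {"low": 22, "mid": 17, "high": 11}   # cases filled by imputation

n_observed = sum(observed_cases.values())
n_imputed = sum(imputed_cases.values())

# Counts expected among the imputed cases under the observed-case proportions.
expected = {k: n_imputed * c / n_observed for k, c in observed_cases.items()}

statistic = pearson_chi_square(imputed_cases, expected)

# Critical value of the chi-square distribution with 2 degrees of freedom
# (3 categories minus 1) at the 5% level.
CRITICAL_5PCT_DF2 = 5.991
consistent = statistic < CRITICAL_5PCT_DF2
print(f"chi-square = {statistic:.3f}, consistent with observed distribution: {consistent}")
```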
  • 12.
      • (insert table of Descriptive Statistics)
      • The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2-hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.
  • 13.
      • (insert table of Descriptive Statistics)
      • This is a dataset with information about renters and homeowners. The dataset is a good mixture of categorical and continuous variables with a lot of missing data.
  • 14.
      • This survey is aimed at gathering some information about your preferences for athletic shoes. More specifically, the product in question is an athletic shoe that is to be used primarily for playing a sport (or several sports). For example, the shoes could be used for playing basketball, tennis, running, hiking, and so on.
      • Since the questions asked are from a balanced stated preference (SP) design, there are missing values only in the demographic questions
  • 15. (insert table on Descriptive Statistics)
  • 16.
      • This presentation looks at 5 different modeling techniques on the 3 datasets mentioned previously
      • Model 1. The first model was a simple logistic regression using all variables
          ◦ No transformations
          ◦ Listwise deletion was used for missing values
      • Model 2. A MARS model was then run with main effects only and all model defaults
          ◦ Since the target is binary, this is a Linear Probability Model (LPM)
      • Model 3. Mean imputation was used in a logit model
      • Model 4. MARS basis functions were then put into a logistic regression to recover standard errors and eliminate the need for weighted least squares in the LPM
  • 17.
      • Step 1. Sort the variables with missing values from fewest to most missing
      • Step 2. Starting with the least-missing variable, partition the data into one data set with that variable's missing values and one data set with complete cases
      • Step 3. Estimate a tree with the least-missing variable as the target
      • Step 4. Score the data set with missing values using the tree from Step 3
      • Step 5. Repeat for the next affected variable until all data is filled
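The five steps can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions, not the CART software used in the study: the tree is a tiny hand-written regression tree for numeric variables only, it has no surrogate splits (a case with a missing split variable simply receives the mean of the node it has reached), each tree is trained on every row where the target is observed, and the data, `depth` and `min_leaf` values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(rows, target, predictors, depth=2, min_leaf=2):
    """Tiny regression tree: a stand-in for CART, without surrogates or pruning."""
    node = {"mean": sum(r[target] for r in rows) / len(rows)}
    if depth == 0 or len(rows) < 2 * min_leaf:
        return node
    best_gain, best = 0.0, None
    for p in predictors:
        known = [r for r in rows if r[p] is not None]
        if not known:
            continue
        parent = sse([r[target] for r in known])
        for cut in sorted({r[p] for r in known}):
            left = [r[target] for r in known if r[p] <= cut]
            right = [r[target] for r in known if r[p] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            gain = parent - sse(left) - sse(right)
            if gain > best_gain:
                best_gain, best = gain, (p, cut)
    if best is None:
        return node
    p, cut = best
    known = [r for r in rows if r[p] is not None]
    node["var"], node["cut"] = p, cut
    node["left"] = fit_tree([r for r in known if r[p] <= cut],
                            target, predictors, depth - 1, min_leaf)
    node["right"] = fit_tree([r for r in known if r[p] > cut],
                             target, predictors, depth - 1, min_leaf)
    return node


def predict(node, row):
    """Walk down the tree; stop early if the split variable is missing."""
    while "var" in node:
        value = row[node["var"]]
        if value is None:
            break  # no surrogate here: use the mean of the current node
        node = node["left"] if value <= node["cut"] else node["right"]
    return node["mean"]


def cart_impute(rows, variables):
    data = [dict(r) for r in rows]  # work on a copy

    def n_missing(v):
        return sum(r[v] is None for r in data)

    # Step 1: order the affected variables from fewest to most missing.
    order = sorted((v for v in variables if n_missing(v) > 0), key=n_missing)
    for target in order:  # Step 5: repeat for each affected variable
        # Step 2: split into rows with and without a value for the target.
        train = [r for r in data if r[target] is not None]
        to_fill = [r for r in data if r[target] is None]
        # Step 3: grow a tree with this variable as the target.
        tree = fit_tree(train, target, [v for v in variables if v != target])
        # Step 4: score the rows whose target is missing.
        for r in to_fill:
            r[target] = predict(tree, r)
    return data


# Hypothetical data; income is in thousands.
rows = [
    {"age": 25, "income": 30, "rating": 2},
    {"age": 30, "income": 35, "rating": 2},
    {"age": 27, "income": None, "rating": 2},
    {"age": 50, "income": 70, "rating": 4},
    {"age": 55, "income": 75, "rating": None},
    {"age": 60, "income": 80, "rating": 5},
    {"age": 28, "income": 32, "rating": None},
    {"age": 58, "income": 78, "rating": 5},
]
filled = cart_impute(rows, ["age", "income", "rating"])
for r in filled:
    print(r)
```

Because income has fewer missing values than rating, it is filled first, and the filled income values are then available as predictors when the rating tree is grown.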
  • 18.
      • (insert graph)
      • Regression by logit will yield a different shape than a linear probability model
      • Some cases will be classified differently using the same basis functions from MARS
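A small Python sketch of why the two shapes differ. It uses a single MARS-style hinge basis function on AGE and feeds it into a linear probability model and into a logit. The knot at 40 and all coefficients are made up for illustration and are not estimates from the study.

```python
import math


def hinge(x, knot):
    """MARS-style basis function max(0, x - knot)."""
    return max(0.0, x - knot)


def lpm_prob(age):
    """Linear probability model on the basis function (hypothetical coefficients)."""
    return 0.2 + 0.04 * hinge(age, 40)


def logit_prob(age):
    """Logistic regression on the same basis function (hypothetical coefficients)."""
    eta = -1.5 + 0.15 * hinge(age, 40)
    return 1 / (1 + math.exp(-eta))


for age in (30, 49, 70):
    p_lpm, p_logit = lpm_prob(age), logit_prob(age)
    print(f"age {age}: LPM {p_lpm:.3f} -> class {int(p_lpm >= 0.5)}, "
          f"logit {p_logit:.3f} -> class {int(p_logit >= 0.5)}")
```

With these illustrative numbers the LPM prediction exceeds 1 at age 70 while the logit stays inside (0, 1), and the case at age 49 falls on different sides of the 0.5 cut-off under the two models.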
  • 19. (insert table)
  • 20. (insert table)
  • 21.
      • The data on shoe buyers is “real” in that it came from an SP study that was deployed
      • The nature of the orthogonal design forced trade-offs and controls for interactions
      • The Pima Indian and Home Owner datasets are well known and have well-defined patterns amongst the Xs
      • If the buyers are the class of interest, a CART/MARS imputation is clearly preferred
  • 22.
      • CART and MARS will perform better on mixed data types and should be the preferred imputation modeling technique
          ◦ Possible CART → MARS → Logit technique to capture all possible non-monotonics
      • Web-based surveys allow us to see when people quit the survey
      • Can investigate if the person looked at all questions and refused some
          ◦ In mail surveys, this is impossible
          ◦ The web will expand our missing data categories, as a complete survey means someone viewed and answered all the questions (Bosnjak and Tuten, 2001)
      • If survey respondents are paid, this still works best for reducing non-response
          ◦ CART can be used with ROC/lift charts to see what the optimal payment per completed survey is
          ◦ Many companies would be willing to pay for this completeness (Coleman, 1991)