Exercise
Data Preparation
Modeling Example



       Business: National veterans’ organization
       Objective: From population of lapsing
                  donors, identify individuals
                  worth continued solicitation.

         Source: 1998 KDD-Cup Competition
                 via UCI KDD Archive

2
The Story

     A national veterans’ organization seeks to better target its solicitations for
      donation. By soliciting only the most likely donors, the organization will
      spend less on solicitation efforts and have more money available for
      charitable concerns.
     Solicitations involve sending a small gift to an individual together with a
      request for donation. Gifts include mailing labels and greeting cards.
     Of particular interest is the class of individuals identified as lapsing
      donors. These individuals made their most recent donation between 12
      and 24 months ago. The organization found that by predicting the
      response behavior of this group, they can use the model to rank all 3.5
      million individuals in their database.
     The current campaign refers to a greeting card mailing sent in 06/1997.
     The source of this data is the Association for Computing Machinery’s
      (ACM) 1998 KDD-Cup competition.



3
Additional Data Preparation


    The raw analysis data has been reduced for the purpose of this course. A subset of
    slightly over 19,000 records has been selected for modeling. As will be seen, this
    subset was not chosen arbitrarily. In addition, the 481 fields have been reduced to 50.




            Final Analysis Data                              Raw Analysis Data

            19,372 Records                                   95,412 Records
            50 Fields                                        481 Fields



4
Analysis Data Definition



           Donor master data
       CONTROL_NUMBER          Unique Donor ID
       MONTHS_SINCE_ORIGIN     Elapsed time since first donation
       IN_HOUSE                1=Given to In House program,
                               0=Not In House donor




5
Analysis Data Definition



           Demographic and other overlay data
       OVERLAY_SOURCE           M=Metromail, P=Polk, B=both
       DONOR_AGE                Age as of June 1997
       DONOR_GENDER             Actual or inferred gender
       PUBLISHED_PHONE          Published telephone listing
       HOME_OWNER               H=homeowner, U=unknown
       MOR_HIT                  Mail order response hit rate

6
Analysis Data Definition

        SES is a roll-up of the socio-economic field CLUSTER_CODE



             Demographic and other overlay data
       CLUSTER_CODE                        54 Socio-economic cluster codes
       SES                                 5 Socio-economic cluster codes
       INCOME_GROUP                        7 income group levels
       MED_HOUSEHOLD_INCOME                Median income in $100s
       PER_CAPITA_INCOME                   Income per capita in dollars
       WEALTH_RATING                       10 wealth rating groups

7
Analysis Data Definition



           Demographic and other overlay data
       MED_HOME_VALUE           Median home value in $100s
       PCT_OWNER_OCCUPIED       Percent owner occupied housing
       URBANICITY               U=urban, C=city, S=suburban,
                                T=town, R=rural, ?=unknown




8
Analysis Data Definition



           Census overlay data

       PCT_MALE_MILITARY         Percent male military in block
       PCT_MALE_VETERANS         Percent male veterans in block
       PCT_VIETNAM_VETERANS      Percent Vietnam veterans in block
       PCT_WWII_VETERANS         Percent WWII veterans in block




9
Analysis Data Definition



            Transaction detail data

        NUMBER_PROM_12          Number promotions last 12 mos.
        CARD_PROM_12            Number card promotions last 12 mos.



            [Timeline figure: 1994 to 1998, with the 97NK campaign marked]
10
Analysis Data Definition



            Transaction detail data

        FREQ_STATUS_97NK            Frequency status, June `97
        RECENCY_STATUS_96NK         Recency status, June `96
        MONTHS_SINCE_LAST           Months since last donation
        LAST_GIFT_AMT               Amount of most recent donation
            [Timeline figure: 1994 to 1998, with the 96NK and 97NK campaigns marked]
11
Analysis Data Definition
     The sampling method implies that no one made a donation between 6/1996 and 6/1997.
     However, for a limited number of cases, the number of months since last gift is fewer
     than 12. This contradiction is not resolved in the data’s documentation, nor will it be
     resolved here.

                     RECENT transaction detail data

             RESPONSE_PROP                        Response proportion since June `94
             RESPONSE_COUNT                       Response count since June `94
             AVG_GIFT_AMT                         Average gift amount since June `94
             RECENT_STAR_STATUS                   STAR (1, 0) status since June `94
                     [Timeline figure: 1994 to 1998, with the 94NK and 96NK campaigns marked]
12
Analysis Data Definition



             RECENT transaction detail data

        CARD_RESPONSE_PROP      Response proportion since June `94
        CARD_RESPONSE_COUNT     Response count since June `94
        CARD_AVG_GIFT_AMT       Average gift amount since June `94


             [Timeline figure: 1994 to 1998, with the 94NK and 96NK campaigns marked]
13
Analysis Data Definition



            LIFETIME transaction detail data

        PROM                    Total number promotions ever
        GIFT_COUNT              Total number donations ever
        AVG_GIFT_AMT            Overall average gift amount
        PEP_STAR                STAR status ever (1=yes, 0=no)
            [Timeline figure: 1994 to 1998, with the 94NK and 96NK campaigns marked]
14
Analysis Data Definition



            LIFETIME transaction detail data

        GIFT_AMOUNT             Total gift amount ever
        GIFT_COUNT              Total number donations ever
        MAX_GIFT                Maximum gift amount
        GIFT_RANGE              Maximum less minimum gift amount
            [Timeline figure: 1994 to 1998, with the 94NK and 96NK campaigns marked]
15
Analysis Data Definition



            KDD supplied LIFETIME transaction detail data

        FILE_AVG_GIFT           Average gift from raw data
        FILE_CARD_GIFT          Average card gift from raw data
        MONTHS_SINCE_FIRST      Months from first donation to June `97
        MONTHS_SINCE_LAST       Months from last donation to June `97
            [Timeline figure: 1994 to 1998, with the 94NK and 96NK campaigns marked]
16
Analysis Data Definition



            Transaction detail data target definition


         TARGET_B    Response to 97NK solicitation (1=yes, 0=no)
         TARGET_D    Response amount to 97NK solicitation
                     (missing if no response)

            [Timeline figure: 1994 to 1998, with the 97NK campaign marked]
17
Demonstration
     Data set: PVA_RAW_DATA

     Purpose:
         Get familiar with the data
         Basic decision modeling with tree, regression, and neural network

     Parameters:
         Prior probabilities: (0.05, 0.95)
         Profit matrix: ($14.62, -0.68)
         Target: TARGET_B (TARGET_D must be rejected)
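
The profit matrix above implies a break-even response probability, and a model trained on a non-representative sample needs its probabilities rescaled to the stated prior before that threshold applies. The sketch below works through both calculations. It assumes that 0.05 is the prior for TARGET_B=1, that $14.62 and -$0.68 are the profits for soliciting a responder and a non-responder, and that not soliciting yields zero profit. The 0.25 sample response rate is a hypothetical value for illustration, not a figure taken from the data.

```python
# Values from the slide; their interpretation is an assumption (see lead-in).
PROFIT_RESPONDER = 14.62      # profit when a solicited individual donates
PROFIT_NONRESPONDER = -0.68   # loss when a solicited individual does not donate
PRIOR_RESPONSE = 0.05         # assumed population response rate
SAMPLE_RESPONSE = 0.25        # HYPOTHETICAL response rate in the modeling sample


def break_even_probability(profit_yes: float, profit_no: float) -> float:
    """Response probability at which the expected profit of soliciting is zero."""
    # Solve p * profit_yes + (1 - p) * profit_no = 0 for p.
    return -profit_no / (profit_yes - profit_no)


def adjust_for_prior(p_sample: float, sample_rate: float, prior: float) -> float:
    """Rescale a probability estimated on the sample to the population prior."""
    num = p_sample * prior / sample_rate
    den = num + (1.0 - p_sample) * (1.0 - prior) / (1.0 - sample_rate)
    return num / den


def expected_profit(p: float) -> float:
    """Expected profit of soliciting one individual with response probability p."""
    return p * PROFIT_RESPONDER + (1.0 - p) * PROFIT_NONRESPONDER


if __name__ == "__main__":
    threshold = break_even_probability(PROFIT_RESPONDER, PROFIT_NONRESPONDER)
    p_adjusted = adjust_for_prior(0.40, SAMPLE_RESPONSE, PRIOR_RESPONSE)
    print(f"break-even probability: {threshold:.4f}")
    print(f"adjusted probability:   {p_adjusted:.4f}")
    print(f"solicit: {p_adjusted > threshold}")
```

Under these assumptions the break-even probability is 0.68 / 15.30, about 0.044, so soliciting pays only when the prior-adjusted response probability exceeds roughly 4.4%.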
     




18
Improving Regression Selection
     [Figure: run time in minutes (0 to 60) against number of variables (25 to 100) for all subsets and stepwise selection]
19
Improving Input Selection
      Much of the success of a predictive model depends on input selection.
       Most input selection processes attempt to minimize input redundancy and
       maximize input relevancy.
      Selection usually relies on a heuristic search because the complexity of an
       exhaustive (all subsets) search increases exponentially in the number of
       inputs.
      There exist branch-and-bound algorithms that approximate an exhaustive
       input search and run quite quickly for a reasonably small number of
       inputs. One algorithm, found in the SAS/STAT LOGISTIC procedure,
       actually runs faster than the usual forward, backward, and stepwise
       procedures.
      While the example data set in this course has fewer than 60 inputs, many
       modeling data sets do not. Given the promise of an exhaustive search, it
       would be extremely desirable to reduce the input count without
       compromising the quality of the ultimate predictive model.
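
To make the exponential growth concrete, the following sketch counts the candidate models an exhaustive search would face and compares that with an upper bound on the fits made by plain forward selection. The function names and the input counts are illustrative choices, not part of the course material.

```python
def count_nonempty_subsets(n_inputs: int) -> int:
    """Number of distinct non-empty input subsets an all subsets search covers."""
    return 2 ** n_inputs - 1


def max_forward_fits(n_inputs: int) -> int:
    """Upper bound on models fitted by forward selection: n + (n - 1) + ... + 1."""
    return n_inputs * (n_inputs + 1) // 2


if __name__ == "__main__":
    for n in (10, 25, 50):  # illustrative input counts
        print(n, count_nonempty_subsets(n), max_forward_fits(n))
```

Each input that screening removes halves the number of subsets left to consider, which is why reducing the input count first makes an all subsets search far more practical.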



20
Improving Input Selection

               Univariate Screening

               Variable Clustering

               Categorical Recoding


               All Subsets Selection

21
Input Dimension Reduction
      A three-phased approach is proposed for input dimension
       reduction in preparation for all subsets selection.
          First, a univariate screening is performed to eliminate those inputs
           with little promise of target association. This must be done with care
           to avoid eliminating inputs whose predictive value occurs only in
           conjunction with other inputs.
          Second, variable clustering techniques are used to group correlated
           interval inputs and minimize input redundancy.
          Third, enhanced weight-of-evidence methods are used to effectively
           incorporate categorical inputs into the final model.
      With the input dimension reduced, an all subsets search
       commences on the remaining inputs.
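
As a rough illustration of the third phase, the sketch below recodes a categorical input as a smoothed weight of evidence. This is the plain textbook form with additive smoothing, not the enhanced method the course refers to; the function name, the smoothing constant, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets, smoothing=0.5):
    """Map each level to log(P(level | target=1) / P(level | target=0)).

    `smoothing` is added to every count so that levels with no events
    (or no non-events) still receive a finite value.
    """
    events, nonevents = Counter(), Counter()
    for level, target in zip(categories, targets):
        if target == 1:
            events[level] += 1
        else:
            nonevents[level] += 1

    levels = set(events) | set(nonevents)
    total_events = sum(events.values()) + smoothing * len(levels)
    total_nonevents = sum(nonevents.values()) + smoothing * len(levels)

    woe = {}
    for level in levels:
        p_event = (events[level] + smoothing) / total_events
        p_nonevent = (nonevents[level] + smoothing) / total_nonevents
        woe[level] = math.log(p_event / p_nonevent)
    return woe


if __name__ == "__main__":
    # Toy data: URBANICITY-style codes with a binary response.
    codes = ["U", "U", "U", "U", "R", "R", "R", "R"]
    response = [1, 1, 1, 0, 1, 0, 0, 0]
    print(weight_of_evidence(codes, response))
```

The recoded input is a single interval-scaled column, so a categorical field with many levels, such as CLUSTER_CODE, costs the model one parameter instead of one per level.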




22
Univariate Screening

      In this technique, inputs are screened based on their individual
       correlation with the target and only the inputs with the highest
       correlations are kept.
      Unfortunately, this approach does not account for partial
       associations among the inputs. Inputs could be erroneously
       omitted or erroneously included. Partial associations occur when
       the effect of one input changes in the presence of another input.
      A compromise devised to minimize the dangers of partial
       associations is to use univariate screening followed by liberal
       forward selection—not as a way of finding useful inputs, but rather
       as a way to eliminate clearly useless ones.
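
A small constructed example shows how univariate screening can erroneously omit an input. In the toy data below, x2 has zero correlation with the target on its own, yet the target equals x1 - x2 exactly, so x2 is essential once x1 is in the model. The data are invented purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented toy data: z and e are uncorrelated by construction.
z = [1.0, 1.0, -1.0, -1.0]
e = [1.0, -1.0, 1.0, -1.0]

x1 = [a + b for a, b in zip(z, e)]  # input 1 mixes signal e with nuisance z
x2 = z                              # input 2 carries only the nuisance
y = e                               # target; note that y == x1 - x2

r2_x1 = pearson(x1, y) ** 2  # moderate on its own
r2_x2 = pearson(x2, y) ** 2  # zero on its own, so screening would reject x2

# With both inputs and coefficients (1, -1) the fit is exact.
residuals = [t - (a - b) for t, a, b in zip(y, x1, x2)]

if __name__ == "__main__":
    print(r2_x1, r2_x2, residuals)
```

A liberal forward selection that starts from x1 would pick up x2 in its next step, which is the behavior the compromise described above relies on.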




23
R-square Selection for Univariate Screening

     The R-square selection approach has two phases.
       First, the squared input/target correlation is calculated for
        each input. Each input with a squared correlation below the
        minimum R-square setting is rejected.
       Second, a forward selection is performed. The forward
        selection procedure terminates when all remaining
        inputs have a squared correlation below the specified
        stop R-square. These remaining inputs are also rejected.
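
The two phases can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used by any particular tool. It assumes interval inputs only, and it interprets the stop criterion as the incremental gain in model R-square from adding an input, which is one reading of the description above. The default thresholds are hypothetical values.

```python
import numpy as np


def r_square(X, y):
    """R-square of a least-squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)


def r_square_selection(X, y, min_r2=0.005, stop_r2=0.0005):
    """Return the column indices kept after screening and forward selection."""
    # Phase 1: reject inputs whose squared correlation with the target is too small.
    remaining = [
        j for j in range(X.shape[1])
        if np.corrcoef(X[:, j], y)[0, 1] ** 2 >= min_r2
    ]

    # Phase 2: forward selection; stop when the best gain falls below stop_r2.
    selected, current = [], 0.0
    while remaining:
        gains = {j: r_square(X[:, selected + [j]], y) - current for j in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < stop_r2:
            break  # every input still in `remaining` is rejected
        selected.append(best)
        remaining.remove(best)
        current += gains[best]
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                     # synthetic inputs
    y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)    # only column 0 matters
    print(r_square_selection(X, y))
```

Loosening both thresholds makes the procedure more liberal, which suits its role here as a way to discard clearly useless inputs rather than to pick the final model.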




24

Mais conteúdo relacionado

Último

AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 

Último (20)

AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 

Destaque

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Destaque (20)

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 

My Law

  • 2. Modeling Example Business: National veterans’ organization Objective: From population of lapsing donors, identify individuals worth continued solicitation. Source: 1998 KDD-Cup Competition via UCI KDD Archive 2
  • 3. The Story  A national veterans’ organization seeks to better target its solicitations for donation. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns.  Solicitations involve sending a small gift to an individual together with a request for donation. Gifts include mailing labels and greeting cards.  Of particular interest is the class of individuals identified as lapsing donors. These individuals made their most recent donation between 12 and 24 months ago. The organization found that by predicting the response behavior of this group, they can use the model to rank all 3.5 million individuals in their database.  The current campaign refers to a greeting card mailing sent in 06/1997.  The source of this data is the Association for Computing Machinery’s (ACM) 1998 KDD-Cup competition. 3
  • 4. Additional Data Preparation The raw analysis data has been reduced for the purpose of this course. A subset of slightly over 19,000 records has been selected for modeling. As will be seen, this subset was not chosen arbitrarily. In addition, the 481 fields have been reduced to 50. Final Analysis Data Raw Analysis Data 19,372 Records 95,412 Records 50 Fields 481 Fields 4
  • 5. Analysis Data Definition Donor master data CONTROL_NUMBER Unique Donor ID MONTHS_SINCE_ORIGIN Elapsed time since first donation IN_HOUSE 1=Given to In House program, 0=Not In House donor 5
  • 6. Analysis Data Definition Demographic and other overlay data OVERLAY_SOURCE M=Metromail, P=Polk, B=both DONOR_AGE Age as of June 1997 DONOR_GENDER Actual or inferred gender PUBLISHED_PHONE Published telephone listing HOME_OWNER H=homeowner, U=unknown MOR_HIT Mail order response hit rate 6
  • 7. Analysis Data Definition SES is a roll-up of the socio-economic field CLUSTER_CODE Demographic and other overlay data CLUSTER_CODE 54 Socio-economic cluster codes SES 5 Socio-economic cluster codes INCOME_GROUP 7 income group levels MED_HOUSEHOLD_INCOME Median income in $100s PER_CAPITA_INCOME Income per capita in dollars WEALTH_RATING 10 wealth rating groups 7
  • 8. Analysis Data Definition Demographic and other overlay data MED_HOME_VALUE Median home value in $100s PCT_OWNER_OCCUPIED Percent owner occupied housing URBANICITY U=urban, C=city, S=suburban, T=town, R=rural, ?=unknown 8
  • 9. Analysis Data Definition Census overlay data PCT_MALE_MILITARY Percent male military in block PCT_MALE_VETERANS Percent male veterans in block PCT_VIETNAM_VETERANS Percent Vietnam veterans in block PCT_WWII_VETERANS Percent WWII veterans in block 9
  • 10. Analysis Data Definition Transaction detail data NUMBER_PROM_12 Number promotions last 12 mos. CARD_PROM_12 Number card promotions last 12 mos. 97NK Time `94 `95 `96 `97 `98 10
  • 11. Analysis Data Definition Transaction detail data FREQ_STATUS_97NK Frequency status, June `97 RECENCY_STATUS_96NK Recency status, June `96 MONTHS_SINCE_LAST Months since last donation LAST_GIFT_AMT Amount of most recent donation 96NK 97NK Time `94 `95 `96 `97 `98 11
  • 12. Analysis Data Definition: RECENT transaction detail data. RESPONSE_PROP: response proportion since June '94; RESPONSE_COUNT: response count since June '94; AVG_GIFT_AMT: average gift amount since June '94; RECENT_STAR_STATUS: STAR (1, 0) status since June '94. Note: The sampling method implies that no one made a donation between 6/1996 and 6/1997. However, for a limited number of cases, the number of months since last gift is fewer than 12. This contradiction is not resolved in the data's documentation, nor will it be resolved here. [Timeline figure: time axis from '94 to '98 with 94NK and 96NK marked.]
  • 13. Analysis Data Definition: RECENT transaction detail data. CARD_RESPONSE_PROP: response proportion since June '94; CARD_RESPONSE_COUNT: response count since June '94; CARD_AVG_GIFT_AMT: average gift amount since June '94. [Timeline figure: time axis from '94 to '98 with 94NK and 96NK marked.]
  • 14. Analysis Data Definition: LIFETIME transaction detail data. PROM: total number of promotions ever; GIFT_COUNT: total number of donations ever; AVG_GIFT_AMT: overall average gift amount; PEP_STAR: STAR status ever (1=yes, 0=no). [Timeline figure: time axis from '94 to '98 with 94NK and 96NK marked.]
  • 15. Analysis Data Definition: LIFETIME transaction detail data. GIFT_AMOUNT: total gift amount ever; GIFT_COUNT: total number of donations ever; MAX_GIFT: maximum gift amount; GIFT_RANGE: maximum less minimum gift amount. [Timeline figure: time axis from '94 to '98 with 94NK and 96NK marked.]
  • 16. Analysis Data Definition: KDD-supplied LIFETIME transaction detail data. FILE_AVG_GIFT: average gift from raw data; FILE_CARD_GIFT: average card gift from raw data; MONTHS_SINCE_FIRST: first donation date from June '97; MONTHS_SINCE_LAST: last donation date from June '97. [Timeline figure: time axis from '94 to '98 with 94NK and 96NK marked.]
  • 17. Analysis Data Definition: Transaction detail data, target definition. TARGET_B: response to 97NK solicitation (1=yes, 0=no); TARGET_D: response amount to 97NK solicitation (missing if no response). [Timeline figure: time axis from '94 to '98 with 97NK marked.]
  • 18. Demonstration. Data set: PVA_RAW_DATA. Purpose: get familiar with the data; basic decision modeling with tree, regression, and neural network. Parameters: prior probabilities (0.05, 0.95); profit matrix ($14.62, -0.68); target TARGET_B (TARGET_D must be rejected).
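The parameters on this slide can be turned into a solicitation cutoff. The sketch below is one reading, not something the slides state: it assumes the profit matrix means $14.62 for soliciting a responder and -$0.68 for soliciting a non-responder (with zero profit for not soliciting), and that the priors are ordered (responder, non-responder). The function names and the 0.25 sample response rate in the usage line are hypothetical.

```python
def breakeven_probability(profit_responder, cost_nonresponder):
    """Smallest response probability at which soliciting does not lose money.

    Expected profit of soliciting = p * profit - (1 - p) * cost,
    which is non-negative when p >= cost / (profit + cost).
    """
    return cost_nonresponder / (profit_responder + cost_nonresponder)


def adjust_for_priors(p_sample, sample_rate, prior_rate):
    """Rescale a probability estimated on an oversampled data set
    to the population response rate (the prior)."""
    num = p_sample * prior_rate / sample_rate
    den = num + (1 - p_sample) * (1 - prior_rate) / (1 - sample_rate)
    return num / den


# Assumed reading of the slide's profit matrix: +14.62 / -0.68.
threshold = breakeven_probability(14.62, 0.68)
print(f"solicit when adjusted P(response) >= {threshold:.4f}")

# Hypothetical: a model trained on a sample with a 25% response rate
# scores a donor at 0.25; under a 5% prior that rescales to the prior itself.
print(adjust_for_priors(0.25, sample_rate=0.25, prior_rate=0.05))
```

Under these assumptions the cutoff sits a little below the 5% prior, so a donor scored at the population average would still be worth soliciting.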
  • 19. Improving Regression Selection. [Figure: run time in minutes (0 to 60) versus number of variables (0 to 100) for all subsets selection and stepwise selection.]
  • 20. Improving Input Selection: Much of the success of a predictive model depends on input selection. Most input selection processes attempt to minimize input redundancy and maximize input relevancy. Selection usually relies on a heuristic search because the complexity of an exhaustive (all subsets) search increases exponentially with the number of inputs. There exist branch-and-bound algorithms that approximate an exhaustive input search and run quite quickly for a reasonably small number of inputs. One algorithm, found in the SAS/STAT LOGISTIC procedure, actually runs faster than the usual forward, backward, and stepwise procedures. While the example data set in this course has fewer than 60 inputs, many modeling data sets do not. Given the promise of an exhaustive search, it would be extremely desirable to reduce the input count without compromising the quality of the ultimate predictive model.
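To make the exponential growth concrete, the short sketch below compares the number of candidate models in an exhaustive search with the worst case for plain forward selection. The function names are illustrative, and the forward-selection count assumes one fit per remaining input at each step with no early stopping.

```python
def all_subsets_models(k):
    """Number of distinct input subsets for k inputs (including the empty set)."""
    return 2 ** k


def forward_selection_models(k):
    """Upper bound on fits in forward selection: k + (k - 1) + ... + 1."""
    return k * (k + 1) // 2


for k in (10, 25, 50, 60):
    print(k, all_subsets_models(k), forward_selection_models(k))
```

Even at 60 inputs the exhaustive count is astronomically larger than the heuristic one, which is why reducing the input count first matters.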
  • 21. Improving Input Selection: Univariate Screening, Variable Clustering, Categorical Recoding, All Subsets Selection.
  • 22. Input Dimension Reduction: A three-phase approach is proposed for input dimension reduction in preparation for all subsets selection. First, a univariate screening is performed to eliminate those inputs with little promise of target association. This must be done with care to avoid eliminating inputs whose predictive value occurs only in conjunction with other inputs. Second, variable clustering techniques are used to group correlated interval inputs and minimize input redundancy. Third, enhanced weight-of-evidence methods are used to effectively incorporate categorical inputs into the final model. With the input dimension reduced, an all subsets search commences on the remaining inputs.
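The third phase recodes categorical inputs as numbers. The slides do not define their "enhanced" weight-of-evidence method, so the sketch below shows only the plain, commonly used form: for each level, the log of the level's share of responders divided by its share of non-responders. The `smoothing` constant and the toy URBANICITY data are assumptions made for illustration.

```python
import math
from collections import Counter


def weight_of_evidence(levels, targets, smoothing=0.5):
    """Plain weight of evidence per category level.

    WOE(level) = ln( P(level | target=1) / P(level | target=0) ).
    `smoothing` is added to every cell so empty cells do not produce
    a division by zero or log(0); it is an assumption, not from the slides.
    """
    events, nonevents = Counter(), Counter()
    for level, y in zip(levels, targets):
        (events if y == 1 else nonevents)[level] += 1
    all_levels = set(events) | set(nonevents)
    total_events = sum(events.values()) + smoothing * len(all_levels)
    total_nonevents = sum(nonevents.values()) + smoothing * len(all_levels)
    return {
        level: math.log(
            ((events[level] + smoothing) / total_events)
            / ((nonevents[level] + smoothing) / total_nonevents)
        )
        for level in all_levels
    }


# Toy data (made up): URBANICITY codes and a binary response.
urbanicity = ["U", "U", "U", "R", "R", "R"]
response = [1, 1, 0, 0, 0, 1]
woe = weight_of_evidence(urbanicity, response)
print(woe)
```

Levels that respond more often than average get a positive value and the rest a negative one, so the recoded input can enter a regression as a single interval variable.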
  • 23. Univariate Screening: In this technique, inputs are screened based on their individual correlation with the target, and only the inputs with the highest correlations are kept. Unfortunately, this approach does not account for partial associations among the inputs. Inputs could be erroneously omitted or erroneously included. Partial associations occur when the effect of one input changes in the presence of another input. A compromise devised to minimize the dangers of partial associations is to use univariate screening followed by liberal forward selection, not as a way of finding useful inputs, but rather as a way to eliminate clearly useless ones.
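An extreme case shows how univariate screening can erroneously omit an input. In the constructed example below (not taken from the course data), the target is the exclusive-or of two binary inputs: each input alone has zero correlation with the target, yet the pair determines it exactly. The `pearson` helper is written out by hand to keep the sketch dependency-free.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Constructed inputs: every combination of (x1, x2) occurs equally often.
x1 = [0, 0, 1, 1] * 25
x2 = [0, 1, 0, 1] * 25
target = [a ^ b for a, b in zip(x1, x2)]  # responds only when inputs differ

r1 = pearson(x1, target)
r2 = pearson(x2, target)
print(r1, r2)  # both correlations vanish, so screening would drop both inputs
```

Real data rarely behave this cleanly, but weaker versions of the same effect are the reason the screening threshold is set liberally.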
  • 24. R-square Selection for Univariate Screening: The R-square selection approach has two phases. First, the input/target correlation is calculated for each input. Each input with a squared correlation below the minimum R-square setting is rejected. Second, a forward selection is performed. The forward selection procedure terminates when all remaining inputs fall below the specified stop R-square. These remaining inputs are also rejected.
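The two phases can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure's actual implementation: the stop criterion is read as the incremental R-square an input would add to the current model, the threshold values `min_r2` and `stop_r2` are placeholders, and the toy data are made up. Forward selection is done by residualizing the target and the remaining inputs against each selected input, which gives the same R-square gains as refitting a least-squares regression at each step.

```python
def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def r_square_selection(inputs, target, min_r2=0.005, stop_r2=0.0005):
    """Return the input names kept by a two-phase R-square screen."""
    y = center(target)
    sst = dot(y, y)

    # Phase 1: reject inputs whose squared correlation with the target
    # is below the minimum R-square.
    cols = {}
    for name, values in inputs.items():
        x = center(values)
        sxx = dot(x, x)
        r2 = dot(x, y) ** 2 / (sxx * sst) if sxx > 1e-9 else 0.0
        if r2 >= min_r2:
            cols[name] = x

    # Phase 2: forward selection on the survivors. Stop when no remaining
    # input adds at least stop_r2 (assumed to mean incremental R-square).
    selected, resid = [], y
    while cols:
        gains = {}
        for name, x in cols.items():
            sxx = dot(x, x)
            gains[name] = dot(x, resid) ** 2 / (sxx * sst) if sxx > 1e-9 else 0.0
        best = max(gains, key=gains.get)
        if gains[best] < stop_r2:
            break  # everything still in cols is rejected
        q = cols.pop(best)
        qq = dot(q, q)
        selected.append(best)
        b = dot(q, resid) / qq
        resid = [r - b * a for r, a in zip(resid, q)]
        # Remove the selected input's direction from the remaining inputs.
        cols = {
            name: [a - (dot(x, q) / qq) * c for a, c in zip(x, q)]
            for name, x in cols.items()
        }
    return selected


# Toy data (made up): one useful input, an exact copy of it, and noise.
toy_inputs = {
    "x_useful": [1, 2, 3, 4, 5, 6, 7, 8],
    "x_copy": [2, 4, 6, 8, 10, 12, 14, 16],
    "x_noise": [1, -1, -1, 1, 1, -1, -1, 1],
}
toy_target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]
selected = r_square_selection(toy_inputs, toy_target)
print(selected)
```

In the toy run the noise input fails the first phase, while the redundant copy passes it but adds nothing in the second phase, which shows why both phases are needed.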