2. Modeling Example
Business: National veterans’ organization
Objective: From a population of lapsing donors, identify individuals
worth continued solicitation.
Source: 1998 KDD-Cup Competition
via UCI KDD Archive
3. The Story
A national veterans’ organization seeks to better target its solicitations for
donation. By only soliciting the most likely donors, less money will be
spent on solicitation efforts and more money will be available for
charitable concerns.
Solicitations involve sending a small gift to an individual together with a
request for donation. Gifts include mailing labels and greeting cards.
Of particular interest is the class of individuals identified as lapsing
donors. These individuals made their most recent donation between 12
and 24 months ago. The organization found that a model predicting the
response behavior of this group can be used to rank all 3.5
million individuals in its database.
The current campaign refers to a greeting card mailing sent in 06/1997.
The source of this data is the Association for Computing Machinery’s
(ACM) 1998 KDD-Cup competition.
4. Additional Data Preparation
The raw analysis data has been reduced for the purpose of this course. A subset of
slightly over 19,000 records has been selected for modeling. As will be seen, this
subset was not chosen arbitrarily. In addition, the 481 fields have been reduced to 50.
Raw Analysis Data: 95,412 Records, 481 Fields
Final Analysis Data: 19,372 Records, 50 Fields
5. Analysis Data Definition
Donor master data
CONTROL_NUMBER Unique Donor ID
MONTHS_SINCE_ORIGIN Elapsed time since first donation
IN_HOUSE 1=Given to In House program,
0=Not In House donor
6. Analysis Data Definition
Demographic and other overlay data
OVERLAY_SOURCE M=Metromail, P=Polk, B=both
DONOR_AGE Age as of June 1997
DONOR_GENDER Actual or inferred gender
PUBLISHED_PHONE Published telephone listing
HOME_OWNER H=homeowner, U=unknown
MOR_HIT Mail order response hit rate
7. Analysis Data Definition
SES is a roll-up of the socio-economic field CLUSTER_CODE
Demographic and other overlay data
CLUSTER_CODE 54 Socio-economic cluster codes
SES 5 Socio-economic cluster codes
INCOME_GROUP 7 income group levels
MED_HOUSEHOLD_INCOME Median income in $100s
PER_CAPITA_INCOME Income per capita in dollars
WEALTH_RATING 10 wealth rating groups
8. Analysis Data Definition
Demographic and other overlay data
MED_HOME_VALUE Median home value in $100s
PCT_OWNER_OCCUPIED Percent owner occupied housing
URBANICITY U=urban, C=city, S=suburban,
T=town, R=rural, ?=unknown
9. Analysis Data Definition
Census overlay data
PCT_MALE_MILITARY Percent male military in block
PCT_MALE_VETERANS Percent male veterans in block
PCT_VIETNAM_VETERANS Percent Vietnam veterans in block
PCT_WWII_VETERANS Percent WWII veterans in block
10. Analysis Data Definition
Transaction detail data
NUMBER_PROM_12 Number promotions last 12 mos.
CARD_PROM_12 Number card promotions last 12 mos.
[Timeline figure: 97NK marked on a time axis running from ’94 to ’98]
11. Analysis Data Definition
Transaction detail data
FREQ_STATUS_97NK Frequency status, June ’97
RECENCY_STATUS_96NK Recency status, June ’96
MONTHS_SINCE_LAST Months since last donation
LAST_GIFT_AMT Amount of most recent donation
[Timeline figure: 96NK and 97NK marked on a time axis running from ’94 to ’98]
12. Analysis Data Definition
The sampling method implies that no one made a donation between 6/1996 and 6/1997.
However, for a limited number of cases, the number of months since last gift is fewer
than 12. This contradiction is not resolved in the data’s documentation, nor will it be
resolved here.
RECENT transaction detail data
RESPONSE_PROP Response proportion since June ’94
RESPONSE_COUNT Response count since June ’94
AVG_GIFT_AMT Average gift amount since June ’94
RECENT_STAR_STATUS STAR (1, 0) status since June ’94
[Timeline figure: 94NK and 96NK marked on a time axis running from ’94 to ’98]
13. Analysis Data Definition
RECENT transaction detail data
CARD_RESPONSE_PROP Response proportion since June ’94
CARD_RESPONSE_COUNT Response count since June ’94
CARD_AVG_GIFT_AMT Average gift amount since June ’94
[Timeline figure: 94NK and 96NK marked on a time axis running from ’94 to ’98]
14. Analysis Data Definition
LIFETIME transaction detail data
PROM Total number promotions ever
GIFT_COUNT Total number donations ever
AVG_GIFT_AMT Overall average gift amount
PEP_STAR STAR status ever (1=yes, 0=no)
[Timeline figure: 94NK and 96NK marked on a time axis running from ’94 to ’98]
15. Analysis Data Definition
LIFETIME transaction detail data
GIFT_AMOUNT Total gift amount ever
GIFT_COUNT Total number donations ever
MAX_GIFT Maximum gift amount
GIFT_RANGE Maximum less minimum gift amount
[Timeline figure: 94NK and 96NK marked on a time axis running from ’94 to ’98]
16. Analysis Data Definition
KDD supplied LIFETIME transaction detail data
FILE_AVG_GIFT Average gift from raw data
FILE_CARD_GIFT Average card gift raw data
MONTHS_SINCE_FIRST First donation date from June ’97
MONTHS_SINCE_LAST Last donation date from June ’97
[Timeline figure: 94NK and 96NK marked on a time axis running from ’94 to ’98]
17. Analysis Data Definition
Transaction detail data target definition
TARGET_B Response to 97NK solicitation (1=yes 0=no)
TARGET_D Response amount to 97NK solicitation
(missing if no response)
[Timeline figure: 97NK marked on a time axis running from ’94 to ’98]
18. Demonstration
Data set: PVA_RAW_DATA
Purpose:
Get familiar with the data
Basic decision modeling with tree, regression, and neural network
Parameters:
Prior probabilities: (0.05, 0.95)
Profit matrix: ($14.62, -0.68)
Target: TARGET_B (TARGET_D must be rejected)
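As a rough illustration of how these parameters could drive a solicitation decision, the sketch below computes the break-even response probability implied by the profit matrix and rescales a model probability to the stated prior. It assumes that $14.62 is the profit from soliciting a responder, that -0.68 is the profit from soliciting a non-responder, and that 0.05 is the prior probability of response. The function names and the 0.25 sample response rate are hypothetical, chosen only for this example.

```python
def break_even_probability(profit_responder: float, profit_nonresponder: float) -> float:
    """Response probability at which the expected profit of soliciting is zero.

    Expected profit of soliciting = p * profit_responder + (1 - p) * profit_nonresponder.
    Not soliciting is assumed to yield zero profit.
    """
    return -profit_nonresponder / (profit_responder - profit_nonresponder)


def adjust_for_priors(p_sample: float, sample_rate: float, prior_rate: float) -> float:
    """Rescale a probability estimated on an oversampled data set to the population prior."""
    responder_part = p_sample * prior_rate / sample_rate
    nonresponder_part = (1 - p_sample) * (1 - prior_rate) / (1 - sample_rate)
    return responder_part / (responder_part + nonresponder_part)


# Assumed reading of the profit matrix: (responder, non-responder) when solicited.
threshold = break_even_probability(14.62, -0.68)
print(round(threshold, 4))  # 0.0444

# Hypothetical: the modeling sample has a 25% response rate, the population 5%.
p_population = adjust_for_priors(p_sample=0.25, sample_rate=0.25, prior_rate=0.05)
print(round(p_population, 4))  # 0.05
print(p_population > threshold)  # True: solicit under these assumptions
```

Under this reading, an individual is worth soliciting whenever the prior-adjusted response probability exceeds roughly 4.4%.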
20. Improving Input Selection
Much of the success of a predictive model depends on input selection.
Most input selection processes attempt to minimize input redundancy and
maximize input relevancy.
Selection usually relies on a heuristic search because the complexity of an
exhaustive (all subsets) search increases exponentially with the number of
inputs.
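To make the exponential growth concrete, the following minimal sketch counts the candidate models an exhaustive search would have to consider and enumerates them for a tiny input list. The function names and the three example inputs are illustrative only.

```python
from itertools import combinations


def count_nonempty_subsets(n_inputs: int) -> int:
    """Number of distinct non-empty input subsets: 2^n - 1."""
    return 2 ** n_inputs - 1


def all_subsets(inputs):
    """Yield every non-empty subset of the inputs, smallest subsets first."""
    for size in range(1, len(inputs) + 1):
        yield from combinations(inputs, size)


for k in (10, 20, 50):
    print(k, count_nonempty_subsets(k))
# 10 1023
# 20 1048575
# 50 1125899906842623

# Enumeration is only practical for a handful of inputs.
print(list(all_subsets(["DONOR_AGE", "LAST_GIFT_AMT", "MONTHS_SINCE_LAST"])))
```

Each added input doubles the number of candidate subsets, which is why reducing the input count before an all subsets search matters.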
There exist branch-and-bound algorithms that approximate an exhaustive
input search and run quite quickly for a reasonably small number of
inputs. One algorithm, found in the SAS/STAT LOGISTIC procedure,
actually runs faster than the usual forward, backward, and stepwise
procedures.
While the example data set in this course has fewer than 60 inputs, many
modeling data sets do not. Given the promise of an exhaustive search, it
would be extremely desirable to reduce the input count without
compromising the quality of the ultimate predictive model.
22. Input Dimension Reduction
A three-phased approach is proposed for input dimension
reduction in preparation for all subsets selection.
First, a univariate screening is performed to eliminate those inputs
with little promise of target association. This must be done with care
to avoid eliminating inputs whose predictive value occurs only in
conjunction with other inputs.
Second, variable clustering techniques are used to group correlated
interval inputs and minimize input redundancy.
Third, enhanced weight-of-evidence methods are used to effectively
incorporate categorical inputs into the final model.
With the input dimension reduced, an all subsets search
commences on the remaining inputs.
23. Univariate Screening
In this technique, inputs are screened based on their individual
correlation with the target and only the inputs with the highest
correlations are kept.
Unfortunately, this approach does not account for partial
associations among the inputs. Inputs could be erroneously
omitted or erroneously included. Partial associations occur when
the effect of one input changes in the presence of another input.
A compromise devised to minimize the dangers of partial
associations is to use univariate screening followed by liberal
forward selection—not as a way of finding useful inputs, but rather
as a way to eliminate clearly useless ones.
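The danger described above can be seen in a small constructed example. In the sketch below, two made-up inputs each have zero correlation with the target on their own, yet together they determine it exactly, so a purely univariate screen would discard both. The data and names are hypothetical and are not drawn from the course data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical inputs: the target is 1 only when both inputs have the same sign.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
target = [1, 0, 0, 1]
interaction = [a * b for a, b in zip(x1, x2)]

print(pearson(x1, target))           # 0.0
print(pearson(x2, target))           # 0.0
print(pearson(interaction, target))  # 1.0
```

The effect of x1 on the target reverses depending on x2, which is the kind of partial association a one-input-at-a-time screen cannot detect.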
24. R-square Selection for Univariate Screening
The R-square selection approach has two phases.
First, the squared input/target correlation is calculated for each
input. Each input whose squared correlation falls below the minimum
R-square setting is rejected.
Second, a forward selection is performed. The forward
selection procedure terminates when all remaining
inputs have a correlation below the specified stop
R-square. These remaining inputs are also rejected.
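The two phases can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used by any particular tool: it treats the phase-one criterion as the squared correlation with the target, treats the phase-two criterion as the gain in model R-square from adding an input, and uses made-up threshold values, input names, and data.

```python
def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_component(v, direction):
    """Return v with its projection onto `direction` subtracted."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * d for a, d in zip(v, direction)]


def r_square_selection(inputs, target, min_r2=0.005, stop_r2=0.0005):
    """Two-phase screen; threshold defaults are illustrative assumptions."""
    y = center(target)
    total_ss = dot(y, y)

    # Phase 1: keep inputs whose squared correlation with the target
    # reaches the minimum R-square.
    candidates = {}
    for name, values in inputs.items():
        x = center(values)
        xx = dot(x, x)
        if xx > 0 and dot(x, y) ** 2 / (xx * total_ss) >= min_r2:
            candidates[name] = x

    # Phase 2: forward selection on the R-square gain of each remaining input.
    selected = []
    while candidates:
        gains = {
            name: dot(x, y) ** 2 / (dot(x, x) * total_ss)
            for name, x in candidates.items()
            if dot(x, x) > 1e-12  # skip inputs fully explained by earlier picks
        }
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] < stop_r2:
            break  # every remaining input is rejected
        direction = candidates.pop(best)
        selected.append(best)
        # Remove what the chosen input explains from the target and the rest.
        y = remove_component(y, direction)
        candidates = {n: remove_component(x, direction) for n, x in candidates.items()}
    return selected


# Hypothetical data: the target depends on "a" and "b" but not on "junk".
data = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [1, -1, -1, 1, 1, -1, -1, 1],
    "junk": [1, 1, -1, -1, -1, -1, 1, 1],
}
response = [3, 0, 1, 6, 7, 4, 5, 10]
print(r_square_selection(data, response))  # ['a', 'b']
```

Here "junk" is rejected in the first phase because its squared correlation with the target is zero, while "a" and "b" each add enough R-square to be kept.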