21. Getting intimate with our target groups
ID construction of adolescents in a
digitalised world
Social groups as an extension of
psychographic segments
Role of brands in ID construction within
social groups
23. Traditional ethnography
• Participant and researcher (R) in context
• Hawthorne effect
• Researcher gaze
• Time & cost intensive
24. From traditional to visual ethnography
Traditional ethnography
• Participant and researcher (R) in context
• Hawthorne effect
• Researcher gaze
• Time & cost intensive
360° visual ethnography
• Participant in context, contact with researcher (R) via 2.0 tools
• "Informer" gaze: what is important?
• Follow multiple participants from anywhere
25. 360° Ethnography
1. User-generated MM ethnography
= Observation takes place via photos/video
taken by the participants in the study
= Participants observe their own
environment & report back to the
researcher
27. Ethnographic blog: identity related tasks
Non me: pictures of...
• Clothes you no longer want to wear
• Youngsters you do not want to be friends with
Aspirational me: pictures of...
• Favorite clothes
• Youngsters you would like to be friends with
Personal me: pictures of...
• Place where you can really be yourself
• Clothes you wear at home
• Objects that are typical for me
Social me: pictures of...
• Important persons in your life
• Clothes you wear to go out
• Groups of youngsters you like
Other groups: pictures of...
• Other groups that are different but okay
• Youth of today
• Normal persons
28. 360° Ethnography
1. User-generated ethnography
= Observation takes place via photos/video taken
by the participants in the study
= Participants observe their own environment &
report back to the researcher
2. Nethnography: the social life of teenagers is very
much MOVING ONLINE!
= Observation of the online behaviour & content
of a target group or within a certain webspace
30. Social Network Sites: nethnography
Personal identity and social identity are observed via:
• Nickname
• Profile text
• Profile picture
• Photo collection
• Clan membership
• Conversation monitoring on guestbook
31. And for the datamining freaks among you, Annelies ripped the internet
• 300 active Netlog participants, randomly
selected, with an equal spread across age x gender
• Web crawlers to "scan" Netlog pages and
extract content
• Text mining: profile pages, photo tags,
clan membership, conversations on the
guestbook
38. Social Networks are not as social as you think they are
Only 32% of the conversations on the guestbook
are "interactive"!
The rest are all single statements.
Topics (word cloud): Food, Little kids, Shoes, Mobile phones, Cars, Quarrel,
Miss you, Making fun, Confirm friendship, Travel, Music, Welcome by strangers,
Transport, It was great, Feedback on picture, I am bored, Youth movement,
Alcohol, Sleeping, How are you?, Practical appointment, School,
Feedback on profiles, MSN, Festivals, Congratulations, Family, Gaming, Age,
Party, Sport, Online movies, Express love, Study, Tokio Hotel
39. one of the 3 key research questions
Social groups among today's youngsters
41. Methodology
[Diagram: ME in relation to Social Group, Aspirational Group, Non group, and Different but OK]
42. REAL WORLD
WE ME (I’m better)
CHANGE
Skater
Alternative
Fashion girl
Fashion boy
Tektonic Rapper
THINK
Rockers
DRINK
Hippies
MAINSTREAM
Breezer sluts
Punk
Jumpers
emo
Nerds
Gothic
Geek girls
CONSERVATISM
43. spartacus121 best pk (runescape)
30/04/2008 19:07:55 by laurent
I hit something around 27 and that is already crazy high; his hits are
insane lol. His sword is 140M (I have 10M)
44. REAL WORLD
[Figure: the map of youth groups (WE vs. ME, CHANGE vs. CONSERVATISM, THINK, DRINK) overlaid with the label SKILLS]
45. REAL WORLD
[Figure: the map of youth groups (WE vs. ME, CHANGE vs. CONSERVATISM, THINK, DRINK) overlaid with the label LOOKS]
50. 6 changes that rocked our socks off
• The tools for ID construction have changed
• New online quali methods proved to be efficient
• Our target group = the new and better quali researchers
• New reflection kit for entire MTV Networks staff
• Closer connection with MTV & new clients
• Redefine content strategy of TMF on screen & on line
55. Our Mission | Boosting your customer value
4C Consulting
helps companies
win, keep and grow
customer value
56. Our Solutions | Call us for…
• Customer Value Strategy
• Customer Insight
• Process Excellence
• Business requirements definition
• Package selection & implementation
• Post-launch care
57. Our Focus | Boosting your customer value
Increase Revenue
1. Acquire new customers
2. Sell more to existing customers
• More of the same (increase turnover)
• Expand value proposition portfolio
(cross-sell & product development *)
• Upgrade value proposition (Upsell)
3. Prevent existing customers from leaving
Reduce Costs
1. Efficient Delivery (Process Excellence)
2. Align value propositions
3. Advanced pricing
58. Business Intelligence Practice
SCV ROMSCCI Competing on Analytics
Infrastructure
Audit Operations Coaching
Data Quality
59. Why 4C Consulting | 7 compelling reasons
1. 100% focus on customer value
management
2. Result-oriented project approach
3. Connecting marketing, sales &
customer care with senior management
and IT
4. Independent consultant for 10 years
5. Experienced crew, passionate about
marketing, sales & customer care
6. Value based pricing model
7. Satisfied & loyal customers: 90
customers, more than 380 projects
62. Business & Decision Benelux
Founded in 2002
Merger of several companies specialised in BI
Consulting & System Integrator:
- More than 300 consultants
- About 18 mio Euro turnover in 2007
- 58% organic growth compared to 2006
- Last acquisition: BnV Group (BE+NL)
Business & Decision Benelux is:
• a multi-specialist, in specific technology fields:
• Business Intelligence
• Customer Relationship Management
• Life sciences
• Risk & Compliance
• with foreign offices in Brussels, Amsterdam, Luxembourg
• Top accounts in finance, pharma, telco, distribution, industry (Fortis, ING,
ABN Amro, Dexia, GSK, UCB, Proximus, Belgacom, Carrefour, Honda,…)
[Chart: Turnover evolution (consolidated), in thousands, for Belgium, Luxembourg and Netherlands, 2004 to 2007 and objective 2008; axis 0 to 25000]
67. Febelmar mission
Development and promotion of market
research and opinion polls in Belgium
Protecting the sector interests
Watching over correct use of deontological
rules of market research in all phases of the
market research process
Stimulating continuous improvement of
quality of service in market research
Being a platform for communication,
exchange of expertise and networking
71. What is I4BI?
i4bi is specialized in implementing Business Intelligence Solutions in
your company.
Our team of BI experts has deep functional and technical experience
with Application development, Business Model definitions and Data
Warehousing. Functional/technical designs, development, application
roll-out, training etc. are phases in a project where our consultants
have many years of experience.
i4bi consultants have a deep knowledge of the Oracle Business
Intelligence products and solutions.
i4bi sponsors the development of an independent analytical branch,
which will probably see the light in 2009
72. What do we provide?
We provide
• Business Experience
• Technical Abilities
• Analytical Expertise
to support Strategic Decision Making
73. Expertise
• Analytical Expertise
• Data Mining
• Statistical modelling
• Predictive analysis
• Basel II compliant modelling
• Forecasting
• Business Analysis Expertise
• Reporting
• Delivering Business Insight to decision makers
• Marketing Analysis
• Data Quality
• Use of technical tools such as SAS – SPSS – Statistica to
support & extend business knowledge
74. Contact
• For more general information:
www.I4BI.be
• For more analytical information:
Filip.deroover@I4BI.be
76. We believe ...
in the empowered consumer
Human-to-human interactions are more powerful than
ever and can make or break your brand
“Consumers are beginning in a very
real sense to own our brands and
participate. We need to begin to
learn how to let go”
A.G. Lafley, CEO & Chairman of P&G
77. We believe ...
in giving back
Rewarding experiences for participants
Active involvement of panel members
Charity contributions
78. We believe ...
in connecting
Everything we do is aimed at strengthening connections
between you, your market and us
“Connected Research” brings you closer to your market
and taps into the wisdom of the crowds
Some of our connected research methods
Research communities
Bulletin boards
Blog research
Online discussion groups
More information: http://connectedresearch.insites.eu/
79. We believe ...
in the power of new research methods
for better marketing decision making
Informational
Providing more depth to research insights
Transformational
Doing things that were previously not possible
Automational
Conducting research more efficiently
80. We believe ...
in 1 + 1 = 3
Old and new methods need to be optimally “fused” in
order to fully grasp the new customer / consumer reality
81. We believe ...
in the power of our team
People make the difference
Open, forward thinking, dedicated, passionate
Specific knowledge centers
83. About Keyrus (Belgium)
• founded in 1996 as SOLIDPartners
• focus on performance management, business
intelligence & data warehousing
• strong and balanced client base spread over different
industries
• +100 consultants specialised in both technical and
business domains
• part of Keyrus group (France)
84. Keyrus' global footprint
head office in Paris
present in 9 countries
+1300 employees
listed on Paris stock
exchange Euronext
85. Vision & mission
Keyrus will be one of the few leading service
providers
in the area of performance management.
We help our clients to effectively design,
build and operate
the adequate performance management
organization and solutions
in an integrated end-to-end fashion.
86. Portfolio of solutions & services
• Information Management (IM)
• Business Intelligence Platforms (BIP)
• Analytic Applications (AA)
• People and Processes (P&P)
• Corporate Performance Management ((C)PM)
[Architecture diagram: source systems & applications; data delivery & management functions; data warehouse & data marts; information outflow functions; BI layer (reports, OLAP, dashboards, alerts); CPM data delivery, exchange & synchronization; CPM applications (e.g. planning, ABM, PA); analytic applications (e.g. data mining)]
87. Contact us
Keyrus nv
info@keyrus.be
Nijverheidslaan 3/2
B-1853 Strombeek-Bever
t +32 2 706 03 00
f +32 2 706 03 09
www.keyrus.be
performance management consulting technology 17-Dec-2008
89. Who is PROFACTS ?
We are "the new kids on the block"
in (online) market research ...
90. 1 REVEALING FACTORS FOR SUCCESS
strategy
200% 286 people are nowyrs
growth rate working mean age
@ Profacts
2
people have
founded Profacts
Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
2006 2007 2007 2007 2007 2008 2008 2008 2008
91. REVEALING FACTORS FOR SUCCESS
Profacts is active in more than 10 sectors ...
AUTOMOTIVE FMCG
RECRUITMENT GPS
BANKING INSURANCES
ICT PHARMACEUTICAL
TELECOM ENERGY
98. MARKETING & SENSORY RESEARCH
OUR PASSION
Sensory research
FTF research (Mobile Unit)
Eye-tracking / Eye|watch
Telephone research
Tachistoscope
Online research
Trained panel
Panel services
Consumer panel
Fieldwork in Europe
Taste lab
100. Sensory Safari
Note down in your agenda
SENSORY SAFARI
• March 26th 2009
• 18:00
• At Rogil in Leuven
106. SAS Breadth of Analytic Offering…
• Statistical Analysis
• Survey Design/Analysis
• Data Mining
• Text Mining
• Time Series Mining
• Forecasting
• Quality Improvement
• Operations Research
110. The Focus
Attract Retain Grow Fraud Risk
Driving and Maximizing Profit
111. The Vision
Operational
Processes
Interaction data Attitudinal data
Descriptive data Behavioral data
Enterprise Data Sources
112. The Acceptance
• The rise of the agnostics
Science vs. Chance
In numbers we trust!
116. The myth of the "best" algorithm
lessons learned from innovations in
data sampling and data pre-processing
for marketing analytics
Dr. Sven F. Crone
Deputy Director, Ass. Prof.
117. Associated Experts
Prof. Paul Goodwin
Directors Dr. Andrew Eaves
Prof. Robert Fildes
Prof. Peter Young Research & PhD students
Dr.Sven F. Crone Heiko Kausch, RA
Stavros Asimakopoulos
Researchers Xi Chen
Dr. Steve Finlay Bruce Havel
Dr. Alastair Robertson Suzi Ismail
Dr. Didier Soopramanien Nikolaos Kourentzes
Dr. Kostas Nikolopoulos Ioannis Stamatopoulos
Andrey Davidenko
Prof. Stephen Taylor Charlotte Brown
Dr. Wlodek Tych Hong Juan Liu
Prof. David Peel T Hu
Prof. Peter Pope John Prest
Huang Tao
Visiting Researchers
Prof. Geoff Allen
Dr. Yukun Bao
Young-Sang Cho
118. “Take away this pudding, it has no
theme.” Sir Winston Churchill (1915)
119. Agenda
• Sampling issues in Data Mining
• Case study 1: Direct Marketing
• Cross-selling of Magazine subscriptions
• Effect of data preprocessing: Sampling
• Interaction of Sampling with Scaling & Coding
• Case study 2: Credit & Behavioral Scoring
• Predicting consumer credit default
• Effects of sample size
• Effects of sample distribution
• Case study 3: Online Shopping Behaviour
• Predicting consumer shopping channel choice
• Sample distribution & multiple classes
• Conclusion & Take-aways
120. Why (Under/Over) Sampling?
• Knowledge Discovery (KDD) = non-trivial process of identifying
valid, novel, useful patterns in large data sets
• Data Mining = only one single step in the KDD process
• Data sample determines the whole process! (GIGO: garbage in, garbage out)
• "Research seems preoccupied with algorithms" [Hand 2000]
[Figures: SAS SEMMA DM process; CRISP-DM process; monitoring]
121. Sampling in Direct Marketing Literature?
Columns: reference, input type*, methods***, parameter tuning, data reduction** (feature selection, resampling), data projection (continuous attributes: standardisation, discretisation; categories: coding). Each X marks one step considered by the paper.
[2] 2 BMLP, LR, LDA, QDA X X
[42] 1 MLP, LR, CHAID X X
[43] 2 MLP, RBF, LR, GP, CHAID X X
[44] 3 MLP, LR, LDA X X
[4] 2 CHAID, CART X
[6] 2 MLP, LR X X X X X
[9] 2 LVQ, RBF, 22 DT, 9 SC X X
[45] 2 LDA, LR, KNN, KDE, CART, MLP, RBF, MOE, FAR, LVQ X X
[3] 1 MLP X X
[7] 2 LSSVM X X X
[11] 2 LR, LS-SVM, KNN, NB, DT X X X
[10] 1 LDA, QDA, LR, BMLP, DT, SVM, LSSVM, TAN, LP, KNN X X
[46] 2 LR, MLP, BMLP X X
[47] 2 LSSVM, SVM, DT, RL, LDA, QDA, LR, NB, IBL X X
[48] 1 DT, MLP, LR, FC X
[49] 1 FC X X
Majority of direct marketing papers focus on algorithm tuning
Only 3 papers consider Resampling / Instance Selection
No analysis of the interaction with Sampling & Projection & …
122. Classification
[Scatter plot of the last campaign: days since last purchase (few … many) vs. number of subscriptions (1 … many); classes: no response, subscribed to magazine]
Database of customers (instances)
Known attributes for all customers (age, gender, existing subscriptions, …)
Known response (class membership) of buyers & non-buyers from past mailings
Build a model to separate classes: decision boundary of different complexity
123. Classification
[Scatter plot: days since last purchase (few … many) vs. number of subscriptions (1 … many); classes: no response, subscribed to magazine, class unknown]
Use the decision boundary to classify unseen instances
Calculate on which side of hyperplane the instances lie (or distance)
Assign class to unseen instances
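The classification step can be sketched for a linear boundary as follows. The weights `w`, the offset `b` and the two unseen customers are hypothetical values chosen for illustration, not taken from the study.

```python
import math

def signed_distance(x, w, b):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def classify(x, w, b):
    """Assign a class by the side of the hyperplane on which x lies."""
    return "subscribed" if signed_distance(x, w, b) >= 0 else "no response"

# Hypothetical boundary over (number of subscriptions, days since last purchase).
w, b = (1.0, -0.01), -1.0
unseen = [(3, 50), (1, 400)]
labels = [classify(x, w, b) for x in unseen]
print(labels)
```

The distance itself can serve as a ranking score when only the most likely responders are to be mailed.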
124. Reality Check: Imbalanced classes
[Scatter plot: days since last purchase (few … many) vs. number of subscriptions (1 … many); classes: no response, subscribed to magazine]
Problem
• Classifiers are biased towards the majority class
• Shifts the decision boundary
• Error / accuracy based learning creates naïve classifiers
• Invalid separation of classes
Balanced dataset = class distributions are equal, P(y=A) = P(y=B):
proportional sampling or stratified sampling feasible
Imbalanced dataset = class distributions unequal, P(y=A) >> P(y=B)
The class of interest is often the minority (in most business applications)
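Why accuracy-based learning yields naïve classifiers can be shown with a few lines of arithmetic. The sketch below uses the record counts quoted for the magazine data set in this deck (300,000 customers, 4,019 subscriptions); the function name is illustrative.

```python
def majority_baseline(n_total, n_minority):
    """Accuracy and minority recall of a model that always predicts the majority class."""
    n_majority = n_total - n_minority
    accuracy = n_majority / n_total   # every majority instance counts as correct
    minority_recall = 0.0             # no minority instance is ever detected
    return accuracy, minority_recall

accuracy, recall = majority_baseline(n_total=300_000, n_minority=4_019)
print(f"accuracy={accuracy:.4f}, minority recall={recall:.1f}")
```

A model that never finds a single buyer still scores close to 99% accuracy, which is why accuracy alone is a poor training target here.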
125. Imbalanced Data Sampling
[Scatter plot: days since last purchase (few … many) vs. number of subscriptions (1 … many); classes: no response, subscribed to magazine]
Stratified Random Sampling: divide DB in mutually exclusive
strata (subpopulations) & draw random samples from each
• Proportional: assure proportions in samples equal those in population
• Disproportional: weighted over- & undersampling of important classes
Size of the sample?
Distribution / location of the sample?
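A minimal sketch of proportional versus disproportional stratified sampling, assuming the class label is the only stratification variable. The toy database and the sampling fractions are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, label_of, fractions, seed=0):
    """Draw a random sample from each stratum.

    fractions maps each class label to the share of that stratum to keep.
    Equal shares give a proportional sample, unequal shares a
    disproportional one.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[label_of(rec)].append(rec)
    sample = []
    for label, members in strata.items():
        k = round(len(members) * fractions[label])
        sample.extend(rng.sample(members, k))
    return sample

# Toy database: 1,000 customers, 2% responders (hypothetical figures).
db = [(i, "response" if i < 20 else "no response") for i in range(1000)]
label = lambda rec: rec[1]

proportional = stratified_sample(db, label, {"response": 0.1, "no response": 0.1})
disproportional = stratified_sample(db, label, {"response": 1.0, "no response": 0.05})
print(Counter(map(label, proportional)), Counter(map(label, disproportional)))
```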
126. Random Undersampling
since last purchase … Many
No response
Subscribed to magazine
Benefits
• Helps detect rare target levels
Risks
• Biases predictions (correctable)
• Looses information contained in
Few … Days
instances of the majority class
• Creates different boundaries
• Increases prediction variability
1… Number of subscriptions … Many
•…
Exclude random instances of the majority class
Retain all instances of the minority class
Establish a balanced class distribution
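The three steps can be sketched as below. The function name and the toy data are illustrative, and a fixed seed is used only to make the draw repeatable.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep all minority instances and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))  # drawn without replacement
    return kept_majority + list(minority)

# Toy data: 980 non-responders and 20 responders (hypothetical figures).
majority = [(i, "no response") for i in range(980)]
minority = [(i, "response") for i in range(980, 1000)]
balanced = random_undersample(majority, minority)
print(len(balanced))
```

Note that 960 of the 980 majority records are discarded, which is the information loss listed under the risks.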
127. Random Oversampling
since last purchase … Many No response
Subscribed to magazine
Benefits
• Helps detect rare target levels
• No loss of information
Risks
• Biases predictions (correctable)
Few … Days
• Increases prediction variability
• Increases processing time
1… Number of subscriptions … Many
Retain all instances of the majority class in the sample
Duplicate identical instances of the minority class
Establish a balanced class distribution
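A minimal sketch of random oversampling, together with the standard prior correction that maps a probability estimated on the rebalanced sample back to the population rate. The slide only says the bias is correctable; the specific correction shown here is an assumption about how one might do it, and all names and figures are illustrative.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Keep every instance and duplicate random minority instances until balanced."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]  # drawn with replacement
    return list(majority) + list(minority) + extra

def correct_probability(p_sample, population_rate, sample_rate):
    """Rescale a minority-class probability from the sample prior to the population prior."""
    pos = p_sample * population_rate / sample_rate
    neg = (1 - p_sample) * (1 - population_rate) / (1 - sample_rate)
    return pos / (pos + neg)

majority = [(i, "no response") for i in range(980)]
minority = [(i, "response") for i in range(980, 1000)]
balanced = random_oversample(majority, minority)

# A score of 0.5 on a 50/50 sample corresponds to the base rate in the population.
print(correct_probability(0.5, population_rate=0.013, sample_rate=0.5))
```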
128. Ready for more theory…?
Rather some case studies ...!
130. Business Case:
Direct Marketing/Response Optimization
• Sell a magazine subscription to existing customers
• Whom to send mail to? (Which customers are most likely to respond?)
• How many customers to contact? (What is the optimal mailing size?)
Corporate project with leading German Publishing House
Provided data set of past mailing campaigns
Benchmark novel methods against in-house SPSS Clementine
Explore Neural Networks (NN) and Support Vector Machines (SVM)
131. Benefits of Direct Marketing
                      Simple                With data mining
Addressees            100.000               Top 40% = 40.000
Cost                  2€/mail = 200.000€    2,5€/mail = 100.000€
Response rate         0,5% = 500            1,0% = 400
Sales per response    300€                  300€
Sales volume          150.000€              120.000€
Profit                -50.000€              20.000€
Smaller mailing (number of letters sent): lower costs (Euro 1.- per letter)
Higher response rate: higher revenue
More specific mailing: lower cost
More relevant information: higher customer satisfaction
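The figures in the table can be reproduced with a short calculation. The function and field names are illustrative; the inputs are the values from the slide.

```python
def campaign(addressees, cost_per_mail, response_rate, sales_per_response):
    """Cost, responses, sales volume and resulting profit of one mailing."""
    cost = addressees * cost_per_mail
    responses = addressees * response_rate
    sales = responses * sales_per_response
    return {"cost": cost, "responses": responses, "sales": sales, "profit": sales - cost}

simple = campaign(100_000, 2.0, 0.005, 300)    # mail everyone
targeted = campaign(40_000, 2.5, 0.010, 300)   # mail the top 40% only
print(simple)
print(targeted)
```

The targeted mailing generates fewer responses in absolute terms, yet it turns a loss into a gain because the cost falls faster than the sales volume.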
132. NN get worse with learning …
• Wish to implement Neural Networks for next campaign
• In-house team (with no NN knowledge) outperformed us EVERY TIME!
• Analyzed software, training parameters, etc.: internal competition
• Observed expert in building models … !
Confusion matrices (%):
%     Pred. C0   Pred. C1   Sum
C0    61.86      38.14      100
C1    55.09      44.81      100
      116.95     82.95      54.26

%     Pred. C0   Pred. C1   Sum
C0    72.96      27.04      100
C1    62.02      37.98      100
      134.98     65.02      55.47

%     Pred. C0   Pred. C1   Sum
C0    52.87      43.37      100
C1    47.13      56.63      100
      100        100        54.75
133. Experimental Design: Different data pre-processing
• Handle categorical features: different encoding (n, n-1, thermo, ordinal)
• Scale numerical features: different scaling (standardise, discretise)
• Adjust imbalanced class distributions: different sampling (over- & undersampling)
• Decide on sample size and method
• Handle outliers
• Select useful features
Evaluate across 3 algorithms:
Neural Networks (MLPs), Support Vector Machines & Decision Trees
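The scaling options for numerical features can be sketched as follows. This is a plain-Python illustration that assumes a non-constant feature; the `ages` values are hypothetical and equal-width binning is only one possible way to discretise.

```python
def scale(values, low, high):
    """Linearly rescale values to the range [low, high]."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin  # assumes the feature is not constant
    return [low + (high - low) * (v - vmin) / span for v in values]

def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def equal_width_bins(values, n_bins):
    """Discretise values into n_bins intervals of equal width (bin index from 0)."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]

ages = [18, 25, 40, 62, 80]  # hypothetical numerical feature
unit = scale(ages, 0, 1)       # [0, +1] range
symmetric = scale(ages, -1, 1) # [-1, +1] range
z = standardise(ages)
bins = equal_width_bins(ages, 3)
print(unit, symmetric, z, bins)
```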
134. Dataset Structure
Data set size
• 300,000 customer records
• 4,019 subscriptions sold
• Response rate of 1.3%
Data set structure
• 18 categorical features
• 35 numerical features
• Binary target variable
Evaluated the Impact of Data Preprocessing
• Data Sampling (over sampling vs. undersampling)
• Categorical attribute Encoding (N, N-1, thermo, ordinal)
• Continuous attribute Projection (Binning vs. Normalisation)
• Continuous attribute Scaling ( [0,+1] vs. [-1,+1] range)
Multifactorial design to evaluate impact across multiple methods
Neural Networks (NN)
Support Vector Machines (SVM)
Decision Trees (CART)
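The four categorical encodings named above can be illustrated as follows. The exact schemes used in the study are not given in the deck, so this sketch assumes common conventions: N-1 drops the last category as reference level, and the thermometer code switches on every position up to the rank of the value. The category lists are hypothetical.

```python
def one_of_n(value, categories):
    """N encoding: one binary indicator per category."""
    return [1 if value == c else 0 for c in categories]

def one_of_n_minus_1(value, categories):
    """N-1 encoding: as N, with the last category dropped as reference level."""
    return one_of_n(value, categories)[:-1]

def thermometer(value, levels):
    """Thermometer encoding for ordered levels: all positions up to the rank are 1."""
    rank = levels.index(value)
    return [1 if i <= rank else 0 for i in range(len(levels))]

def ordinal(value, levels):
    """Ordinal encoding: a single integer 1, 2, 3, ..."""
    return levels.index(value) + 1

regions = ["A", "B", "C"]               # hypothetical nominal feature
income = ["low", "medium", "high"]      # hypothetical ordered feature
print(one_of_n("B", regions), one_of_n_minus_1("B", regions))
print(thermometer("medium", income), ordinal("medium", income))
```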
135. Sampling
Created 2 Dataset Sampling candidates
Data partition (number of records)
                      Oversampling           Undersampling
Data subset           Class 1    Class -1    Class 1    Class -1
Training set          20,000     20,000      2,072      2,072
Validation set        10,000     10,000      1,035      1,035
SUM                   30,000     30,000      3,107      3,107
Test (hold-out) set   912        64,088      912        64,088
Different balancing in the training data
Original distribution in the test data (65,000 instances)
136. Results
[Figure: three result charts, each annotated "Increase"]
Oversampling outperforms undersampling consistently!
Gain in Lift depends on method (different sensitivity)
Oversampling has higher impact than data coding & scaling
137. Recommendations from Case Study
• Sampling
• Oversampling outperforms undersampling for all methods
• Undersampling: better in-sample results & worse out of sample
• Choice of method
• NN & SVM better than CART
• Encoding & Projection
• SVM: avoid Ordinal coding (e.g. 1,2,3) all other similar (incl. N !)
• NN: avoid standardization & ordinal encoding
• DT / CART: use thermometer encoding, all others similar (incl. ordinal)
Binning & Scaling of continuous attributes irrelevant for all methods!
Use Oversampling & N-1 encoding with SVM & NN
Best preprocessed SVM lift of 0.645 on test set … BUT …
138. Results across Pre-processing
Preprocessing: higher impact than method selection
Lift-variation per method from Sampling/Scaling/Coding
> Difference of Lift between competing methods!
[Three bar charts on the test data subset, per method (NN, SVM, DT): lift (axis 0,60 to 0,65), arithmetic mean performance (AM, axis 0,53 to 0,58) and geometric mean performance (GM, axis 0,50 to 0,58)]
DPP causes 50%-70% of the differences between models
Results are consistent across error measures
Experiments allow identification of "best practices" to model methods
Best-practice preprocessing varies between methods
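The deck does not define the arithmetic and geometric mean measures. A common reading, assumed here, is the mean of the per-class accuracies (the diagonal of the row-normalised confusion matrix). With the second confusion matrix shown earlier in this deck, that reading reproduces the 55.47 printed next to it.

```python
import math

def mean_class_accuracy(confusion):
    """Arithmetic and geometric mean of per-class accuracies.

    confusion[i][j] is the count (or share) of class i predicted as class j.
    """
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    am = sum(recalls) / len(recalls)
    gm = math.prod(recalls) ** (1 / len(recalls))
    return am, gm

confusion = [[72.96, 27.04],
             [62.02, 37.98]]
am, gm = mean_class_accuracy(confusion)
print(f"AM={am:.4f}, GM={gm:.4f}")
```

Both measures weight the classes equally, so neither can be inflated by predicting only the majority class; the geometric mean penalises uneven class accuracies more strongly.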
140. Business Case: Predicting
Customer Online Shopping Adoption
• Traditional buying process is offline & simultaneous “bricks” store
• Introduction of the Internet changes consumer behaviour
• Seek information online & offline
• Purchasing online & offline
Changing purchasing behaviour through internet adoption
Changing purchasing behaviour through Technology Acceptance
• Development of heterogeneous Purchasing Behaviour
• Example: Purchasing electronic durable consumer goods
• Search for product info (e.g. video cameras) online,
test product in-store,
search for best deal on internet & purchase
[Matrix: search for information (online / offline) vs. purchase (online / offline), with the segments Online Shoppers, Browsers and Non-Internet Shoppers]
141. Stages of Internet Adoption
1. OFFLINE BUYERS
Information gathering
& purchasing in Stores
2. BROWSERS
Information gathering online
& purchasing in stores
3.ONLINE BUYERS
Information gathering
& purchasing online
142. Motivation
DIDIER: Marketing Modelling
• Econometric / Marketing Domain
• Seeks to explain how customers behave in online shopping
• Use of "black-box" logistic regression models
Models class membership to identify causal variables that explain choices
Descriptive & Normative Modelling
Best practices
• Balance datasets for distribution representative of population
• Use ordinal variables & nominal variables without recoding
• Do not normalise / scale data
SVEN: Data Mining Perspective
• IS/OR/MS Domain Data Mining
• Seeks to accurately predict regardless of explanation why customers buy
• Use of "black-box" methods from computational intelligence
Models class membership to accurately classify unseen instances
Predictive Modelling
Best practices
• Rebalance datasets for equal distribution of target variables
• Recode ordinal to binary scale
• Rescale & normalise data to facilitate learning speed etc.
Same dataset & same objectives & similar methods
Conflicting "best practice" approaches to modelling
Outside of most software simulators!!! Implicit knowledge?
… WHO IS "CORRECT"? WHAT IS THE IMPACT?
143. Dataset
• Survey on Internet Shopping Behaviour
• 5500 UK households, 685 respondents
• Adjusted for age, income etc. of customers (older less likely to buy)
• Adjusted for product specific risk of online shopping for branded
durable consumer goods (inspection required to some extent)
• 73 questions on factors related to internet shopping, products etc.
Input Variables (mixed scale of nominal, ordinal, interval)
• Online Shopping Factors: "Going to the shops is as convenient
as Internet shopping", "I would buy online if products are
branded" etc. [1=strongly agree; …]
• Demographic Factors: Age, Gender, Income
• Internet Utility Factors: Score from 6 correlated variables
Models
• Logistic Regression
• Neural Networks
Output Variables
• Class 1: Browse Online & Buy Online
• Class 2: Browse Online & Buy Offline
• Class 3: Browse & Buy Offline
[Diagram boxes: Demographics, Internet specific Factors, Online shopping specific Factors]
144. Imbalanced Classification problem
• Split of Dataset for Training, Validation and Test {50%; 25%; 25%}
• Distribution of target classes is skewed
{65% online buyers; 22.5% browsers; 12.5% offline shoppers}
• Rebalancing of data sets through over- & undersampling
[Bar charts: count (0 to 400) per class (Online-Shoppers, Browsers, Offline-Shoppers) for the imbalanced, oversampled and undersampled datasets, by data subset (training, validation, test)]
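A minimal sketch of the split-then-rebalance procedure for three classes, assuming the split is stratified by class and that only the training subset is oversampled. The 800 toy labels are hypothetical and merely mimic the 65% / 22.5% / 12.5% skew; they are not the survey data.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, shares=(0.5, 0.25, 0.25), seed=0):
    """Split record indices into training/validation/test, stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, valid, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(len(members) * shares[0])
        n_valid = round(len(members) * shares[1])
        train += members[:n_train]
        valid += members[n_train:n_train + n_valid]
        test += members[n_train + n_valid:]
    return train, valid, test

def oversample_to_largest(indices, labels, seed=0):
    """Duplicate random instances of the smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(m) for m in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced += members + [rng.choice(members) for _ in range(target - len(members))]
    return balanced

labels = ["online"] * 520 + ["browser"] * 180 + ["offline"] * 100
train, valid, test = stratified_split(labels)
balanced_train = oversample_to_largest(train, labels)
print(Counter(labels[i] for i in balanced_train))
print(Counter(labels[i] for i in test))
```

Keeping the test subset at its original distribution means the evaluation reflects the skew the model will meet in practice.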
147. Summary
Oversampling outperforms other samplings
- Across Different Datasets
- Across various data preprocessing
Methods show different sensitivity to Sampling
- More variation from sampling, coding & scaling than between methods
- Using different preprocessing variants is important in modeling
Various sophisticated extensions exist
- SMOTE (Synthetic Minority Oversampling Technique)
- K-nearest Neighbor sampling (removal / creation)
- One-class learning etc. …
Extend your bag of tricks …
- … and experiment with imbalanced sampling!
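To give a flavour of the SMOTE idea, here is a simplified plain-Python sketch: a synthetic minority point is placed at a random position on the line between a minority instance and one of its k nearest minority neighbours. It is not a reference implementation; the base instance is picked at random rather than iterated over, and the toy points are hypothetical.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position between base (0) and neighbour (1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Unlike random oversampling, the new points are not exact duplicates, which is intended to reduce the risk of overfitting to individual minority records.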
148. Questions?
Sven F. Crone
Lancaster University Management School
Centre for Forecasting
Lancaster, LA1 4YX
email s.crone@lancaster.ac.uk