BAQMaR 2008

Online & offline

Marketing research domains

Friend organisations

Neighbour countries

Marketing Researchers TODAY ?!

C L
Well known
Popular
Unique

Me, MySpace & I
Visual (n) ethnography among 13-17 year olds

Joeri Van den Bergh (InSites Consulting)
Veerle Colin (MTV Networks)

The Research Briefing

WHY TELL ME WHY

Getting intimate with our target groups

ID construction of adolescents in a
digitalised world

Social groups as an extension of
psychographic segments

Role of brands in ID construction within
social groups

The Research Approach

I SEE YOU, BABY

Traditional ethnography
Participant in context, observed by the researcher (R)
• Hawthorne effect
• Researcher gaze
• Time & cost intensive

From traditional to visual ethnography

Traditional ethnography (participant in context, researcher R)
• Hawthorne effect
• Researcher gaze
• Time & cost intensive

360° visual ethnography (participant in context, researcher R)
• Contact with researcher via 2.0 tools
• “Informer” gaze: what is important?
• Follow multiple participants from anywhere

360° Ethnography

1. User-generated MM ethnography
= Observation takes place via photos/video taken by the participants in the study
= Participants observe their own environment & report back to the researcher

Ethnographic blog: identity related tasks

Non me: pictures of...
• Clothes you no longer want to wear
• Youngsters you so do not want to be friends with

Aspirational me: pictures of...
• Favorite clothes
• Youngsters you would like to be friends with

Personal me: pictures of...
• Place where you can really be yourself
• Clothes you wear at home
• Objects that are typical for me

Social me: pictures of...
• Important persons in your life
• Clothes you wear to go out
• Groups of youngsters you like

Other groups
• Pictures of other groups that are different but okay
• Pictures of youth of today
• Pictures of normal persons

360° Ethnography
1. User-generated ethnography
= Observation takes place via photos/video taken by the participants in the study
= Participants observe their own environment & report back to the researcher

2. Nethnography: social life of teenagers is very much MOVING ONLINE !!!
= Observation of the online behaviour & content of a target group or within a certain webspace

Nethnography: we are your friends

Social Network Sites: nethnography

Personal identity / Social identity
• Nickname
• Clan membership
• Profile text
• Profile picture
• Photo collection
• Conversation monitoring on guestbook

And for the datamining freaks among you, Annelies ripped the internet

• 300 active participants of netlog randomly
selected. Equal spread age x gender
• Web crawlers to “scan” pages of netlog and extract content.
• Textmining: profile pages – photo tags –
clan membership – conversations on the
guestbook

Other online behaviour: tracking tool

That's why we call it (visual) (n)ethnography: shoot me!

100% UGC OBSERVED BY US

The Research Analysis

THE HARD PART

Social networks are not as social as you think they are

Only 32% of the conversations on the guestbook are “interactive”!
The rest are all single statements.

[Word cloud of guestbook conversation topics: Food, Little kids, Shoes, Mobile phones, Cars, Quarrel, Miss you, Making fun, Confirm friendship, Travel, Music, Welcome by strangers, Transport, It was great, Feedback on picture, I am bored, Youth movement, Alcohol, Sleeping, How are you?, Practical appointment, School, Feedback on profiles, MSN, Festivals, Congratulations, Family, Gaming, Age, Party, Sport, Online movies, Express love, Study, Tokio Hotel]

one of the 3 key research questions

Social groups among today's youngsters

Which youth group do I belong to?

Methodology

[Diagram: ME at the centre, surrounded by Social Group, Aspirational Group, Non group, and Different but OK]

[Figure: map of youth groups. Axis and area labels: REAL WORLD, CHANGE, CONSERVATISM, WE, ME (I'm better), THINK, DRINK, MAINSTREAM. Groups plotted: Skater, Alternative, Fashion girl, Fashion boy, Tektonic, Rapper, Rockers, Hippies, Breezer sluts, Punk, Jumpers, Emo, Nerds, Gothic, Geek girls]

spartacus121 best pk (runescape)
30/04/2008 19:07:55 by laurent
I hit something around 27 and that is already crazy high, his hits are insane lol. His sword is 140M (I have 10M)

[Figure: the same map of youth groups with a SKILLS overlay]

[Figure: the same map of youth groups with a LOOKS overlay]

[Figure: the same map of youth groups with the SKILLS and LOOKS overlays combined]

How this research changed our lives

REMEMBER ME

6 changes that rocked our socks off

• The tools for ID construction have changed

• New online quali methods proved to be efficient

• Our target group = the new and better quali researchers

• New reflection kit for entire MTV Networks staff

• Closer connection with MTV & new clients

• Redefine content strategy of TMF on screen & on line

4C Consulting

Introduction to our services

Our Mission | Boosting your customer value

4C Consulting

helps companies

win, keep and grow

customer value

Our Solutions | Call us for…

• Customer Value Strategy
• Business requirements definition
• Customer Insight
• Package selection & implementation
• Process Excellence
• Post-launch care

Our Focus | Boosting your customer value

Increase Revenue
1. Acquire new customers
2. Sell more to existing customers
• More of the same (increase turnover)
• Expand value proposition portfolio (cross-sell & product development *)
• Upgrade value proposition (upsell)
3. Prevent existing customers from leaving

Reduce Costs
1. Efficient delivery (Process Excellence)
2. Align value propositions
3. Advanced pricing

Business Intelligence Practice

SCV ROMSCCI Competing on
Analytics

Infrastructure

Audit Operation Coaching

Data Quality

Why 4C Consulting | 7 compelling reasons

1. 100% focus on customer value
management

2. Result-oriented project approach

3. Connecting marketing, sales &
customer care with senior management
and IT

4. Independent consultant for 10 years

5. Experienced crew, passionate about
marketing, sales & customer care

6. Value based pricing model

7. Satisfied & loyal customers: 90
customers, more than 380 projects

Optimize your business with Business & Decision

Michel Meulders
- Domain Manager -

Business & Decision Benelux
Founded in 2002
Merger of several companies specialised in BI
Consulting & System Integrator:
- More than 300 consultants
- About 18 million euro turnover in 2007
- 58% organic growth compared to 2006
- Last acquisition: BnV Group (BE+NL)

Business & Decision Benelux is:
• a multi-specialist in specific technology fields:
  • Business Intelligence
  • Customer Relationship Management
  • Life sciences
  • Risk & Compliance
• with foreign offices in Brussels, Amsterdam, Luxembourg
• Top accounts in finance, pharma, telco, distribution, industry (Fortis, ING, ABN Amro, Dexia, GSK, UCB, Proximus, Belgacom, Carrefour, Honda, …)

[Chart: Turnover evolution (consolidated), in thousands (axis 0 to 25000), for Belgium, Luxembourg and the Netherlands; years 2004, 2005, 2006, 2007 and objective 2008]

For more info see
http://www.businessdecision.com

Belgian Federation of
Market Research Institutes

www.febelmar.be

Febelmar mission
Development and promotion of market
research and opinion polls in Belgium
Protecting the sector interests
Watching over correct use of deontological
rules of market research in all phases of the
market research process
Stimulating continuous improvement of
quality of service in market research
Being a platform for communication,
exchange of expertise and networking

Members

27 agencies.

Together they represent about 75% of
the total market research expenditures in
Belgium.

What is I4BI?

i4bi is specialized in implementing Business Intelligence Solutions in
your company.

Our team of BI experts has deep functional and technical experience
with application development, business model definitions and data
warehousing. Functional/technical design, development, application
roll-out, training etc. are project phases in which our consultants
have many years of experience.

i4bi consultants have a deep knowledge of the Oracle Business
Intelligence products and solutions.

i4bi sponsors the development of an independent analytical branch,
which will probably see the light in 2009

What do we provide?

Strategic Decision Making

To Support

Business Technical Analytical
Experience Abilities Expertise

We Provide

Expertise

• Analytical Expertise
• Data Mining
• Statistical modelling
• Predictive analysis
• Basel II compliant modelling
• Forecasting

• Business Analysis Expertise
• Reporting
• Delivering Business Insight to decision makers
• Marketing Analysis
• Data Quality
• Use of technical tools such as SAS – SPSS – Statistica to
support & extend business knowledge

Contact

• For more general information:

www.I4BI.be

• For more analytical information:

Filip.deroover@I4BI.be

InSites Consulting

6 beliefs in 60 seconds

We believe ...
in the empowered consumer

Human-to-human interactions are more powerful than
ever and can make or break your brand

“Consumers are beginning in a very
real sense to own our brands and
participate. We need to begin to
learn how to let go”

A.G. Lafley, CEO & Chairman of P&G

We believe ...
in giving back

Rewarding experiences for participants
Active involvement of panel members
Charity contributions

We believe ...
in connecting

Everything we do is aimed at strengthening connections
between you, your market and us

“Connected Research” brings you closer to your market
and taps into the wisdom of the crowds

Some of our connected research methods
Research communities
Bulletin boards
Blog research
Online discussion groups

More information: http://connectedresearch.insites.eu/

We believe ...
in the power of new research methods
for better marketing decision making

Informational
Providing more depth to research insights

Transformational
Doing things that were previously not possible

Automational
Conducting research more efficiently

We believe ...
in 1 + 1 = 3

Old and new methods need to be optimally “fused” in
order to fully grasp the new customer / consumer reality

We believe ...
in the power of our team

People make the difference
Open, forward thinking, dedicated, passionate
Specific knowledge centers

performance management • consulting • technology

Welcome to Keyrus
BAQMaR, 17 December 2008

©Keyrus – all rights reserved

About Keyrus (Belgium)

• founded in 1996 as SOLIDPartners
• focus on performance management, business
intelligence & data warehousing
• strong and balanced client base spread over different
industries
• +100 consultants specialised in both technical and
business domains
• part of Keyrus group (France)

Keyrus' global footprint

• head office in Paris
• present in 9 countries
• +1300 employees
• listed on Paris stock exchange Euronext

Vision & mission

Keyrus will be one of the few leading service
providers
in the area of performance management.

We help our clients to effectively design,
build and operate
the adequate performance management
organization and solutions
in an integrated end-to-end fashion.

Portfolio of solutions & services

• Information Management (IM)
• Business Intelligence Platforms (BIP)
• Analytic Applications (AA)
• People and Processes (P&P)
• Corporate Performance Management ((C)PM)

[Architecture diagram. Labels: source systems & applications; data delivery & management functions; data warehouse & data marts; information outflow functions; BI layer (reports, OLAP, dashboards, alerts); CPM data delivery, exchange & synchronization; CPM applications (e.g. planning, ABM, PA); analytic applications (e.g. data mining)]

Contact us

Keyrus nv
info@keyrus.be

Nijverheidslaan 3/2
B-1853 Strombeek-Bever
t +32 2 706 03 00
f +32 2 706 03 09
www.keyrus.be


Who is PROFACTS ?

We are “the new kids on the block”
in (online) market research ...

1 REVEALING FACTORS FOR SUCCESS
strategy
200% 286 people are nowyrs
growth rate working mean age
@ Profacts

2
people have
founded Profacts

Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
2006 2007 2007 2007 2007 2008 2008 2008 2008

REVEALING FACTORS FOR SUCCESS

Profacts is active in more than 10 sectors ...

AUTOMOTIVE FMCG

RECRUITMENT GPS

BANKING INSURANCES

ICT PHARMACEUTICAL

TELECOM ENERGY

Python Predictions

PREDICT

Python Predictions

GROWTH

Python Predictions

www.pythonpredictions.com

Rogil Research

A research agency with a view

MARKETING & SENSORY RESEARCH
OUR PASSION

Sensory research
FTF research (Mobile Unit)
Eye-tracking / Eye|watch
Telephone research
Tachistoscope
Online research
Trained panel
Panel services
Consumer panel
Fieldwork in Europe
Taste lab

Sensory Safari

Note in your agenda
SENSORY SAFARI

• March 26th 2009
• 18:00
• At Rogil in Leuven

Sensory Safari

5 SENSES MARKETING

We hope to welcome you in March.

Thanks for your attention !

SAS Analytics For Challenging Times

Start Focused, Think Wide

Campaign Management Requires
Optimization

CRM is becoming Risk Management

SAS Breadth of Analytic Offering…

• Statistical Analysis
• Survey Design/Analysis
• Data Mining
• Text Mining
• Time Series Mining
• Forecasting
• Quality Improvement
• Operations Research

SAS Innovations in Marketing
Solutions….

http://www.sas.com/feature/analytics/index.html

Copyright © 2006, SAS Institute Inc. All rights reserved.

The Mission

Drive the widespread use of
data in decision making

The Focus

Attract Retain Grow Fraud Risk

Driving and Maximizing Profit

The Vision

Operational
Processes

Interaction data, Attitudinal data, Descriptive data, Behavioral data

Enterprise Data Sources

The Acceptance

• The rise of the agnostics
  • Science vs. Chance
  • In numbers we trust!

The myth of the “best” algorithm
lessons learned from innovations in
data sampling and data pre-processing
for marketing analytics

Dr. Sven F. Crone
Deputy Director, Ass. Prof.

Directors
Prof. Robert Fildes
Prof. Peter Young
Dr. Sven F. Crone

Researchers
Dr. Steve Finlay
Dr. Alastair Robertson
Dr. Didier Soopramanien
Dr. Kostas Nikolopoulos
Prof. Stephen Taylor
Dr. Wlodek Tych
Prof. David Peel
Prof. Peter Pope

Associated Experts
Prof. Paul Goodwin
Dr. Andrew Eaves

Research & PhD students
Heiko Kausch, RA
Stavros Asimakopoulos
Xi Chen
Bruce Havel
Suzi Ismail
Nikolaos Kourentzes
Ioannis Stamatopoulos
Andrey Davidenko
Charlotte Brown
Hong Juan Liu
T Hu
John Prest
Huang Tao

Visiting Researchers
Prof. Geoff Allen
Dr. Yukun Bao
Young-Sang Cho

“Take away this pudding, it has no
theme.” Sir Winston Churchill (1915)

Agenda
• Sampling issues in Data Mining
• Case study 1: Direct Marketing
• Cross-selling of Magazine subscriptions
• Effect of data preprocessing: Sampling
• Interaction of Sampling with Scaling & Coding
• Case study 2: Credit & Behavioral Scoring
• Predicting consumer credit default
• Effects of sample size
• Effects of sample distribution
• Case study 3: Online Shopping Behaviour
• Predicting consumer shopping channel choice
• Sample distribution & multiple classes
• Conclusion & Take-aways

Why (Under/Over) Sampling?
• Knowledge Discovery (KDD) = non-trivial process of identifying
valid, novel, useful patterns in large data sets
• Data Mining = only one single step in the KDD process
• Data sample determines the whole process! (GIGO: garbage in, garbage out)
• “Research seems preoccupied with algorithms” [Hand 2000]

SAS SEMMA DM-Process

Monitoring

CRISP-DM Process

Sampling in Direct Marketing Literature?
Columns: Ref | Input type* | Methods*** | Parameter tuning | Data reduction** (Feature selection, Resampling) | Data projection (Continuous attributes: Standardisation, Discretisation; Categories: Coding)

[2]   2  BMLP, LR, LDA, QDA                                      X X
[42]  1  MLP, LR, CHAID                                          X X
[43]  2  MLP, RBF, LR, GP, CHAID                                 X X
[44]  3  MLP, LR, LDA                                            X X
[4]   2  CHAID, CART                                             X
[6]   2  MLP, LR                                                 X X X X X
[9]   2  LVQ, RBF, 22 DT, 9 SC                                   X X
[45]  2  LDA, LR, KNN, KDE, CART, MLP, RBF, MOE, FAR, LVQ        X X
[3]   1  MLP                                                     X X
[7]   2  LSSVM                                                   X X X
[11]  2  LR, LS-SVM, KNN, NB, DT                                 X X X
[10]  1  LDA, QDA, LR, BMLP, DT, SVM, LSSVM, TAN, LP, KNN        X X
[46]  2  LR, MLP, BMLP                                           X X
[47]  2  LSSVM, SVM, DT, RL, LDA, QDA, LR, NB, IBL               X X
[48]  1  DT, MLP, LR, FC                                         X
[49]  1  FC                                                      X X

Majority of direct marketing papers focus on algorithm tuning
Only 3 papers consider Resampling / Instance Selection
No analysis of the interaction with Sampling & Projection & …

Classification
[Figure: scatter plot of customers. Axes: “Number of subscriptions” (1 … many) and “Days since last purchase” (few … many). Classes from the last campaign: “No response” vs. “Subscribed to magazine”]

Database of customers (instances)
Known attributes for all customers (age, gender, existing subscriptions, …)
Known response (class membership) of buyers & non-buyers from past mailings
Build a model to separate classes → decision boundary of different complexity

Classification
[Figure: the same scatter plot (“Days since last purchase”, few … many) with “No response” instances and new instances marked “Class unknown”]


Use the decision boundary to classify unseen instances
Calculate on which side of hyperplane the instances lie (or distance)
Assign class to unseen instances
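
The three steps above (fit a boundary, compute the side of the hyperplane, assign a class) can be sketched with a nearest-centroid classifier. This is only an illustration, not the model used in the talk; the toy customers and the two attributes (number of subscriptions, days since last purchase) are hypothetical.

```python
# Minimal sketch: a linear decision boundary halfway between two class means.
# The data below is invented for illustration only.

def centroid(rows):
    """Component-wise mean of a list of equally long tuples."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def fit(responders, non_responders):
    """Return (w, b) of the hyperplane w.x + b = 0 between the class centroids."""
    c1, c0 = centroid(responders), centroid(non_responders)
    w = [a - b for a, b in zip(c1, c0)]
    midpoint = [(a + b) / 2 for a, b in zip(c1, c0)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b

def score(model, x):
    """Signed value: positive on the responder side, negative on the other side."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(model, x):
    """Assign class 1 (responder) or 0 (non-responder) to an unseen instance."""
    return 1 if score(model, x) > 0 else 0

# (number of subscriptions, days since last purchase)
responders = [(5, 30), (6, 45), (4, 20)]
non_responders = [(1, 300), (2, 250), (1, 400)]
model = fit(responders, non_responders)

print(classify(model, (5, 40)))    # unseen customer close to the responders
print(classify(model, (1, 350)))   # unseen customer close to the non-responders
```

A real project would replace the centroid rule with the NN, SVM or tree models discussed later; the scoring and assignment steps stay the same.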

Reality Check: Imbalanced classes
[Figure: scatter plot (“Days since last purchase”, few … many) dominated by the “No response” class]

Problem
• Classifiers are biased towards the majority class
• Shifts the decision boundary
• Error / accuracy based learning creates naïve classifiers
• Invalid separation of classes

Balanced dataset = class distributions are equal, P(y=A) = P(y=B)
→ proportional sampling or stratified sampling feasible
Imbalanced dataset = class distributions are unequal, P(y=A) >> P(y=B)
The class of interest is often the minority (in most business applications)
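
Why accuracy-based learning produces naïve classifiers can be shown with a few lines of arithmetic. The sketch below assumes 1,000 customers with a 1.3% response rate (the rate reported for the magazine case study later in the talk); the helper names are illustrative.

```python
# Minimal sketch: a classifier that always predicts the majority class
# looks excellent on accuracy and is useless for finding responders.

def accuracy(y_true, y_pred):
    """Share of instances whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of true positives that the classifier actually finds."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)

y_true = [1] * 13 + [0] * 987      # 13 responders, 987 non-responders
y_naive = [0] * 1000               # always predict "no response"

naive_accuracy = accuracy(y_true, y_naive)
naive_recall = recall(y_true, y_naive)
print(naive_accuracy, naive_recall)
```

The naive model is right for 987 of 1,000 customers, yet it identifies none of the responders, which is the class the business cares about.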

Imbalanced Data Sampling
Stratified Random Sampling
divide DB into mutually exclusive strata (subpopulations) & draw random samples from each
Proportional
assure proportions in samples equal those in population
Disproportional
weighted over- & undersampling of important classes

Size of the sample?
Distribution / location of the sample?
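
A minimal sketch of stratified random sampling, assuming records carry a class label and each stratum gets its own sampling fraction. With equal fractions the sample is proportional; with unequal fractions it is disproportional. The function name, the toy records and the fractions are hypothetical.

```python
import random

def stratified_sample(records, label_of, fractions, seed=0):
    """Draw a random sample from each stratum using a per-stratum fraction."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(label_of(rec), []).append(rec)
    sample = []
    for label, members in strata.items():
        k = round(len(members) * fractions[label])
        sample.extend(rng.sample(members, k))
    return sample

# Toy database: 20 responders, 980 non-responders
records = [("responder", i) for i in range(20)] + [("non-responder", i) for i in range(980)]
label = lambda rec: rec[0]

# Proportional: same fraction in every stratum, class shares are preserved
proportional = stratified_sample(records, label, {"responder": 0.1, "non-responder": 0.1})

# Disproportional: keep all responders, thin out the majority to the same count
disproportional = stratified_sample(records, label, {"responder": 1.0, "non-responder": 20 / 980})

count = lambda sample, name: sum(1 for rec in sample if rec[0] == name)
print(count(proportional, "responder"), count(proportional, "non-responder"))
print(count(disproportional, "responder"), count(disproportional, "non-responder"))
```

The two open questions on the slide (size and distribution of the sample) map directly onto the choice of fractions.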

Random Undersampling
Benefits
• Helps detect rare target levels

Risks
• Biases predictions (correctable)
• Loses information contained in instances of the majority class
• Creates different boundaries
• Increases prediction variability
• …

Exclude random instances of the majority class
Retain all instances of the minority class
Establish a balanced class distribution
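
The three steps can be sketched as below, together with one common way of correcting the prediction bias that rebalancing introduces. The correction shown is the standard prior-shift adjustment, which is an assumption here since the slides do not say which correction was meant; function names and data are illustrative.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep all minority instances and an equally large random subset of the majority."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    return kept_majority, list(minority)

def correct_probability(p_sample, prior_population, prior_sample=0.5):
    """Map a score learned on the rebalanced sample back to the population base rate.

    p_sample:         predicted minority probability from the model trained on the sample
    prior_population: minority share in the original data
    prior_sample:     minority share in the rebalanced training sample
    """
    num = p_sample * prior_population / prior_sample
    den = num + (1 - p_sample) * (1 - prior_population) / (1 - prior_sample)
    return num / den

majority = list(range(980))             # hypothetical non-responders
minority = list(range(1000, 1020))      # hypothetical responders
kept, responders = random_undersample(majority, minority)
print(len(kept), len(responders))

# A score of 0.5 on the balanced sample corresponds to the base rate in the population
print(correct_probability(0.5, prior_population=0.013))
```

The correction only rescales probabilities; the ranking of customers, and therefore lift, is unchanged.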

Random Oversampling
Benefits
• Helps detect rare target levels
• No loss of information

Risks
• Biases predictions (correctable)
• Increases prediction variability
• Increases processing time


Retain all instances of the majority class in the sample
Duplicate identical instances of the minority class
Establish a balanced class distribution
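
A minimal sketch of these three steps, assuming the majority class is at least as large as the minority class. Names and toy data are illustrative.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Keep every majority instance and duplicate random minority instances
    until both classes have the same size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    duplicates = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + duplicates

majority = list(range(980))             # hypothetical non-responders
minority = list(range(1000, 1020))      # hypothetical responders
balanced_majority, balanced_minority = random_oversample(majority, minority)
print(len(balanced_majority), len(balanced_minority))
```

Because the added instances are exact copies, no information is lost, but the training set grows, which explains the longer processing time listed under risks.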

Ready for more theory…?

→ rather some case studies ...!

Business Case:
Direct Marketing/Response Optimization

• Sell a magazine subscription to existing customers

• Whom to send mail to? (Which customers are most likely to respond?)
• How many customers to contact? (What is the optimal mailing size?)
Corporate project with leading German Publishing House
Provided data set of past mailing campaigns
Benchmark novel methods against in-house SPSS Clementine
Explore Neural Networks (NN) and Support Vector Machines (SVM)

Benefits of Direct Marketing

                                   Simple                With data mining
Addressees                         100.000               Top 40% = 40.000
Cost                               2€/mail = 200.000€    2,5€/mail = 100.000€
Response rate                      0,5% = 500            1,0% = 400
Average sales volume               300€                  300€
Sales volume                       150.000€              120.000€
Net result (sales volume - cost)   -50.000€              20.000€

Smaller mailing (number of letters sent) → lower costs (Euro 1.- per letter)
Higher response rate → higher revenue
More specific mailing → lower cost
More relevant information → higher customer satisfaction
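
The figures in the comparison can be reproduced with a few lines of arithmetic. The function and field names below are illustrative; the inputs are the numbers from the slide.

```python
def campaign_result(addressees, cost_per_mail, response_rate, sales_per_response):
    """Cost, responses, sales volume and net result of one mailing campaign."""
    cost = addressees * cost_per_mail
    responses = addressees * response_rate
    sales = responses * sales_per_response
    return {"cost": cost, "responses": responses, "sales": sales, "net": sales - cost}

simple = campaign_result(100_000, 2.0, 0.005, 300)     # mail everyone
targeted = campaign_result(40_000, 2.5, 0.010, 300)    # mail the top 40% only

print(simple)
print(targeted)
```

The targeted campaign sells less in total but turns a loss into a positive net result, because the cost drops faster than the sales volume.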

NN get worse with learning …

• Wish to implement Neural Networks for next campaign
• In-house team (with no NN knowledge) outperformed us EVERY TIME!
• Analyzed software, training parameters, etc. → internal competition
• Observed expert in building models … !

%       Pred. C0   Pred. C1   Sum
C0      61.86      38.14      100
C1      55.09      44.81      100
        116.95     82.95      54.26

%       Pred. C0   Pred. C1   Sum
C0      72.96      27.04      100
C1      62.02      37.98      100
        134.98     65.02      55.47

%       Pred. C0   Pred. C1   Sum
C0      52.87      43.37      100
C1      47.13      56.63      100
        100        100        54.75

Experimental Design:
Different data pre-processing

• Handle categorical features → Different Encoding (n, n-1, thermo, ordinal)
• Scale numerical features → Different Scaling (Standardise, Discretise, …)
• Adjust imbalanced class distributions → Different Sampling (Over- & Undersampling)
• Decide on sample size and method
• Handle outliers
• Select useful features

Evaluate across 3 algorithms:
Neural Networks (MLPs), Support Vector Machines & Decision Trees

Dataset Structure

Data set size Data set structure
• 300,000 customer records • 18 categorical features
• 4,019 subscriptions sold • 35 numerical features
• Response rate of 1.3% • Binary target variable

Evaluated the Impact of Data Preprocessing
• Data Sampling (over sampling vs. undersampling)
• Categorical attribute Encoding (N, N-1, thermo, ordinal)
• Continuous attribute Projection (Binning vs. Normalisation)
• Continuous attribute Scaling ( [0,+1] vs. [-1,+1] range)
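
The encoding and scaling variants named above can be sketched as follows. The thermometer code shown uses N-1 bits, which is one common variant and an assumption here, since the slides do not define it; all function names are illustrative.

```python
def encode_n(index, n):
    """N encoding (one-hot): one column per category."""
    return [1 if i == index else 0 for i in range(n)]

def encode_n_minus_1(index, n):
    """N-1 encoding (dummy coding): the last category is the all-zero reference."""
    return [1 if i == index else 0 for i in range(n - 1)]

def encode_thermometer(index, n):
    """Thermometer encoding (assumed N-1 bit variant): level k switches on the first k bits."""
    return [1 if i < index else 0 for i in range(n - 1)]

def encode_ordinal(index, n):
    """Ordinal encoding: a single column holding the rank 1..N."""
    return [index + 1]

def scale(values, low, high):
    """Linearly rescale values to the range [low, high]."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

# A categorical attribute with 3 levels, encoded for the middle level (index 1)
print(encode_n(1, 3), encode_n_minus_1(1, 3), encode_thermometer(1, 3), encode_ordinal(1, 3))

# A continuous attribute scaled to [0,+1] and to [-1,+1]
print(scale([10, 20, 30], 0, 1), scale([10, 20, 30], -1, 1))
```

Ordinal coding imposes equal distances between levels, which is one plausible reason it behaves differently from the binary codes for distance-based methods.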

Multifactorial design to evaluate impact across multiple methods
Neural Networks (NN)
Support Vector Machines (SVM)
Decision Trees (CART)
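
A multifactorial design of this kind can be enumerated mechanically. The sketch below crosses every factor level with every method; whether the study ran the complete cross is an assumption, and the dictionary keys are illustrative labels.

```python
from itertools import product

# Factor levels taken from the list above
factors = {
    "sampling": ["oversampling", "undersampling"],
    "encoding": ["N", "N-1", "thermometer", "ordinal"],
    "projection": ["binning", "normalisation"],
    "scaling": ["[0,+1]", "[-1,+1]"],
    "method": ["NN", "SVM", "CART"],
}

# One experiment per combination of factor levels
design = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(design))
print(design[0])
```

Each entry describes one preprocessing pipeline plus method, so results can later be grouped by any single factor to isolate its effect.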

Sampling

→ Created 2 dataset sampling candidates

Data partition (number of records)
                       Oversampling            Undersampling
Data subset            Class 1    Class -1     Class 1    Class -1
Training set           20,000     20,000       2,072      2,072
Validation set         10,000     10,000       1,035      1,035
SUM                    30,000     30,000       3,107      3,107
Test (hold-out) set    912        64,088       912        64,088

Different balancing in the training data
Original distribution in the test data (65,000 instances)

Results

[Figure: three result charts, each annotated “Increase”]

Oversampling outperforms undersampling consistently!
Gain in Lift depends on method (different sensitivity)
Oversampling has higher impact than data coding & scaling

Recommendations from Case Study

• Sampling
• Oversampling outperforms undersampling for all methods
• Undersampling: better in-sample results & worse out of sample
• Choice of method
• NN & SVM better than CART
• Encoding & Projection
• SVM: avoid ordinal coding (e.g. 1,2,3); all others similar (incl. N!)
• NN: avoid standardization & ordinal encoding
• DT / CART: use thermometer encoding; all others similar (incl. ordinal)

Binning & Scaling of continuous attributes irrelevant for all methods!
Use Undersampling & N-1 encoding with SVM & NN
Best preprocessed SVM → lift of 0.645 on test set … BUT …

Results across Pre-processing
→ Preprocessing: higher impact than method selection
→ Lift variation per method from Sampling/Scaling/Coding
  > difference in Lift between competing methods!

[Figure: three panels on the test data subset, each comparing NN, SVM and DT: Lift performance (axis 0,60 to 0,65), Arithmetic Mean performance (axis 0,53 to 0,58), Geometric Mean performance (axis 0,50 to 0,58)]
Data preprocessing (DPP) causes 50%-70% of the differences between models

Results are consistent across error measures
Experiments allow identification of “best practices” to model methods
Best-practice preprocessing varies between methods

Business Case: Predicting
Customer Online Shopping Adoption

• Traditional buying process is offline & simultaneous → “bricks” store
• Introduction of the Internet changes consumer behaviour
• Seek information online & offline
• Purchasing online & offline
→ Changing purchasing behaviour through internet adoption
→ Changing purchasing behaviour through Technology Acceptance
• Development of heterogeneous Purchasing Behaviour
• Example: Purchasing electronic durable consumer goods
• Search for product info (e.g. video cameras) online
→ test product in-store
→ search for best deal on internet & purchase

[Diagram:
Search for information online → purchase online: Online Shoppers
Search for information online → purchase offline: Browsers
Search for information offline → purchase offline: Non-Internet Shoppers]

Stages of Internet Adoption

1. OFFLINE BUYERS
Information gathering
& purchasing in Stores
2. BROWSERS
Information gathering online
& purchasing in stores
3.ONLINE BUYERS
Information gathering
& purchasing online

Motivation

DIDIER: Marketing Modelling
• Econometric / Marketing Domain
• Seeks to explain how customers behave in online shopping
• Use of “black-box” logistic regression models
Models class membership to identify causal variables that explain choices
Descriptive & Normative Modelling
Best practices
→ Balance datasets for distribution representative of population
→ Use ordinal variables & nominal variables without recoding
→ Do not normalise / scale data

SVEN: Data Mining Perspective
• IS/OR/MS Domain → Data Mining
• Seeks to accurately predict regardless of explanation why customers buy
• Use of “black-box” methods from computational intelligence
Models class membership to accurately classify unseen instances
Predictive Modelling
Best practices
→ Rebalance datasets for equal distribution of target variables
→ Recode ordinal → binary scale
→ Rescale & normalise data to facilitate learning speed etc.

Same dataset & same objectives & similar methods
Conflicting “best practice” approaches to modelling
Outside of most software simulators!!! Implicit knowledge?
… WHO IS “CORRECT”? WHAT IS THE IMPACT?

Dataset

• Survey on Internet Shopping Behaviour
• 5500 UK households → 685 respondents
• Adjusted for age, income etc. of customers (older less likely to buy)
• Adjusted for product specific risk of online shopping for branded
durable consumer goods (inspection required to some extent)
• 73 questions on factors related to internet shopping, products etc.
Input Variables (mixed scale of nominal, ordinal, interval)
• Online Shopping Factors: “Going to the shops is as convenient as Internet shopping”, “I would buy online if products are branded” etc. [1=strongly agree; …]
• Demographic Factors: Age, Gender, Income
• Internet Utility Factors: score from 6 correlated variables
(grouped in the diagram as Demographics, Internet-specific Factors and Online-shopping-specific Factors)

Models
• Logistic Regression
• Neural Networks

Output Variables
• Class 1: Browse Online & Buy Online
• Class 2: Browse Online & Buy Offline
• Class 3: Browse & Buy Offline

Imbalanced Classification problem
• Split of dataset into training, validation and test sets {50%; 25%; 25%}
• Distribution of target classes is skewed
{65% online buyers; 22.5% browsers; 12.5% offline shoppers}
• Rebalancing of data sets through over- & undersampling

[Figure: two bar charts of record counts (0 to 400) per class (Online Shoppers, Browsers, Offline Shoppers) for the imbalanced, oversampled and undersampled datasets, split by data subset (training, validation, test)]

Results without Discretisation

Logistic Regression
                          Training Data                Test Data
Dataset      True value   Online   Browse   Offline    Online   Browse   Offline
Original     Online       93.36     5.17     1.48      88.89     7.78     3.33
Imbalanced   Browser      62.77    23.40    13.83      49.39    22.58    29.03
             Offline      36.54    17.31    46.15      35.29    29.41    35.29
             MCRtrain = 54.3%, MCRtest = 48.9%
Under-       Online       57.69    30.77    11.54      64.44    23.33    12.22
Sampling     Browser      26.92    48.08    25.00      32.26    25.81    41.94
             Offline      17.31    21.15    61.54      29.41    35.29    35.29
             MCRtrain = 55.8%, MCRtest = 41.8%
Over-        Online       68.27    24.35     7.38      74.44    16.67     8.89
Sampling     Offline      16.97    19.93    63.10      29.41    29.41    41.18
             MCRtest = 48.2%

Neural Net
                          Training Data                Test Data
Dataset      True value   Online   Browse   Offline    Online   Browse   Offline
Original     Online       86.19    12.71     1.10      86.67     8.89     4.44
Imbalanced   Browser      53.13    31.25    15.63      41.94    35.48    22.58
             Offline      25.17    28.57    45.71      29.41    35.29    35.29
             MCRtrain = 54.4%, MCRtest = 52.5%
Under-       Online       44.86    40.00    17.14      27.78    58.89    13.33
Sampling     Browser      14.29    48.57    37.14      16.13    32.26    51.61
             Offline       8.57    20.00    71.43      11.76    41.18    47.06
             MCRtrain = 54.9%, MCRtest = 35.7%
Over-        Online       81.22    18.23     0.55      61.11    22.22    16.67
Sampling     Offline      15.52     0.55    99.45       0.00    11.76    88.24
             MCRtest = 75.6%

MCR = Mean Classification Rate (%)
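
The MCR values reported next to each table appear to be the mean of the per-class hit rates, i.e. the diagonal of the row-normalised confusion matrix. A short check on the logistic regression test results for the original imbalanced dataset, using the percentages from the table above (the function name is illustrative):

```python
def mean_classification_rate(confusion_rows):
    """Mean of the diagonal of a row-normalised confusion matrix (in %).

    confusion_rows[i][j] is the percentage of true class i predicted as class j.
    """
    diagonal = [row[i] for i, row in enumerate(confusion_rows)]
    return sum(diagonal) / len(diagonal)

# Logistic regression, original imbalanced dataset, test data
# Row and column order: Online, Browse, Offline
lr_original_test = [
    [88.89, 7.78, 3.33],
    [49.39, 22.58, 29.03],
    [35.29, 29.41, 35.29],
]

mcr = mean_classification_rate(lr_original_test)
print(round(mcr, 1))
```

Unlike overall accuracy, this measure weights every class equally, so a model cannot score well by predicting only the majority class.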

Results with Discretisation of Ordinal

Logistic Regression
                          Training Data                Test Data
Dataset      True value   Online   Browse   Offline    Online   Browse   Offline
Original     Online       91.51     6.64     1.85      85.56     7.78     6.67
Imbalanced   Browser      54.26    36.17     9.57      48.39    32.26    19.35
             Offline      26.92    17.31    55.77      58.82    47.62    17.65
             MCRtrain = 61.15%, MCRtest = 45.1%
Under-       Online       71.15    21.15     7.69      55.56    24.44    20.00
Sampling     Offline      15.38    11.54    73.08      58.82     0.00    41.18
             MCRtest = 34.4%
Over-        Online       68.63    22.88     8.49      70.0     21.11     8.89
Sampling     Offline      13.28    14.02    72.69      17.65    23.53    58.82
             MCRtest = 62.3%

Neural Net
                          Training Data                Test Data
Dataset      True value   Online   Browse   Offline    Online   Browse   Offline
Original     Online       96.13     3.87     0.00      84.44    11.11     4.44
Imbalanced   Browser      68.75    28.13     3.13      64.52    22.58    12.90
             Offline      40.00    14.29    45.17      58.82    11.76    29.41
             MCRtrain = 56.5%, MCRtest = 45.5%
Under-       Online       57.14    40.00     2.86      25.56    72.22     2.22
Sampling     Offline      14.29    31.43    54.29      52.94    17.65    29.41
             MCRtest = 28.0%
Over-        Online       98.34     1.10     0.55      58.89    24.44    16.67
Sampling     Offline       0.00     0.00   100.0        0.00     5.88    94.12
             MCRtest = 79.0%

MCR = Mean Classification Rate (%)

Summary

Oversampling outperforms other samplings
- Across Different Datasets
- Across various data preprocessing

Methods show different sensitivity to Sampling
- More variation from sampling, coding & scaling than between methods
- Using different preprocessing variants is important in modeling

Various sophisticated extensions exist
- SMOTE (Synthetic Minority Oversampling Technique)
- K-nearest Neighbor sampling (removal / creation)
- One-class learning etc. …
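
To give a feel for the first extension, here is a minimal sketch of the core SMOTE idea: instead of duplicating minority instances, create new ones by interpolating between a minority instance and one of its nearest minority neighbours. This is a simplified illustration, not the reference implementation; parameter names and the toy points are hypothetical.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolation between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Random position on the line segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Because the synthetic points lie between existing minority instances, the classifier sees a broader minority region than with exact duplicates.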

Extend your bag of tricks …
- … and experiment with imbalanced sampling!

Questions?

Sven F. Crone
Lancaster University Management School
Centre for Forecasting
Lancaster, LA1 4YX
email s.crone@lancaster.ac.uk

$SY_t = \alpha Y_{t-1} + (1 - \alpha)\, SY_{t-1}$

Exploring Innovation
“Online panel” vs. “online streaming/convenience” sampling

Unfortunately, the presenters of iVOX &
Corelio cannot share their presentation
with the BAQMaR community due to
reasons of confidentiality!
