There is a deeply symbiotic relationship between machine learning (predictive modeling) and Big Data. Machine learning theory holds that more data is better. Empirical observations suggest that more granular data, a hallmark of Big Data, further improves performance. Predictive modeling is one of the core techniques that measurably delivers value across many industries and demonstrates the value of Big Data.
However, there is a surprising paradox of predictive modeling: when you need models most, even all the data is not enough or is just not suitable. The foundation of predictive modeling requires that you have enough training data with the respective outcomes, preferably IID (independent and identically distributed). But often this data is not available: there are only so many people buying luxury cars online to inform my targeting models. I can never observe what happens BOTH when I treat you AND when I don’t, which is what I need to make causal claims and measure the impact of strategic decisions. To allocate sales resources, I would love to know what a customer’s budget is, but maybe even the customer does not know.
So in this day and age of Big Data there remains an art to machine learning in situations where the right data is scarce. This talk will present a number of cases where enough of the right data is fundamentally not obtainable and how creative data science can still solve them.
3. Data for Predictive Modeling
Income    Age   Buy
123,000   30    yes
51,100    40    yes
68,000    55    no
74,000    46    no
23,000    47    yes
100,000   49    no
Features: Income, Age
Target: Buy
Examples: one per row
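The table on this slide can be split into a feature matrix and a target vector, which is the shape most modeling tools expect. The sketch below only restates the six rows from the slide; the variable names `rows`, `X`, and `y`, and the 1/0 encoding of yes/no, are illustrative assumptions and not part of the talk.

```python
# Rows copied from the slide: (income, age, buy).
rows = [
    (123_000, 30, "yes"),
    (51_100, 40, "yes"),
    (68_000, 55, "no"),
    (74_000, 46, "no"),
    (23_000, 47, "yes"),
    (100_000, 49, "no"),
]

# Features: one list [income, age] per example (row).
X = [[income, age] for income, age, _ in rows]

# Target: the "Buy" column, encoded as 1 for "yes" and 0 for "no"
# (this encoding is an assumption made for illustration).
y = [1 if buy == "yes" else 0 for _, _, buy in rows]
```

Each entry of `X` is one example, and the entry of `y` at the same position is its observed outcome.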
7. Wallet is NEVER observed
We observe this in the data: IBM Sales to this Company, Company Revenue (D&B)
But we do not observe this: Wallet/Opportunity
How can we make this a predictive modeling problem?
9. REALISTIC Wallets as quantiles
Motivation
Imagine 100 identical firms with identical IT needs
Consider the distribution of the IBM sales to these firms
Bottom firms should spend as much as the top
Define wallet as a high percentile of spending conditional on the customer attributes
[Figure: histogram of IBM Sales (x-axis) against Frequency (y-axis), with the Wallet Estimate marked]
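The quantile definition of wallet can be illustrated with a minimal sketch: group firms that share the same attributes and take a high percentile of observed sales within each group. This is a simplified stand-in for conditioning on customer attributes, not the method used in the talk. The segment labels, the 0.9 quantile level, the function names, and the synthetic sales figures are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def nearest_rank_quantile(values, q):
    """Return the q-quantile (0 < q <= 1) of values using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def wallet_by_segment(firms, q=0.9):
    """Estimate wallet per segment as the q-quantile of observed sales.

    firms is an iterable of (segment, sales) pairs. The segment is a
    hypothetical stand-in for "identical customer attributes".
    """
    sales_by_segment = defaultdict(list)
    for segment, sales in firms:
        sales_by_segment[segment].append(sales)
    return {
        segment: nearest_rank_quantile(sales, q)
        for segment, sales in sales_by_segment.items()
    }


# Synthetic data (not from the talk): 100 firms per segment.
random.seed(0)
firms = [("small", random.uniform(0, 100)) for _ in range(100)]
firms += [("large", random.uniform(0, 1000)) for _ in range(100)]

# Wallet estimate: what the top-spending firms in each segment spend.
wallets = wallet_by_segment(firms, q=0.9)
```

With continuous attributes, the grouping step would typically be replaced by quantile regression, which fits the same high percentile as a function of the attributes.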
28. How Big Data and Optimization is killing Metrics
90% of clicks are ‘accidental/non-intentional’
10% are meaningful, and changes can be measured
Optimization can find structure in the other 90%
You will end up with only non-intentional …