In this talk, Dmitry shares the approach to feature engineering that he has used successfully in various Kaggle competitions. He covers common techniques for converting features into the numeric representations that ML algorithms require.
3. Feature Engineering

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." ~ Andrew Ng

"... some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." ~ Pedro Domingos

"Good data preparation and feature engineering is integral to better prediction." ~ Marios Michailidis (KazAnova), Kaggle Grandmaster, Kaggle #3, former #1

"You have to turn your inputs into things the algorithm can understand." ~ Shayne Miel, answer to "What is the intuitive explanation of feature engineering in machine learning?"
36. Feature Encoding – Weight of Evidence

Feature  Outcome  WoE
A        1         0.4
A        0         0.4
A        1         0.4
A        1         0.4
B        1         0.74
B        1         0.74
B        0         0.74
C        1        -0.35
C        1        -0.35

Aggregated per category (2 non-events and 7 events in total; 0.5 is added to each count as a smoothing term):

WoE = ln( % of non-events / % of events )

Feature  Non-events  Events  % of non-events  % of events  WoE
A        1           3       50               42           ln( ((1+0.5)/2) / ((3+0.5)/7) ) =  0.4
B        1           2       50               29           ln( ((1+0.5)/2) / ((2+0.5)/7) ) =  0.74
C        0           2        0               29           ln( ((0+0.5)/2) / ((2+0.5)/7) ) = -0.35
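The table's WoE computation can be sketched in Python. The 0.5 added to each count is the smoothing term from the formulas above; the function and variable names here are illustrative, not from the talk:

```python
import math
from collections import Counter

# (category, outcome) pairs from the slide's example
data = [("A", 1), ("A", 0), ("A", 1), ("A", 1),
        ("B", 1), ("B", 1), ("B", 0),
        ("C", 1), ("C", 1)]

def weight_of_evidence(data, smoothing=0.5):
    """Per-category WoE: ln(smoothed non-event share / smoothed event share)."""
    events = Counter(cat for cat, y in data if y == 1)
    non_events = Counter(cat for cat, y in data if y == 0)
    total_events = sum(events.values())          # 7 in the example
    total_non_events = sum(non_events.values())  # 2 in the example
    categories = set(events) | set(non_events)
    return {
        cat: math.log(((non_events[cat] + smoothing) / total_non_events)
                      / ((events[cat] + smoothing) / total_events))
        for cat in categories
    }

woe = weight_of_evidence(data)
# matches the table's 0.4, 0.74, -0.35 up to rounding
```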
37. Feature Encoding – Weight of Evidence and Information Value

Feature  Outcome  WoE
A        1         0.4
A        0         0.4
A        1         0.4
A        1         0.4
B        1         0.74
B        1         0.74
B        0         0.74
C        1        -0.35
C        1        -0.35

Feature  Non-events  Events  % of non-events  % of events   WoE    IV
A        1           3       50               42             0.4   (0.50 - 0.42) * 0.40  = 0.032
B        1           2       50               29             0.74  (0.50 - 0.29) * 0.74 ≈ 0.155
C        0           2        0               29            -0.35  (0 - 0.29) * (-0.35) ≈ 0.102
                                                                   Total IV ≈ 0.289

IV = Σ (% non-events − % events) × WoE
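Putting the two slides together, the IV sum can be checked numerically. This is a sketch over the example counts; the variable names are illustrative:

```python
import math

# Per-category (non-event, event) counts from the slide's example
counts = {"A": (1, 3), "B": (1, 2), "C": (0, 2)}
TOTAL_NON_EVENTS = 2   # 1 + 1 + 0
TOTAL_EVENTS = 7       # 3 + 2 + 2

iv = 0.0
for cat, (n_non, n_evt) in counts.items():
    pct_non = n_non / TOTAL_NON_EVENTS            # unsmoothed shares, as in the IV column
    pct_evt = n_evt / TOTAL_EVENTS
    # WoE uses the 0.5 smoothing from the previous slide
    woe = math.log(((n_non + 0.5) / TOTAL_NON_EVENTS)
                   / ((n_evt + 0.5) / TOTAL_EVENTS))
    iv += (pct_non - pct_evt) * woe

# iv ≈ 0.29
```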
41. Feature Interaction – how to find?

• Domain knowledge
• ML algorithm behavior (for example, analyzing GBM splits or linear regression weights)

Feature Interaction – how to model?

• Clustering and kNN on numerical features
• Target encoding for pairs (or even triplets, etc.) of categorical features
• Encode categorical features by statistics of numerical features