Feature engineering is one of the most important, yet elusive, skills to master if you want to be a good data scientist. Machine learning competitions are hardly ever won with strong modeling techniques alone; it is the combination of creative feature engineering and powerful modeling techniques that makes the difference. This tutorial gives the audience practical tips and tricks to improve the performance of machine learning algorithms. We will look broadly at feature engineering for applied machine learning, touching on subjects such as categorical vs. numerical variables, data cleaning, feature extraction, transformations, and imputation.
2. Feature Engineering
• Better data beats big data
• Applied machine learning is data infrastructure, feature engineering, and modeling
• Feature engineering is turning your data into something a model understands
• Creativity, inquisitiveness, agility
4. One-hot Encoding
• Encode a categorical variable with k distinct values into a one-of-k array of size k
• Bag of words
• Works well for linear algorithms and neural networks
• "NL" -> one_of_k("NL") -> [0, 0, 0, 1] (sketch below)
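A minimal one-of-k sketch with pandas; the library choice and the toy country column are assumptions, not from the slides:

    import pandas as pd

    df = pd.DataFrame({"country": ["DE", "FR", "US", "NL"]})

    # get_dummies expands the k=4 unique values into 4 indicator columns,
    # sorted alphabetically; which slot is hot depends on that column order.
    one_hot = pd.get_dummies(df["country"]).astype(int)
    print(one_hot.loc[3].tolist())  # row for "NL" -> [0, 0, 1, 0]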
5. Hash Encoding
• Encode k values into a one-of-h array of size h, where h can be much smaller than k
• Collisions are possible (distinct values can share a slot)
• Fast and memory-friendly
• "NL" -> hash("NL") -> [0, 0, 1] (sketch below)
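A minimal sketch of the hashing trick in plain Python; h=3 matches the slide's three-slot example, and md5 stands in for whatever hash function is used (Python's built-in hash() is salted per process, so a stable hash is the safer choice):

    import hashlib

    def hash_encode(value, h=3):
        # Map a value to one of h slots; distinct values may collide.
        idx = int(hashlib.md5(value.encode()).hexdigest(), 16) % h
        out = [0] * h
        out[idx] = 1
        return out

    print(hash_encode("NL"))  # e.g. [0, 0, 1]; the hot slot depends on the hash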
6. Label Encoding
• Give each of k values a unique numerical ID
• Works well for tree-based algorithms
• Dimensionality-friendly: one column regardless of k
• "NL" -> unique_id("NL") -> [3] (sketch below)
7. Binary Encoding
• Uses the binary representation of the label ID
• Can encode over 4 billion categorical values into 32 bits
• "NL" -> binary(unique_id("NL")) -> [1, 0, 1, 1, 1, 1, 1] (sketch below)
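A minimal sketch; the ID 95 is chosen so the output reproduces the slide's example, and 7 bits are shown for brevity (32 bits cover over 4 billion IDs):

    def binary_encode(label_id, bits=7):
        # Write the label ID in base 2, most significant bit first.
        return [int(b) for b in format(label_id, f"0{bits}b")]

    print(binary_encode(95))  # [1, 0, 1, 1, 1, 1, 1]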
8. Count Encoding
• Replace a categorical value with its count in the train set
• Captures the popularity of the value
• "NL" -> count_in_train("NL") -> [5] (sketch below)
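A minimal sketch over a toy train column in which "NL" happens to appear five times, matching the slide's example:

    from collections import Counter

    train = ["NL", "US", "NL", "NL", "DE", "NL", "NL"]
    counts = Counter(train)

    print([counts["NL"]])  # [5]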
9. Rank Count Encoding
• Uniquely rank a variable's count in the train set
• Avoids collisions (tied counts) and dampens outliers
• "NL" -> rank(count_in_train("NL")) -> [7] (sketch below)
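A minimal sketch; breaking ties on the value itself is one way to make ranks unique, and the toy data (which yields a different rank than the slide's [7]) is an assumption:

    from collections import Counter

    train = ["NL"] * 5 + ["US"] * 3 + ["DE"] * 3 + ["FR"] * 2
    counts = Counter(train)

    # Sort by (count, value): "US" and "DE" tie on count but still get
    # distinct ranks, which is the collision-avoidance the slide mentions.
    ranked = sorted(counts, key=lambda v: (counts[v], v))
    rank = {v: i + 1 for i, v in enumerate(ranked)}

    print([rank["NL"]])  # [4]: the most frequent value gets the top rank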
10. Likelihood Encoding
• Replace a categorical value with the mean of the target for that value
• Take care to avoid overfitting, e.g., with out-of-fold means
• "NL" -> mean_of_target("NL") -> [0.66] (sketch below)
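A minimal sketch using out-of-fold target means, one common way to limit the overfitting the slide warns about; the toy frame and the column names are assumptions:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({
        "country": ["NL", "NL", "NL", "US", "US", "DE", "DE", "NL"],
        "target":  [1,    1,    0,    0,    1,    0,    0,    1],
    })

    df["country_lik"] = np.nan
    for tr, val in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
        # Each row's encoding is computed from the other folds only.
        fold_means = df.iloc[tr].groupby("country")["target"].mean()
        df.loc[df.index[val], "country_lik"] = df.iloc[val]["country"].map(fold_means)

    # Values unseen in a fold's train part fall back to the global mean.
    df["country_lik"] = df["country_lik"].fillna(df["target"].mean())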
11. Embedding Encoding
• Use a model to create a dense embedding of the values
• Faster and more memory-friendly than one-hot for high-cardinality variables
• "NL", "F" -> nn_embed(["NL", "F"]) -> [0.66, 0.71, 0.05] (sketch below)
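A minimal lookup sketch with PyTorch's nn.Embedding (the framework is an assumption; the slides do not name one). In practice the embedding weights are learned as part of a model trained against the target; here they are just the random initialization:

    import torch
    import torch.nn as nn

    vocab = {"NL": 0, "F": 1, "US": 2}
    embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)

    ids = torch.tensor([vocab["NL"], vocab["F"]])
    print(embed(ids))  # a 2x3 tensor: one 3-dim vector per input value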
24. Case Study
• Predict fraudsters from "name" and "email" form fields
• Expansion (sketch below)
• Temporal
• Aggregate statistics
• Randomness
• Interactions
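A sketch of what expanding the email field could look like; the specific features and the free-provider list are illustrative assumptions, not the tutorial's exact set:

    def email_features(email):
        local, _, domain = email.partition("@")
        return {
            "email_len": len(email),
            "local_has_digits": any(c.isdigit() for c in local),
            "domain": domain,  # can itself be fed to an encoder above
            "is_free_provider": domain in {"gmail.com", "hotmail.com"},
        }

    print(email_features("jane.doe123@gmail.com"))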
25. Conclusion
• Use XGBoost
• Label encode categorical variables
• Impute NaNs with -999, and set XGBoost's missing parameter to match (sketch below)
• Use subsampling to quickly test new variables
• Try everything until you reach a plateau or a deadline
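A minimal end-to-end sketch of this recipe with xgboost's scikit-learn API; the synthetic data and the subsample value are assumptions:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    X[X < 0.1] = np.nan                  # simulate missing values
    X = np.nan_to_num(X, nan=-999.0)     # impute NaNs with -999 ...
    y = rng.integers(0, 2, 100)

    # ... tell XGBoost to treat -999 as missing, and subsample rows so
    # new-variable experiments run quickly.
    model = xgb.XGBClassifier(missing=-999.0, subsample=0.5)
    model.fit(X, y)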