DA 592 Project
Kaggle Grupo Bimbo Contest
Berker Kozan & Can Köklü
24/09/2016
Overview
Project Description
● Kaggle Contest
● Grupo Bimbo: Mexican company of
fresh bakery products.
● Products are shipped from storage
facilities to stores.
● The following week, unsold products
are returned.
● Need to predict the correct demand
for shipping to stores.
Why Did We Pick This Project?
Why Kaggle?
● Test our “Data Science” abilities in an
international field
● Kaggle forum
● Clean data and clear goal
● More time for feature engineering and
modelling
Why This Project?
● Very common problem
● Chance to work with a very large dataset
● Deadline of the competition (30 August)
Tools
● Python 2.7
● GitHub
● Jupyter Notebook and PyCharm (integrated with GitHub)
● NLTK
● XGBoost
● Pickle, HDF5
● Scikit-learn, NumPy, SciPy
● Garbage Collector
Platforms
● Ubuntu (16 GB RAM)
● MacBook Pro (16 GB RAM)
● EC2 Instance on Amazon (100 GB RAM, 16-core
CPU)
○ $150 for 2 days and an extra $50 for backup
● Google Cloud (100 GB RAM, 16-core CPU)
○ $50 for 1 day
● Google Cloud Preemptible (208 GB RAM, 32-core
CPU)
○ $60 for 3 days
○ Linux command line, connecting with SSH
○ One problem!
Data Provided
● Train.csv (3.2 GB)
○ the training set, which covers weeks 3-9
● Test.csv (251 MB)
○ the test set, which covers weeks 10-11
● Sample_submission.csv (69 MB)
○ a sample submission file in the correct format
● cliente_tabla.csv
○ client names (can be joined with train/test on Cliente_ID)
● producto_tabla.csv
○ product names (can be joined with train/test on Producto_ID)
● town_state.csv
○ town and state (can be joined with train/test on Agencia_ID)
Data Fields
Demanda (Target Variable)
● Mean: 7
● Median: 3
● Max: 5,000
● 75% of the data is between 0 and 6
● Right-skewed
● This explains why the evaluation metric is
“RMSLE”
● Before modelling, take the log of the target
variable (log(variable+1))
● Before submitting, take the exponential
(exp(variable)-1)
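The transform and its inverse can be sketched with NumPy's `log1p` and `expm1`, which compute log(x+1) and exp(x)-1. The demand values below are made up for illustration and are not taken from the competition data.

```python
import numpy as np

# Made-up demand values (illustrative only).
demand = np.array([0, 3, 7, 5000], dtype=np.float64)

# Before modelling: log(variable + 1).
log_demand = np.log1p(demand)

# Before submitting: exp(variable) - 1 maps predictions back to units.
restored = np.expm1(log_demand)
```

A model trained on `log_demand` predicts in log space, so its output has to go through `expm1` before it is written to the submission file.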
Evaluation Criteria
● The evaluation metric is Root
Mean Squared Logarithmic
Error.
● Public and Private Scores
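For reference, RMSLE is the square root of the mean of (log(prediction+1) - log(actual+1))^2. A minimal NumPy sketch follows; the function name `rmsle` is our own choice, not something from the competition kit.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

perfect = rmsle([0, 3, 7], [0, 3, 7])   # identical inputs give zero error
off = rmsle([3.0], [7.0])               # |log(8) - log(4)| = log(2)
```

Because the metric compares logs, minimizing squared error on a log(variable+1) target is the same as minimizing RMSLE on the original target.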
Dealing with the Large Data
To optimize RAM use and speed up XGBoost:
● Set the data type of each column explicitly
● Converted integers to unsigned types
● Reduced the precision of floating-point values as much as
possible
Memory usage was reduced from 6.1 GB to 2.1 GB.
An alternative approach would have been to read and
process the data in chunks.
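A minimal sketch of explicit typing with pandas, assuming a tiny in-memory stand-in for Train.csv. The rows, the `Semana` and `Demanda` column names, and the chosen type widths are assumptions for illustration; real widths have to be sized to each column's actual range.

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for the training file (illustrative rows only).
csv_text = """Semana,Agencia_ID,Cliente_ID,Producto_ID,Demanda
3,1110,15766,1212,3
3,1110,15766,1216,4
4,1110,15766,1238,4
"""

# Explicit unsigned types, sized to each column's assumed value range.
dtypes = {
    "Semana": np.uint8,
    "Agencia_ID": np.uint16,
    "Cliente_ID": np.uint32,
    "Producto_ID": np.uint16,
    "Demanda": np.uint16,
}

default_df = pd.read_csv(io.StringIO(csv_text))              # integers load as int64
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

default_bytes = default_df.memory_usage(index=False).sum()
compact_bytes = compact_df.memory_usage(index=False).sum()

# The alternative mentioned above: stream the file in chunks.
chunk_rows = sum(len(chunk) for chunk in
                 pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2))
```

On this toy frame the typed version needs a fraction of the default memory; the 6.1 GB to 2.1 GB figure above is the authors' measurement on the real data, not something this sketch reproduces.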
Building Models
Model 1 - Naive Prediction
We first built a naive approach that does not use machine learning.
● Group the training data by Product ID, Client ID, Agency ID and Route ID and
take the mean of demand
● If this specific grouping doesn’t exist, fall back to the product’s mean demand
● If this doesn’t exist either, simply take the overall mean of demand
This method scored 0.73 on the public data when submitted.
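The three-level fallback can be sketched with pandas group-bys and left merges. The data is made up, and `Ruta_ID` and `Demanda` are placeholder column names for the route and the target.

```python
import pandas as pd

train = pd.DataFrame({
    "Producto_ID": [1, 1, 1, 2],
    "Cliente_ID":  [10, 10, 11, 10],
    "Agencia_ID":  [100, 100, 100, 100],
    "Ruta_ID":     [7, 7, 7, 7],
    "Demanda":     [2.0, 4.0, 6.0, 10.0],
})
test = pd.DataFrame({
    "Producto_ID": [1, 1, 3],
    "Cliente_ID":  [10, 99, 10],
    "Agencia_ID":  [100, 100, 100],
    "Ruta_ID":     [7, 7, 7],
})

full_key = ["Producto_ID", "Cliente_ID", "Agencia_ID", "Ruta_ID"]

# Level 1: mean demand for the exact product/client/agency/route combination.
by_full = train.groupby(full_key)["Demanda"].mean().rename("pred_full").reset_index()
# Level 2: mean demand per product.
by_product = (train.groupby("Producto_ID")["Demanda"].mean()
                   .rename("pred_product").reset_index())
# Level 3: overall mean demand.
overall = train["Demanda"].mean()

pred = (test.merge(by_full, on=full_key, how="left")
            .merge(by_product, on="Producto_ID", how="left"))
# Use the most specific estimate available for each test row.
pred["Demanda"] = pred["pred_full"].fillna(pred["pred_product"]).fillna(overall)
```

The first test row matches a full grouping, the second falls back to the product mean, and the third (an unseen product) gets the overall mean.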
Model 2 - NLTK Based Modelling
Feature Engineering
We used the NLTK library to extract the following information from the Producto file.
● Weight: in grams
● Pieces
● Brand Name: extracted from a three-letter acronym
● Short Name: extracted from the Product Name field. We first removed the
Spanish “stop words” and then applied stemming
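As a rough illustration of the parsing step, here is a sketch that uses only regular expressions; it is not the NLTK pipeline described above and leaves out stop-word removal and stemming. The name layout is assumed from the example "Pan Blanco 460g WON 100", the "p" suffix for pieces is an assumption, and `parse_product_name` is a hypothetical helper.

```python
import re

def parse_product_name(name):
    """Split a product name into short name, weight (g), pieces and brand."""
    weight = re.search(r"(\d+)\s*g\b", name)          # e.g. "460g"
    pieces = re.search(r"(\d+)\s*p\b", name)          # e.g. "6p" (assumed suffix)
    brand = re.search(r"\b([A-Z]{3})\s+\d+$", name)   # three-letter acronym before the ID
    # Short name: everything before the first token that starts with a digit.
    short = re.split(r"\s+\d", name, maxsplit=1)[0]
    return {
        "short_name": short.lower(),
        "weight_g": int(weight.group(1)) if weight else None,
        "pieces": int(pieces.group(1)) if pieces else None,
        "brand": brand.group(1) if brand else None,
    }

parsed = parse_product_name("Pan Blanco 460g WON 100")
```

Names that do not follow this layout simply yield `None` for the missing fields, which a tree model such as XGBoost can treat as missing values.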
Modeling
1. Separate x and y (label) of train;
2. Delete train’s columns which don’t exist in test data;
3. Put the train and test columns in the same order and append test to train
vertically;
4. Merge table with Product Table;
5. Use “count-vectorizer” of Scikit-learn on brand and short_name columns to
create sparse count-word matrices;
6. Use the output of count-vectorizer to create dummy variables;
7. Separate appended train and test data;
8. Train XGBoost with default parameters on train data and predict test data.
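Steps 5 to 7 can be sketched with Scikit-learn's `CountVectorizer` and SciPy's sparse `hstack`. The toy values are made up, and `binary=True` is our shortcut for turning counts into 0/1 dummy columns; the slides do not say how the dummies were derived. The XGBoost training of step 8 is left out.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the merged brand / short_name columns (train and test stacked).
brand = ["WON", "WON", "BIM"]
short_name = ["pan blanco", "pan integral", "pan blanco"]
numeric = np.array([[460.0], [567.0], [640.0]], dtype=np.float32)  # e.g. weight

# binary=True makes each vocabulary column a 0/1 indicator.
brand_vec = CountVectorizer(binary=True)
name_vec = CountVectorizer(binary=True)
brand_m = brand_vec.fit_transform(brand)        # sparse, one column per brand
name_m = name_vec.fit_transform(short_name)     # sparse, one column per word

# Stack all blocks column-wise without converting to a dense array.
features = sparse.hstack([sparse.csr_matrix(numeric), brand_m, name_m], format="csr")

n_train = 2   # number of rows that came from train before test was appended
x_train, x_test = features[:n_train], features[n_train:]
```

Fitting the vectorizers on the stacked train and test rows is what guarantees that both parts end up with the same columns after the split.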
Technical Problems
● Garbage Collection
○ We had to delete unused objects and trigger garbage collection manually to free
their memory
● Data size because of sparsity
○ 70+ million rows and 577 columns would need ~161 GB in memory
○ We solved this problem by using sparse matrices from the SciPy library, which brought memory use down to 5 GB
○ The “COO” sparse format stores only the values that differ from 0
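A small sketch of the points above. The ~161 GB figure is consistent with 4-byte values (that width is our assumption), and the 3x3 matrix is a made-up example of what COO stores.

```python
import gc
import numpy as np
from scipy import sparse

# Dense size estimate, assuming 4 bytes per value: 70 million rows x 577 columns.
dense_gb = 70 * 10**6 * 577 * 4 / 1e9       # about 161.6 GB

# Small dense matrix that is mostly zeros.
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [5, 0, 0]], dtype=np.float32)

# COO keeps three parallel arrays: row index, column index, value.
coo = sparse.coo_matrix(dense)
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

# Drop the dense copy and ask the collector to run right away.
del dense
gc.collect()
```

Only the two non-zero cells are stored, so memory grows with the number of non-zeros rather than with rows times columns.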
Score and Conclusion
RMSLE scores were as follows:
Validation   Test week 10   Test week 11
0.764        0.775          0.781
These scores were worse than the naive approach, so we started to think about a
new model.
Model 3 - Comprehensive
Digging Deeper Data Exploration
Train - Test difference
● We compared the products, clients, agencies and routes that appear in train
with those that appear in test
● There were 9663 clients, 34 products, 0 agencies and 1012 routes in test that
don’t exist in train data.
● The important outcome of this analysis: we should build a general
model that can handle new products, clients and routes which exist in
test data but not in train data.
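This kind of check can be sketched as a set difference per ID column. The frames below are made up; on the real files the same loop would yield counts like the ones reported above.

```python
import pandas as pd

train = pd.DataFrame({"Cliente_ID": [1, 2, 3], "Producto_ID": [10, 11, 11]})
test = pd.DataFrame({"Cliente_ID": [2, 3, 4, 5], "Producto_ID": [11, 11, 10, 12]})

# For each ID column, count distinct test values that never occur in train.
unseen = {}
for col in ["Cliente_ID", "Producto_ID"]:
    unseen[col] = len(set(test[col].unique()) - set(train[col].unique()))
```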
Feature Engineering - 1
● Agencia
○ The Agencia file shows each agency’s town ID and state name. We merged this file with train and
test data on the Agencia_ID column and encoded the state column as integers.
● Producto
○ We used the weight and pieces features from the NLTK model. In addition, we included
the product’s short name and brand ID.
○ An example of a product’s short name and brand ID can be seen below.
○ 100,Pan Blanco 460g WON 100
200,Pan Blanco 567g WON 200
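The Agencia merge and integer encoding can be sketched with `DataFrame.merge` and `pandas.factorize`. The `Town` and `State` column names and all values are placeholders, not the contents of town_state.csv.

```python
import pandas as pd

train = pd.DataFrame({"Agencia_ID": [1110, 1111, 1112, 1110]})
town_state = pd.DataFrame({
    "Agencia_ID": [1110, 1111, 1112],
    "Town": ["town a", "town b", "town c"],      # placeholder values
    "State": ["state x", "state x", "state y"],  # placeholder values
})

# Attach town and state to every row through the agency ID.
merged = train.merge(town_state, on="Agencia_ID", how="left")

# factorize assigns 0, 1, 2, ... in order of first appearance.
merged["State_ID"], state_labels = pd.factorize(merged["State"])
```

The same `factorize` call would work for turning a brand acronym into a brand ID.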
Feature Engineering - 2
We want to predict how many units of a product are sold at a client supplied by a given agency.
● Why don’t we look at past sales of this product at this
client from this agency?
● If that doesn’t exist, why don’t we look at past sales of this product at
this client?
Let’s try this logic with the product’s short name and brand ID as well.
We named these features: Lag0, Lag1, Lag2 and Lag3
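One way to build such lag features is sketched below, assuming that "Lag k" means the mean demand of the same product/client/agency key k weeks earlier. The slides do not define Lag0 to Lag3 precisely, so this definition, the helper `add_lag`, the `Demanda` column name and the data are all assumptions.

```python
import pandas as pd

sales = pd.DataFrame({
    "Semana":      [3, 4, 5, 5],
    "Producto_ID": [1, 1, 1, 2],
    "Cliente_ID":  [10, 10, 10, 10],
    "Agencia_ID":  [100, 100, 100, 100],
    "Demanda":     [2.0, 4.0, 6.0, 9.0],
})
key = ["Producto_ID", "Cliente_ID", "Agencia_ID"]

def add_lag(frame, history, key, lag):
    """Attach the mean demand of the same key `lag` weeks earlier as Lag<lag>."""
    past = (history.groupby(["Semana"] + key)["Demanda"].mean()
                   .rename("Lag%d" % lag).reset_index())
    past["Semana"] = past["Semana"] + lag    # align week w-lag with week w
    return frame.merge(past, on=["Semana"] + key, how="left")

features = sales[["Semana"] + key]
for lag in (1, 2, 3):
    features = add_lag(features, sales, key, lag)
```

Rows with no history keep NaN in the lag columns. The fallback to a coarser key (product and client only, short name, or brand) would repeat the same merge with a shorter `key` and a different column name.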
Feature Engineering - 3
Other features:
● Total $ amount of a client/product name/product ID
● Total units sold by a client/product name/product ID
● Price per unit of a client/product name/product ID
● Ratio of goods sold by client/product name/product ID
● Clients per town
● Sum of returns of a product
Total Data Size : 10.6 GB
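The per-client totals can be sketched with a pandas named aggregation. The `units` and `amount` columns are hypothetical names, the data is made up, and reading "ratio of goods sold" as a client's share of all units is our interpretation.

```python
import pandas as pd

sales = pd.DataFrame({
    "Cliente_ID": [10, 10, 11],
    "Producto_ID": [1, 2, 1],
    "units": [2, 3, 5],             # hypothetical column: units sold
    "amount": [20.0, 45.0, 50.0],   # hypothetical column: sales amount
})

# Totals per client; the same pattern applies per product name or product ID.
per_client = sales.groupby("Cliente_ID").agg(
    total_amount=("amount", "sum"),
    total_units=("units", "sum"),
).reset_index()

per_client["price_per_unit"] = per_client["total_amount"] / per_client["total_units"]
# Share of all units sold that each client accounts for (assumed meaning of "ratio").
per_client["unit_ratio"] = per_client["total_units"] / sales["units"].sum()
```

These aggregates are then merged back onto train and test by `Cliente_ID`, in the same way as the lag features.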
Validation Technique
● We built 2 separate models, one for week 10 and one for week 11
● We did not include the “Lag1” variable in the model that predicts week 11
● We dropped the first 3 weeks after the feature engineering phase
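A sketch of how this setup might look in code. Week 11 has no Lag1 because week 10 demand is unknown at prediction time. The feature list, the reason for dropping weeks 3-5 (no complete 3-week history), and the use of weeks 8-9 for validation are assumptions made for illustration.

```python
import pandas as pd

all_features = ["Lag1", "Lag2", "Lag3", "weight"]   # illustrative feature list

# The week 10 model may use every lag; the week 11 model cannot use Lag1.
features_week10 = list(all_features)
features_week11 = [f for f in all_features if f != "Lag1"]

weeks = pd.DataFrame({"Semana": [3, 4, 5, 6, 7, 8, 9]})

# Drop the first 3 weeks, which lack a full lag history.
usable = weeks[weeks["Semana"] >= 6]

# Time-based split: earlier weeks for training, later weeks for validation.
train_part = usable[usable["Semana"] < 8]
valid_part = usable[usable["Semana"].between(8, 9)]
```

Splitting by week rather than at random keeps validation close to the real task, where the model always predicts weeks it has never seen.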
XGBoost & Parameter Tuning
Why did we pick XGBoost?
● Boosting Tree Algorithm
● Both Regression and Classification
● Compiled C++ code
● Multi-Thread
Parameter Tuning:
● Max depth
● Subsample
● ColSample
● Learning Rate
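A small grid over these four parameters can be generated with `itertools.product`. The keys are XGBoost's own parameter names; the candidate values are illustrative, since the slides do not list the values that were tried.

```python
import itertools

# Candidate values are illustrative only.
grid = {
    "max_depth": [6, 10],
    "subsample": [0.7, 1.0],
    "colsample_bytree": [0.7, 1.0],
    "eta": [0.05, 0.1],           # XGBoost's name for the learning rate
}

names = sorted(grid)
# One dict per combination, ready to be passed as an XGBoost params dict.
candidates = [dict(zip(names, values))
              for values in itertools.product(*(grid[n] for n in names))]
```

Each candidate would be trained and scored on the validation weeks, keeping the combination with the lowest RMSLE.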
Technical Problems
● Storing Data
○ Picked HDF5 over pickle and csv
● Memory and CPU
○ Max 32-core CPU, 75 GB RAM
● Code Reuse and Automation
○ Object Oriented Programming with Python
○ Most of the work was automated. For example:
parameterDict = { "ValidationStart": 8, "ValidationEnd": 9, "maxLag": 3,
                  "trainHdfPath": '../../input/train_wz.h5',
                  "testHdfPath1": "../../input/test1_wz.h5".. }...
ConfigElements(1, [ ("SPClR0_mean", ["Producto_ID", "Cliente_ID", "Agencia_SAK"], ["mean"]),
                    ("SPCl_mean", ["Producto_ID", "Cliente_ID"], ["mean"])...
Model Comparison
Model                            Validation 1  Validation 2  Public Score  Private Score
Naive                            0.736         -             0.734         0.754
NLTK                             0.764         -             0.775         0.781
XGBoost with default parameters  0.476226      0.498475      0.46949       0.49596
XGBoost with parameter tuning    0.469628      0.489799      0.46257       0.48666
Final Score
Looking Back…
Critical Mistakes
● Poor data exploration
● Not preparing for system outages
● Performing hyperparameter tuning
too late
… and forward
Further Exploration
● Partial Fitting
● Multiple Models
● Neural Networks
DA 592 - Term Project Presentation - Berker Kozan Can Koklu - Kaggle Contest

  • 1. DA 592 Project Kaggle Grupo Bimbo Contest Berker Kozan & Can Köklü 24/09/2016
  • 3. Project Description ● Kaggle Contest ● Grupo Bimbo: Mexican company of fresh bakery products. ● Products are shipped from storage facilities to stores. ● The following week, unsold products are returned. ● Need to predict the correct demand for shipping to stores.
  • 4. Why Did We Pick This Project? Why Kaggle? ● Test our “Data Science” abilities in an international field ● Kaggle forum ● Clean data and clear goal ● More time for feature engineering and modelling Why This Project? ● Very common problem ● Chance to work with a very large dataset ● Deadline of the competition (30 August)
  • 5. Tools ● Python 2.7 ● Github ● Jupyter Notebook and Pycharm (integrated with Github) ● NLTK ● XGBoost ● Pickle, HDF5 ● Scikit-learn, NumPy, SciPy ● Garbage Collector
  • 6. Platforms ● Ubuntu (16 GB RAM) ● Macbook Pro (16 GB RAM) ● EC2 Instance on Amazon (100 GB RAM, 16 core CPU) ○ 150$ for 2 days and extra 50$ for backup ● Google Cloud (100 GB RAM, 16 core CPU) ○ 50$ for 1 day ● Google Cloud Preemptible (208 GB RAM, 32 core CPU) ○ 60$ for 3 days ○ Linux command line, connecting with SSH ○ One problem!
  • 7. Data Provided ● Train.csv (3.2 GB) ○ the training set which includes week 3-9 ● Test.csv (251 MB) ○ the test set which includes week 10-11 ● Sample_submission.csv (69 MB) ○ a sample submission file in the correct format ● cliente_tabla.csv ○ client names (can be joined with train/test on Cliente_ID) ● producto_tabla.csv ○ product names (can be joined with train/test on Producto_ID) ● town_state.csv ○ town and state (can be joined with train/test on Agencia_ID)
  • 9. Demanda (Target Variable) ● Mean: 7 ● Median: 3 ● Max: 5,000 ● %75 of data is between 0-6 ● Right-skewed ● This explains why evaluation metric is “RMSLE” ● Before modelling, log target variable (log(variable+1)) ● Before submitting, take exponential (exp(variable)-1)
  • 10. Evaluation Criteria ● The evaluation metric is Root Mean Squared Logarithmic Error. ● Public and Private Scores
  • 11. Dealing with the Large Data To optimize RAM use and speed up XGBoost performance: ● Forced type of the data explicitly ● Converted integers to unsigned ones ● Decreased the accuracy of floating points as much as possible Memory usage is reduced from 6.1 GB to 2.1 GB. An alternative approach would have been reading and processing in chunks.
  • 13. Model 1 - Naive Prediction We first decided to create a naive approach without using Machine Learning. ● Group training data on Product ID, Client ID, Agency ID and Route ID and took mean of demand ● If this specific grouping doesn’t exist, default back to product’s mean demand ● If this doesn’t exist either, simply take the mean of demand This method resulted in a score of 0.73 on public data when submitted.
  • 14. Model 2 - NLTK Based Modelling
  • 15. Feature Engineering We utilized NLTK library to extract following information from the Producto file. ● Weight: In grams ● Pieces ● Brand Name: Extracted through a three letter acronym ● Short Name: Extracted from the Product Name field. We first removed the Spanish “stop words” and then used the stemming
  • 16. Modeling 1. Separate x and y (label) of train; 2. Delete train’s columns which don’t exist in test data; 3. Match the order of train and test column orders same and append test to train vertically; 4. Merge table with Product Table; 5. Use “count-vectorizer” of Scikit-learn on brand and short_name columns to create sparse count-word matrices; 6. Use the output of count-vectorizer to create dummy variables; 7. Separate appended train and test data; 8. Train XGBoost with default parameters on train data and predict test data.
  • 17. Technical Problems ● Garbage Collection ○ We had to remove unused objects and force garbage collection manually to free memory ● Data size because of sparsity ○ 70+ million rows and 577 columns in memory would need ~161 GB ○ We solved this problem by using sparse matrices from the SciPy library, which brought memory use down to 5 GB ○ The “COO” sparse format stores only the entries that differ from 0
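A small sketch of the COO format, plus the arithmetic behind the dense estimate. The 4-byte value size is an assumption (the slides do not state the dtype); with it, 70 million rows by 577 columns comes to roughly 161 GB.

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix, like the word-count dummy columns.
dense = np.array([
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 3],
], dtype=np.float32)

# COO keeps three parallel arrays: row index, column index, value.
coo = sparse.coo_matrix(dense)
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

# Dense size estimate, assuming 4 bytes per value (assumption).
rows, cols, bytes_per_value = 70_000_000, 577, 4
dense_gb = rows * cols * bytes_per_value / 1e9
```

Only three of the twelve cells are stored here; on the real data the saving grows with the share of zeros.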
  • 18. Score and Conclusion RMSLE scores were as follows: ● Validation: 0.764 ● Test (week 10): 0.775 ● Test (week 11): 0.781 These scores were worse than the naive approach, so we started to think about a new model.
  • 19. Model 3 - Comprehensive
  • 20. Digging Deeper Data Exploration Train - Test difference ● We analyzed the products, clients, agencies and routes which exist in test but not in train ● There were 9663 clients, 34 products, 0 agencies and 1012 routes that don’t exist in the train data. ● The important outcome of this analysis was that we should build a general model that can handle new products, clients and routes which appear in the test data but not in the train data.
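This kind of check can be sketched with set differences. The toy frames and the helper name `unseen_count` are assumptions for illustration.

```python
import pandas as pd

# Toy frames; values and column names are assumptions.
train = pd.DataFrame({"Cliente_ID": [1, 2, 3],
                      "Producto_ID": [10, 11, 11],
                      "Ruta_SAK": [5, 5, 6]})
test = pd.DataFrame({"Cliente_ID": [2, 3, 4],
                     "Producto_ID": [11, 12, 12],
                     "Ruta_SAK": [5, 6, 6]})

def unseen_count(train_df, test_df, column):
    """Number of distinct values of `column` present in test but not in train."""
    return len(set(test_df[column].unique()) - set(train_df[column].unique()))

new_counts = {col: unseen_count(train, test, col)
              for col in ["Cliente_ID", "Producto_ID", "Ruta_SAK"]}
```

A non-zero count for a column means the model cannot rely on identifier-specific history alone for that column.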
  • 21. Feature Engineering - 1 ● Agencia ○ The Agencia file shows each agency’s town id and state name. We merged this file with the train and test data on the Agencia_ID column and encoded the state column into integers. ● Producto ○ We used the weight and pieces features from the NLTK model. In addition, we included the short name of the product and the brand id. ○ Two products that share a short name and a brand can be seen below. ○ 100,Pan Blanco 460g WON 100 ○ 200,Pan Blanco 567g WON 200
  • 22. Feature Engineering - 2 We want to predict how many units of a product, shipped from a given agency, are sold at a client. ● Why not look at the past sales of this product at this client from this agency? ● If that doesn’t exist, why not look at the past sales of this product at this client? Let’s apply the same logic to the product’s short name and brand id. We named these features: Lag0, Lag1, Lag2 and Lag3
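One way to build such a feature is sketched below, assuming that “Lag k” means the mean demand of the same group k weeks earlier. The slides do not define the lags precisely, so the helper `add_lag`, the column names and the data are all hypothetical.

```python
import pandas as pd

# Toy sales history; values and column names are assumptions.
sales = pd.DataFrame({
    "Semana": [8, 8, 9, 9],
    "Producto_ID": [1, 1, 1, 1],
    "Cliente_ID": [10, 11, 10, 11],
    "Agencia_ID": [100, 100, 100, 200],
    "Demanda": [4.0, 6.0, 5.0, 7.0],
})

def add_lag(df, keys, lag, name):
    """Attach the group's mean demand from `lag` weeks earlier as column `name`."""
    past = df.groupby(["Semana"] + keys, as_index=False)["Demanda"].mean()
    # Shift the week forward so week w's value lines up with week w + lag.
    past["Semana"] = past["Semana"] + lag
    past = past.rename(columns={"Demanda": name})
    return df.merge(past, on=["Semana"] + keys, how="left")

fine = ["Producto_ID", "Cliente_ID", "Agencia_ID"]   # product + client + agency
coarse = ["Producto_ID", "Cliente_ID"]               # product + client only

out = add_lag(sales, fine, 1, "lag1_fine")
out = add_lag(out, coarse, 1, "lag1_coarse")
# Prefer the most specific history; fall back to the coarser grouping.
out["lag1"] = out["lag1_fine"].fillna(out["lag1_coarse"])
```

In week 9, client 11 is served by a new agency, so the fine-grained lag is missing and the product/client lag fills the gap.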
  • 23. Feature Engineering - 3 Other features: ● Total $ amount of a client/product name/product id ● Total units sold by a client/product name/product id ● Price per unit of a client/product name/product id ● Ratio of goods sold by client/product name/product id ● Clients per town ● Sum of returns of a product Total data size: 10.6 GB
  • 24. Validation Technique ● We built 2 separate models, one for week 10 and one for week 11 ● We did not include the “Lag1” variable in the model that predicts week 11, since week 10 demand is not known at prediction time ● We dropped the first 3 weeks after the feature engineering phase
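A rough sketch of this setup is given below. The choice of weeks 8 and 9 for validation, the helper `split_for_validation` and the feature values are assumptions for illustration, not the authors' exact procedure.

```python
import pandas as pd

# One toy row per week; feature values are placeholders (assumption).
features = pd.DataFrame({
    "Semana": [3, 4, 5, 6, 7, 8, 9],
    "lag1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "lag2": [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "Demanda": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

def split_for_validation(df, first_usable_week, validation_weeks, drop_columns=()):
    """Drop early weeks and unwanted columns, then split by week."""
    usable = df[df["Semana"] >= first_usable_week].drop(columns=list(drop_columns))
    is_valid = usable["Semana"].isin(validation_weeks)
    return usable[~is_valid], usable[is_valid]

# Week 10 model: all lags available; weeks 3-5 dropped.
train_w10, valid_w10 = split_for_validation(features, 6, [8, 9])
# Week 11 model: same split, but without the lag1 column.
train_w11, valid_w11 = split_for_validation(features, 6, [8, 9],
                                            drop_columns=["lag1"])
```

Splitting by week rather than at random keeps the validation set in the future relative to the training rows, which mirrors how the test weeks relate to the training weeks.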
  • 25. XGBoost & Parameter Tuning Why did we pick XGBoost? ● Boosted tree algorithm ● Supports both regression and classification ● Compiled C++ code ● Multi-threaded Parameter Tuning: ● Max depth ● Subsample ● ColSample ● Learning Rate
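The four tuned parameters map onto XGBoost's `max_depth`, `subsample`, `colsample_bytree` and `eta` (learning rate). The sketch below enumerates a small grid of candidate settings; the value ranges are hypothetical, and the slides do not say whether a grid search was actually used.

```python
from itertools import product

# Hypothetical candidate values; the slides do not list the ones tried.
search_space = {
    "max_depth": [6, 10],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.7, 1.0],
    "eta": [0.05, 0.1],
}

names = list(search_space)
# One parameter dict per combination of candidate values.
candidates = [dict(zip(names, values))
              for values in product(*search_space.values())]
```

Each dict in `candidates` could then be passed as the parameter set for one training run and scored on the validation weeks.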
  • 26. Technical Problems ● Storing Data ○ Picked HDF5 over pickle and CSV ● Memory and CPU ○ Up to 32 CPU cores and 75 GB RAM ● Code Reuse and Automation ○ Object-oriented programming with Python ○ Most of the work was automated. For example: parameterDict = { "ValidationStart":8, "ValidationEnd":9, "maxLag":3, "trainHdfPath":'../../input/train_wz.h5', "testHdfPath1":"../../input/test1_wz.h5".. }... ConfigElements(1,[ ("SPClR0_mean",["Producto_ID", "Cliente_ID", "Agencia_SAK"], ["mean"]), ("SPCl_mean", ["Producto_ID", "Cliente_ID"], ["mean"])...
  • 27. Model Comparison Columns: Model, Validation 1, Validation 2, Public Score, Private Score ● Naive: 0.736, 0.734, 0.754 ● NLTK: 0.764, 0.775, 0.781 ● XGBoost with default parameters: 0.476226, 0.498475, 0.46949, 0.49596 ● XGBoost with parameter tuning: 0.469628, 0.489799, 0.46257, 0.48666
  • 30. Looking Back… Critical Mistakes ● Poor data exploration ● Not preparing for system outages ● Performing hyperparameter tuning too late
  • 31. … and forward Further Exploration ● Partial Fitting ● Multiple Models ● Neural Networks