DATA QUALITY IS MORE IMPORTANT THAN YOU THINK
Amine BENDAHMANE
#DevFest Algiers 2019
Who am I?
• PhD candidate in Artificial Intelligence and Robotics (Computer Vision, Swarm Optimization, Path Planning, Deep Learning, Reinforcement Learning)
• Master's degree in Machine Learning & Pattern Recognition
• Freelance: Web developer, ML engineer
• Part-time teacher
Machine Learning for research purposes
Solve a problem
Bring up new ideas
Create new models & algorithms
Adapt existing approaches to new problems
Improve existing solutions
Change mathematical equations
Analyze different factors
Identify correlations
Machine Learning process
In real-world projects
• No data
• Not enough data
• Bad-quality data
• Biased data
Data Engineering is harder than we think
Tips & tricks I learned the hard way
Lessons learned from these four projects:
1. Facial Expressions Recognition
2. Image Generation
3. Vehicle Plates Recognition
4. Robotics Path Planning
Project 1: Facial Expressions recognition
• 2016
• Nextremer Co. (Tokyo)
• AI engineering intern
• Deep Learning
Project 1: Facial Expressions recognition
AI Samurai project (Nextremer Co.)
• To be deployed on a robot that runs on a Raspberry Pi
• The Raspberry Pi is also used for speech and motion (head, arms)
• No TensorFlow Lite at the time
=> need a very small model (memory, CPU load)
Project 1: Facial Expressions recognition
• FER2013 dataset
35,000 images (48x48 px)
7 categories
Face landmarks
Code & results available at: https://github.com/amineHorseman/facial-expression-recognition-using-cnn
Project 1: Facial Expressions recognition
• Experiment 1
20,000 images
5 expressions
■ CNN
■ CNN + Face landmarks
■ CNN + Face landmarks + HOG + sliding window
Accuracies shown on the chart: 75.1%, 74.4%, 73.5%
• Couldn’t get better results despite trying:
Dropout
Regularization
ReLU, Leaky ReLU…
Batch Normalization
Hyperparameter optimization
Project 1: Facial Expressions recognition
• Experiment 2
35,000 images
7 expressions
■ State of the art (8 CNNs): 75.20%
■ Our best model: 61.40%
■ SVM: 50.50%
• Human accuracy: ~65%
Project 1: Facial Expressions recognition
• Problems found in FER2013:
(a) Incorrect labels
(b) Partially hidden faces
(c) Cartoon faces
(d) Black or empty images
• Human accuracy: ~65%
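Some of these problems can be screened automatically before training. The sketch below is not from the talk: it assumes each image is available as a flat list of grayscale intensities (0 to 255) and flags images with almost no contrast, which covers case (d). The function names and the `min_std` threshold are hypothetical and would need tuning on real data; incorrect labels, hidden faces, and cartoon faces still call for other checks or manual review.

```python
from statistics import pstdev


def is_blank(pixels, min_std=5.0):
    """Return True when a grayscale image has almost no contrast.

    `pixels` is a flat sequence of intensities in 0..255.
    `min_std` is an assumed threshold, not a value from the talk.
    """
    return len(pixels) == 0 or pstdev(pixels) < min_std


def split_blank(images, min_std=5.0):
    """Partition images into (kept, rejected) lists, preserving order."""
    kept, rejected = [], []
    for img in images:
        (rejected if is_blank(img, min_std) else kept).append(img)
    return kept, rejected


# Toy 2x2 "images": all black, uniform grey, and one with real contrast.
black = [0, 0, 0, 0]
grey = [128, 128, 128, 128]
face_like = [10, 200, 35, 180]

kept, rejected = split_blank([black, grey, face_like])
print(len(kept), len(rejected))  # prints "1 2"
```

A standard-deviation test is deliberately crude: it catches empty frames cheaply, so that manual review time can be spent on the harder cases.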
Project 1: Facial Expressions recognition
By correcting the labels, we can reach up to 88% accuracy!
Project 2: Image Generation
• 2016
• Nextremer Co. (Tokyo)
• Generating fake car images using DC-GAN
• No suitable dataset available for commercial use
• No transfer learning
=> No choice but to create our own dataset!
Project 2: Image Generation
• Write scripts to:
  • Collect images from the internet using the Google & Flickr APIs
  • Transform the data: resizing, cropping, converting formats
  • Reorganize the dataset: renaming files, sorting into folders, generating labels…
  • Code available at: https://github.com/amineHorseman/images-web-crawler
• Collecting 20,000 car images of 31 car models (~700 per model)
• Cleaning the data manually
At 5 seconds per image, it would take more than 27 hours!
• The whole dataset creation process took 4 weeks
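The manual cleaning estimate follows directly from the numbers on this slide. A quick check, where the function name is only illustrative:

```python
def manual_review_hours(n_images: int, seconds_per_image: float) -> float:
    """Total time, in hours, needed to look at every image once."""
    return n_images * seconds_per_image / 3600


hours = manual_review_hours(20_000, 5)
print(f"{hours:.1f} hours")  # prints "27.8 hours"
```

This counts only the time spent looking at images, with no breaks and no second pass.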
Project 2: Image Generation
• Training with 4,000 images
• Training with 20,000 images
• Training with 200,000 images (redundant images, non-cleaned dataset): 10x bigger, bad results, training time: 1 week
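Redundant images are one part of the problem that can be reduced automatically. The sketch below is an assumption, not the method used in the project: it keeps the first copy of every byte-identical file by comparing SHA-256 digests. It works on an in-memory mapping of hypothetical file names to contents so that it runs as-is; resized or re-encoded copies would not be caught and would need a perceptual comparison instead.

```python
import hashlib


def deduplicate(files):
    """Return the names of files whose content has not been seen before.

    `files` maps a file name to its raw bytes. Only exact, byte-identical
    duplicates are detected.
    """
    seen = set()
    unique = []
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


# Hypothetical crawl result: c.jpg is a byte-for-byte copy of a.jpg.
crawled = {
    "a.jpg": b"car-image-1",
    "b.jpg": b"car-image-2",
    "c.jpg": b"car-image-1",
}
print(deduplicate(crawled))  # prints "['a.jpg', 'b.jpg']"
```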
Project 3: Vehicle Plates recognition
• 2018
• Mostaganem
• Detect and localize plates
• Recognize the license plate number
Project 3: Vehicle Plates recognition
For plate detection and localization:
• Collecting a dataset of 300 images from the internet
• Using data augmentation to generate a bigger dataset
• Using transfer learning on YOLOv3 and training
For serial number recognition:
• Creating a dataset of 2,000 numbers from vehicle license plates
• Using transfer learning from a model pretrained on MNIST
• Segmenting the number into separate digits and predicting each one
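The slides do not say which augmentations were applied. As a minimal sketch under that caveat, the code below produces darker, brighter, and grayscale variants of an image, the kind of lighting change a fixed outdoor camera can run into. Images are assumed to be nested lists of `(r, g, b)` tuples, and the function names and brightness factors are hypothetical.

```python
def to_grayscale(image):
    """Convert RGB pixels to grey using the common 0.299/0.587/0.114 luma weights."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out


def scale_brightness(image, factor):
    """Multiply every channel by `factor`, clamping the result to 0..255."""
    return [
        [tuple(min(255, max(0, round(c * factor))) for c in px) for px in row]
        for row in image
    ]


def augment(image, factors=(0.3, 1.0, 1.8)):
    """Return a colour and a grayscale variant for each brightness factor."""
    variants = []
    for factor in factors:
        scaled = scale_brightness(image, factor)
        variants.append(scaled)
        variants.append(to_grayscale(scaled))
    return variants


tiny = [[(100, 150, 200)]]     # a 1x1 stand-in for a plate photo
print(len(augment(tiny)))      # prints "6"
```

Geometric changes such as rotation or perspective shifts would also require updating the bounding boxes used for detection, which is why they are left out here.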
Project 3: Vehicle Plates recognition
While porting to production:
• The client used a surveillance camera viewing from above at an inclined angle
• The camera switches to black and white at night (CCTV)
• In the morning, the sun faces the camera, so everything goes black (backlight)
• The serial numbers come in different fonts and formats (different separators)
• The numbers dataset I created was biased (too many 2s and 7s, too few 5s and 8s)
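A class imbalance like the one in the last bullet can be spotted before training by counting labels. The sketch below is illustrative only: the label list is synthetic (not the project's data), the function name is hypothetical, and the target of an equal share per digit with a 50% tolerance is an assumption that may not match how digits are distributed on real plates.

```python
from collections import Counter


def digit_balance(labels, tolerance=0.5):
    """Compare each digit's count with an equal share of the dataset.

    A digit is flagged when its count differs from the uniform share by
    more than `tolerance` (relative), e.g. 0.5 means more than 50% off.
    """
    counts = Counter(labels)
    expected = len(labels) / 10  # uniform share per digit (assumption)
    report = {}
    for digit in range(10):
        ratio = counts[digit] / expected if expected else 0.0
        if ratio > 1 + tolerance:
            report[digit] = "over-represented"
        elif ratio < 1 - tolerance:
            report[digit] = "under-represented"
        else:
            report[digit] = "ok"
    return report


# Synthetic labels: many 2s and 7s, few 5s and 8s.
labels = [2] * 30 + [7] * 30 + [5] * 2 + [8] * 2 + [0, 1, 3, 4, 6, 9] * 6
for digit, status in digit_balance(labels).items():
    if status != "ok":
        print(digit, status)
```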
Project 3: Vehicle Plates recognition
Other considerations during deployment:
• The client used a dual-core CPU! (predictions take 5x longer)
• Every time we retrain the model, we have to travel to the client’s office for deployment (the site is in the mountains and has no internet access)
The clients don’t understand what an error rate means:
• 5% errors => 5 errors per 100 records.
• With 2,000 records a day, that is 100 errors that need to be corrected manually!
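The same arithmetic can be run in both directions when discussing requirements with a client: from an error rate to a daily workload, and from an acceptable workload back to a target error rate. The function names and the budget of 10 corrections per day are hypothetical; the 2,000 records and 5% figures come from the slide.

```python
def daily_corrections(records_per_day: int, error_rate_percent: float) -> float:
    """Expected number of records per day that need a manual fix."""
    return records_per_day * error_rate_percent / 100


def max_error_rate_percent(records_per_day: int, correction_budget: int) -> float:
    """Highest error rate that keeps manual fixes within the daily budget."""
    return 100 * correction_budget / records_per_day


print(daily_corrections(2000, 5))        # prints "100.0"
print(max_error_rate_percent(2000, 10))  # prints "0.5"
```

Stated this way, a 95% accurate model and a target of 10 corrections per day are a factor of ten apart.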
Project 4: Robot path planning
• Everything works well in simulation (ROS + Gazebo)
• But in the real experiments, the robots don’t behave as expected!
• It turns out that the laser and sonar sensors often return zero values (noise)
• Those noisy values affect the training
• We need to explicitly filter out those false readings before using ML models (figure out a method to automatically filter unwanted values)
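The talk leaves the filtering method open. One simple option, shown here purely as an assumption, is to treat any reading outside a plausible range as invalid and either drop it or hold the last valid value. The range limits (0.05 m to 10 m) and the function names are hypothetical and depend on the actual sensors.

```python
import math


def is_valid(reading, min_range=0.05, max_range=10.0):
    """A reading is valid when it is finite and inside the sensor's range."""
    return math.isfinite(reading) and min_range <= reading <= max_range


def drop_invalid(readings, min_range=0.05, max_range=10.0):
    """Keep only the valid readings (the output may be shorter)."""
    return [r for r in readings if is_valid(r, min_range, max_range)]


def hold_last_valid(readings, min_range=0.05, max_range=10.0):
    """Replace invalid readings with the previous valid one.

    Invalid readings at the start stay as None because there is
    nothing to hold yet.
    """
    cleaned = []
    last = None
    for r in readings:
        if is_valid(r, min_range, max_range):
            last = r
        cleaned.append(last)
    return cleaned


scan = [0.0, 1.2, 0.0, 1.3, float("inf"), 1.1]
print(drop_invalid(scan))     # prints "[1.2, 1.3, 1.1]"
print(hold_last_valid(scan))  # prints "[None, 1.2, 1.2, 1.3, 1.3, 1.1]"
```

Holding the last value keeps the input length fixed, which matters when a model expects one value per beam; dropping is simpler when readings are only aggregated.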
Summary
• Data quality is more important than we think
• Before trying to optimize your model, check how good your data is
• In commercial projects, data is often not available
• Creating a dataset is a tedious and time-consuming task
• A clean dataset may be better than a 10x larger raw dataset
• The data we get in production may not be the same as the data used in training
• Pay extra attention to detecting bias in your data
THANK YOU
