Data quality is important for machine learning applications. Sometimes the data matters more than the model. In these slides I present some tips & tricks I learned during real-world projects that I took to production.
1. DATA QUALITY IS MORE IMPORTANT THAN YOU THINK
Amine BENDAHMANE
#DevFest Algiers 2019
2. Who am I?
• PhD candidate in Artificial Intelligence and Robotics
(Computer Vision, Swarm Optimization, Path Planning,
Deep Learning, Reinforcement Learning)
• Master's degree in Machine Learning & Pattern
Recognition
• Freelance: Web developer, ML engineer
• Part-time teacher
3. Machine Learning for research purposes
Solve a problem
Bring up new ideas
Create new models & algorithms
Adapt existing approaches to
new problems
Improve existing solutions
Change mathematical equations
Analyze different factors
Identify correlations
9. Tips & tricks I learned the hard way
Let’s look at the lessons learned from these 4 projects:
1. Facial Expressions Recognition
2. Image Generation
3. Vehicle Plates Recognition
4. Robotics Path Planning
10. Project 1: Facial Expressions recognition
• 2016
• Nextremer Co. (Tokyo)
• AI engineering intern
• Deep Learning
11. Project 1: Facial Expressions recognition
AI Samurai project (Nextremer Co.)
• To be deployed on a robot that runs on a Raspberry Pi
• The Raspberry Pi is also used for speech and
motion (head, arms)
• No TensorFlow Lite at the time
=> need a very small model (memory, CPU load)
14. Project 1: Facial Expressions recognition
• Experiment 1: 20,000 images, 5 expressions
■ CNN
■ CNN + face landmarks
■ CNN + face landmarks + HOG + sliding window
Chart values: 75.1%, 74.4%, 73.5%
• Experiment 2: 35,000 images, 7 expressions
■ State of the art (8 CNNs): 75.20%
■ Our best model
■ SVM
Other chart values: 50.50%, 61.40%
• Human accuracy: ~65%
15. Project 1: Facial Expressions recognition
• FER2013
(a) Incorrect labels
(b) Partially hidden faces
(c) Cartoon faces
(d) Black or empty images
• Human accuracy: ~65%
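Of the four problems listed, black or empty images (d) are the easiest to catch automatically. The sketch below is one possible filter, not the method used in the project: it assumes each image arrives as a space-separated string of gray levels (the layout of the public FER2013 CSV file), and both the helper names and the `min_std` threshold are assumptions made for illustration. Wrong labels, occluded faces, and cartoon faces still need manual review or a dedicated method.

```python
import statistics


def parse_pixels(pixel_string):
    """Turn a space-separated string of gray levels into a list of ints."""
    return [int(value) for value in pixel_string.split()]


def is_blank(pixels, min_std=5.0):
    """Return True when an image is empty or nearly uniform (e.g. all black).

    min_std is an assumed threshold on the standard deviation of the gray
    levels; tune it on a few hand-checked samples before trusting it.
    """
    if not pixels:
        return True
    return statistics.pstdev(pixels) < min_std


# Toy 2x2 "images" standing in for real 48x48 rows (hypothetical data).
samples = {
    "black": "0 0 0 0",
    "almost_black": "0 1 0 2",
    "face_like": "12 200 95 34",
    "empty": "",
}

kept = [name for name, px in samples.items() if not is_blank(parse_pixels(px))]
print(kept)  # ['face_like']
```

A low standard deviation also flags uniformly gray or white images, which is usually what we want here.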
17. Project 1: Facial Expressions recognition
By correcting the labels, we can reach up to 88% accuracy!
18. Project 2: Image Generation
• 2016
• Nextremer Co. (Tokyo)
• Generating fake car images using DC-GAN
• No interesting dataset for commercial use
• No transfer learning
=> No other choice than to create
our own dataset!
19. Project 2: Image Generation
• Write scripts to:
Collect images: from the internet using the Google & Flickr APIs
Transform the data: resizing, cropping, converting formats
Reorganize the dataset: renaming data, classifying into folders, generating labels…
Code available at: https://github.com/amineHorseman/images-web-crawler
• Collecting 20,000 car images of 31 car models (~700 per model)
• Cleaning the data manually
At 5 seconds per image, this takes more than 27 hours!
• The whole dataset creation process took 4 weeks
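The cleaning estimate is simple arithmetic, and it is worth running before committing to a manual pass. A minimal sketch follows; the function name and the 8-hour working day are assumptions added for illustration, while the 20,000 images and 5 seconds per image come from the slide.

```python
def manual_review_hours(num_images, seconds_per_image):
    """Hours needed to look at every image once."""
    return num_images * seconds_per_image / 3600


hours = manual_review_hours(20_000, 5)
print(f"{hours:.1f} hours of review")  # 27.8 hours of review

# Assumption: 8 hours of uninterrupted review per working day.
print(f"{hours / 8:.1f} working days")  # 3.5 working days
```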
21. Project 2: Image Generation
Training with 4,000 images
Training with 20,000 images
Training with 200,000 images (redundant images, uncleaned dataset): 10x bigger, bad results
Training time: 1 week
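Redundant images are one part of the problem that a script can reduce before any manual cleaning. The sketch below is not the project's code: it only finds files whose bytes are strictly identical, using a SHA-256 digest, and the function names are made up for illustration. Resized or re-encoded copies of the same picture will not match and would need a perceptual comparison instead.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Map each first-seen file to the later files with identical bytes."""
    first_seen = {}
    duplicates = {}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        key = file_digest(path)
        if key in first_seen:
            duplicates.setdefault(first_seen[key], []).append(path)
        else:
            first_seen[key] = path
    return duplicates


# Usage (hypothetical folder name):
# for original, copies in find_exact_duplicates("dataset/cars").items():
#     print(original, "has", len(copies), "identical copies")
```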
22. Project 3: Vehicle Plates recognition
• 2018
• Mostaganem
• Detect and localize plates
• Recognize the license plate number
24. Project 3: Vehicle Plates recognition
For plate detection and localization:
• Collecting a dataset of 300 images from the internet
• Using data augmentation to generate a bigger dataset
• Using transfer learning on YOLOv3 and training
For serial number recognition:
• Creating a dataset of 2,000 numbers from vehicle license plates
• Using an MNIST-pretrained model with transfer learning
• Segmenting the number into separate digits and predicting each one
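The slides do not say which augmentations were applied to the 300 images. As one hedged illustration, brightness changes are a convenient choice for a detection dataset because they leave the plate positions untouched, so the bounding-box labels can be reused as they are. The sketch assumes grayscale images stored as nested lists of 0-255 values; the function names, the factor range, and the toy image are all hypothetical.

```python
import random


def adjust_brightness(image, factor):
    """Scale every gray level by factor and clip to the 0-255 range."""
    return [
        [min(255, max(0, round(pixel * factor))) for pixel in row]
        for row in image
    ]


def augment(image, copies=4, low=0.4, high=1.6, seed=0):
    """Return several darker or brighter variants of one image.

    low and high are assumed bounds for the brightness factor; the seed
    makes the generated variants reproducible.
    """
    rng = random.Random(seed)
    return [adjust_brightness(image, rng.uniform(low, high)) for _ in range(copies)]


# Toy 2x2 image (hypothetical data).
image = [[0, 100], [200, 255]]
variants = augment(image)
print(len(variants))  # 4
```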
26. Project 3: Vehicle Plates recognition
While porting to production:
• The client used a surveillance camera looking down from above at an inclined angle
• The camera switches to B&W at night (CCTV)
• In the morning the sun faces the camera, so everything goes black (backlight)
• The serial numbers come in different fonts and formats (different separators)
• The numbers dataset I created was biased (too many 2s and 7s, too few 5s and 8s)
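A bias like the one in the digits dataset is cheap to detect before training by counting the labels. The following is a minimal sketch under stated assumptions: the function names, the 50% tolerance, and the toy label counts are invented for illustration, and a uniform share per digit is only a starting point. The better reference is the distribution actually observed in production.

```python
from collections import Counter


def class_shares(labels):
    """Fraction of the dataset taken by each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def flag_imbalance(labels, classes, tolerance=0.5):
    """Return the classes whose share is far from a uniform share.

    A class is flagged when its share differs from 1/len(classes) by more
    than tolerance times that uniform share (assumed rule of thumb).
    """
    shares = class_shares(labels)
    expected = 1 / len(classes)
    flagged = {}
    for label in classes:
        share = shares.get(label, 0.0)
        if abs(share - expected) > tolerance * expected:
            flagged[label] = share
    return flagged


# Hypothetical label counts: many 2s and 7s, few 5s and 8s.
labels = list("2" * 30 + "7" * 30 + "5" * 2 + "8" * 2 + "013469" * 6)
flagged = flag_imbalance(labels, classes="0123456789")
print(sorted(flagged))  # ['2', '5', '7', '8']
```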
27. Project 3: Vehicle Plates recognition
Other considerations during deployment:
• The client's machine had a dual-core CPU! (predictions take 5x longer)
• Every time we retrain the model, we have to travel to the client’s office to
deploy it (the site is in the mountains and has no internet access)
28. Project 3: Vehicle Plates recognition
Clients don’t always understand what an error rate means:
• 5% errors => 5 errors per 100 records.
• With 2,000 records a day, that is 100 errors a day that need to be corrected
manually!
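Translating the error rate into a daily workload makes the discussion with the client concrete. The sketch below reproduces the arithmetic from the slide and also runs it backwards; the function names and the budget of 10 manual corrections per day are hypothetical values chosen for the example.

```python
def expected_errors(records_per_day, error_percent):
    """Number of records per day that will need a manual correction."""
    return records_per_day * error_percent / 100


def max_error_percent(records_per_day, corrections_per_day):
    """Highest error rate (in %) that fits a daily correction budget."""
    return 100 * corrections_per_day / records_per_day


print(expected_errors(2000, 5))     # 100.0
# Hypothetical budget: the client accepts 10 manual corrections a day.
print(max_error_percent(2000, 10))  # 0.5
```

With these numbers, the model would need 99.5% accuracy rather than 95% to meet such a budget.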
29. Project 4: Robot path planning
• Everything works well in simulation (ROS + Gazebo)
• But in real experiments, the robots don’t behave as expected!
30. Project 4: Robot path planning
• It turns out that the laser and sonar sensors often return zero values (noise)
• Those noisy values affect the training
• We need to explicitly filter those false readings before using ML models
(figure out a method to automatically filter unwanted values)
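One simple way to filter such readings is to reject anything outside the sensor's valid range and then patch the gaps. This is a sketch under stated assumptions, not the method used in the project: the function names and the default limits of 0.1 m and 10 m are illustrative, and in ROS the real limits would normally be read from the `range_min` and `range_max` fields of the `LaserScan` message. Filling a gap with the previous valid reading is only one option; dropping the sample is another.

```python
import math


def is_valid(value, range_min, range_max):
    """A reading is valid when it is finite and inside the sensor range."""
    return math.isfinite(value) and range_min <= value <= range_max


def clean_scan(ranges, range_min=0.1, range_max=10.0):
    """Replace zeros, inf, NaN and out-of-range readings with None."""
    return [r if is_valid(r, range_min, range_max) else None for r in ranges]


def fill_gaps(cleaned):
    """Replace each None with the last valid reading before it.

    Leading gaps have no previous reading and stay None.
    """
    filled = []
    last = None
    for value in cleaned:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical scan containing the kinds of false readings described above.
scan = [0.0, 1.5, 0.0, 2.0, float("inf"), float("nan"), 12.0]
cleaned = clean_scan(scan)
print(cleaned)             # [None, 1.5, None, 2.0, None, None, None]
print(fill_gaps(cleaned))  # [None, 1.5, 1.5, 2.0, 2.0, 2.0, 2.0]
```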
32. Summary
• Data quality is more important than we think
• Before trying to optimize your model, check how good your data is
• In commercial projects, we often have no data readily available
• Creating a dataset is a tedious and time-consuming task
• A clean dataset may be better than a 10x larger raw dataset
• The data we get in production may differ from the data used for
training
• Pay extra attention to detecting bias in your data