Data quality is more important than you think

DATA QUALITY IS MORE
IMPORTANT THAN YOU THINK
Amine BENDAHMANE
#DevFest Algiers 2019

Who am I?
• PhD candidate in Artificial Intelligence and Robotics
(Computer Vision, Swarm Optimization, Path Planning,
Deep Learning, Reinforcement Learning)
• Master's degree in Machine Learning & Pattern
Recognition
• Freelance: Web developer, ML engineer
• Part-time teacher

Machine Learning for research purposes
Solve a problem
Bring up new ideas
Create new models & algorithms
Adapt existing approaches to
new problems
Improve existing solutions
Change mathematical equations
Analyze different factors
Identify correlations


In real-world projects
• No data
• Not enough data
• Bad quality data
• Biased data

Data Engineering is harder than we think

Tips & tricks I learned the hard way
Next, let’s look at the lessons learned from these 4 projects:
1. Facial Expression Recognition
2. Image Generation
3. Vehicle Plate Recognition
4. Robotics Path Planning

Project 1: Facial Expression Recognition
• 2016
• Nextremer Co. (Tokyo)
• AI engineering intern
• Deep Learning

AI Samurai project (Nextremer Co.)
• To be deployed in a robot that runs on a Raspberry Pi
• The Raspberry Pi is also used for speech and
motion (head, arms)
• No TensorFlow Lite at the time
=> need a very small model (memory, CPU load)

• Fer2013 dataset
35,000 images (48x48px)
7 categories
Face landmarks
Code & results available at: https://github.com/amineHorseman/facial-expression-recognition-using-cnn

• Experiment 1
20,000 images
5 expressions
■ CNN
■ CNN + Face landmarks
■ CNN + Face landmarks + HOG + sliding window
Accuracies: 75.1%, 74.4%, 73.5%
• Couldn’t get better results, even with:
Dropout
Regularization
ReLU, LeakyReLU…
Batch Normalization
Hyperparameter optimization

• Experiment 2
35,000 images
7 expressions
■ SVM: 50.50%
■ Our best model: 61.40%
■ State of the art (8 CNNs): 75.20%
• Human accuracy: ~65%

• Issues in Fer2013:
(a) Incorrect labels
(b) Partially hidden faces
(c) Cartoon faces
(d) Black or empty images
• Human accuracy: ~65%

By correcting the labels, we can reach up to 88% accuracy!
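Some of these problems can be caught automatically before any relabeling. Below is a minimal sketch for issue (d), black or empty images, assuming each image is a list of rows of grayscale values (0 to 255). The names `is_blank` and `drop_blank` and the `min_std` threshold are illustrative assumptions, not part of the original project.

```python
from statistics import pstdev


def is_blank(pixels, min_std=5.0):
    """Return True when a grayscale image has almost no contrast.

    pixels: list of rows, each row a list of 0..255 values.
    min_std: hypothetical threshold, to be tuned on the real dataset.
    """
    flat = [value for row in pixels for value in row]
    return not flat or pstdev(flat) < min_std


def drop_blank(samples, min_std=5.0):
    """Keep only the (pixels, label) pairs whose image is not blank."""
    return [(px, label) for px, label in samples if not is_blank(px, min_std)]
```

Incorrect labels, hidden faces and cartoon faces are harder to detect this way and usually still need a manual review.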

Project 2: Image Generation
• 2016
• Nextremer Co. (Tokyo)
• Generating fake car images using DC-GAN
• No interesting dataset for commercial use
• No transfer learning
=> No other choice than to create our own dataset!

• Write scripts to:
■ Collect images from the internet using the Google & Flickr APIs
■ Transform the data: resizing, cropping, converting formats
■ Reorganize the dataset: renaming files, sorting them into folders, generating labels…
■ Code available at: https://github.com/amineHorseman/images-web-crawler
• Collecting 20,000 car images of 31 car models (~700 per model)
• Cleaning the data manually
At 5 seconds per image, it would take over 27 hours!
• The whole dataset creation process took 4 weeks
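The "reorganize" step can be sketched as follows. This is not the code from the repository linked above, only a minimal illustration that assumes the downloads are already grouped by car model in a dictionary. The function name `reorganize`, the input format and the `labels.csv` layout are all hypothetical.

```python
import csv
import shutil
from pathlib import Path


def reorganize(downloads, out_dir):
    """Copy files into one folder per class, rename them, write labels.csv.

    downloads: dict mapping a class name to a list of source file paths
    (hypothetical input format). Returns the (relative path, label) rows.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for label, paths in sorted(downloads.items()):
        class_dir = out_dir / label
        class_dir.mkdir(exist_ok=True)
        for index, src in enumerate(sorted(map(Path, paths)), start=1):
            # Uniform names: <label>_<5-digit index><lowercase extension>
            dst = class_dir / f"{label}_{index:05d}{src.suffix.lower()}"
            shutil.copyfile(src, dst)
            rows.append((dst.relative_to(out_dir).as_posix(), label))
    with open(out_dir / "labels.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "label"])
        writer.writerows(rows)
    return rows
```

Copying instead of moving keeps the raw downloads intact, so the manual cleaning pass can be redone if needed.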

• Training with 4,000 images
• Training with 20,000 images
• Training with 200,000 images (redundant images, non-cleaned dataset):
10x bigger, bad results
• Training time: 1 week
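Part of the redundancy in a crawled dataset can be removed cheaply before training. The sketch below drops byte-identical files by hashing their content. `unique_files` is an illustrative name, and this only catches exact copies: resized or re-encoded duplicates would need a different technique (for example a perceptual hash), which is not shown here.

```python
import hashlib
from pathlib import Path


def unique_files(paths):
    """Return the paths with byte-identical duplicates removed.

    The first file seen with a given content is kept; order is preserved.
    """
    seen = set()
    kept = []
    for path in map(Path, paths):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```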

Project 3: Vehicle Plate Recognition
• 2018
• Mostaganem
• Detect and localize plates
• Recognize the license plate number

For plate detection and localization:
• Collecting a dataset of 300 images from the internet
• Using data augmentation to generate a bigger dataset
• Using transfer learning on YOLOv3 and training
For serial number recognition:
• Creating a dataset of 2,000 numbers from vehicle license plates
• Using transfer learning from a model pretrained on MNIST
• Segmenting the number into separate digits and predicting each one


While moving to production:
• The client used a surveillance camera mounted high up, at an inclined angle
• The camera switches to B&W at night (CCTV)
• In the morning, the sun faces the camera, so everything goes black (backlight)
• The serial numbers come in different fonts and formats (different separators)
• The numbers dataset I created was biased (too many 2s and 7s, too few 5s and 8s)
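This kind of bias is easy to measure before training. The sketch below counts how often each digit appears in the plate labels and flags the ones that are far from a uniform 10% share. The function names and the `tolerance` value are illustrative assumptions, and a uniform target is itself an assumption, since real plates may not use all digits equally.

```python
from collections import Counter

DIGITS = "0123456789"


def digit_shares(plate_numbers):
    """Return each digit's share of all digits found in the labels."""
    counts = Counter(ch for number in plate_numbers for ch in number if ch in DIGITS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: counts.get(d, 0) / total for d in DIGITS}


def flag_imbalanced(shares, tolerance=0.5):
    """List digits whose share is off by more than `tolerance` (relative)
    from the uniform share of 10%. `tolerance` is a hypothetical setting."""
    expected = 1 / len(DIGITS)
    return sorted(d for d, s in shares.items() if abs(s - expected) > tolerance * expected)
```

For example, `flag_imbalanced(digit_shares(labels))` on the labels file gives the list of digits to collect more (or fewer) samples of.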

Other considerations during deployment:
• The client used a dual-core CPU! (predictions take 5x longer)
• Every time we retrain the model, we have to travel to the client’s site for
deployment (it has no internet access: it is in the mountains)

Clients don’t understand what the error rate means:
• 5% errors => 5 errors per 100 records.
• With 2,000 records a day, that is 100 errors that need to be manually
corrected!
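Turning the error rate into a daily workload makes the discussion with the client concrete. A minimal sketch of the arithmetic above, with illustrative function names:

```python
def manual_fixes_per_day(records_per_day, error_rate):
    """Expected number of records to correct by hand each day."""
    return round(records_per_day * error_rate)


def max_error_rate(records_per_day, fixes_budget):
    """Highest error rate that keeps daily manual fixes within a budget."""
    return fixes_budget / records_per_day
```

With the numbers from the slide, `manual_fixes_per_day(2000, 0.05)` gives 100. Going the other way, if the client can only afford 20 corrections a day, `max_error_rate(2000, 20)` gives 0.01, i.e. the model needs 99% accuracy.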

Project 4: Robot path planning
• Everything works well in simulation (ROS + Gazebo)
• But in the real experiments, the robots don’t behave as expected!

• It turns out that the laser and sonar sensors often return zero values (noise)
• Those noisy values affect the training
• We need to explicitly filter out those false readings before using ML models
(figure out a method to automatically filter unwanted values)
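The slides do not say which filtering method was finally used. One simple option is sketched below: mark every reading outside the sensor's valid range as missing, so that zeros are never fed to the model as real distances. The name `clean_scan` and the `min_range` / `max_range` defaults are hypothetical. In ROS, a LaserScan message carries its own `range_min` and `range_max` fields that could be used instead.

```python
import math


def clean_scan(ranges, min_range=0.05, max_range=10.0):
    """Replace invalid range readings with None.

    The list keeps its length, so each index still maps to the same beam.
    Zeros, NaN, infinities and out-of-range values count as invalid.
    """
    cleaned = []
    for r in ranges:
        valid = r is not None and math.isfinite(r) and min_range <= r <= max_range
        cleaned.append(r if valid else None)
    return cleaned


def valid_fraction(cleaned):
    """Share of readings that survived filtering (0.0 for an empty scan)."""
    if not cleaned:
        return 0.0
    return sum(r is not None for r in cleaned) / len(cleaned)
```

Tracking `valid_fraction` per scan also gives a quick signal for when a sensor is too noisy for the scan to be used at all.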

Summary
• Data quality is more important than we think
• Before trying to optimize your model, check how good your data is
• In commercial projects, data is often not available
• Creating a dataset is a tedious and time-consuming task
• A clean dataset may be better than a 10x larger raw dataset
• The data we get in production may not be the same as the data used in
training
• Pay extra attention to detecting bias in your data
