The volume of data is growing by the day, and online retailers are struggling to keep up in this turbulent big-data environment. Billions of products are listed by various online retailers, and the depth of product information varies from retailer to retailer. Mining unstructured data at this scale is a grueling task. With thousands of new products added every minute, the competitive landscape changes very quickly, and it is daunting for retailers to keep track of the competition and make informed decisions about their business. The core of the problem is building a knowledge base of ‘Matching Products’ across retailers, which is a very complex problem to solve. This session is about our efforts to apply Machine Learning on Big Data to solve this problem.
* ML in product classification
* NLP/ML in Attribute Extraction
* CV in Image processing
3. Private & Confidential
Ugam is a data and analytics company helping leading corporations to improve business decisions
• Analytics applications
• Analytics services
• E-commerce operations
• 17 years
Manufacturer
Distributor
B2C B2B
Manufacturer
Retailer
Volume
• Amazon: 562 million in 2018, 372 million in 2017
• ~20K every hour
Variety
• Every retailer has a different site-cat-path
• Photo, video, social, mobile
Velocity
• Periodic, near real time, real time
Structure
• Unstructured data representation
• Schema varies per retailer
• 200K+ Categories
• 500K+ Brands
• 800K+ Attributes
• 8M+ Sellers
• 400 Million Products
Processing Performance
Curse of Modularity
Class Imbalance
Curse of Dimensionality
Feature Engineering
Heterogeneity & Noise
Cleaning → Deduping → Classification → Attribution → Compression → Matching
What - How - Why
The Holy Grail
Retailers
• Price Intelligence & Optimization
• Assortment Intelligence
• Product Content solutions
• Analytics for Merchandising & Marketing Decisions
Brands
• Dynamic Pricing
• MAP Monitoring
Data Aggregation → Data Synthesis → Data Analysis → Data Delivery
Category Research
Hierarchical Classification
Multiclass Linear SVM
Convolutional NN
Ensemble
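As an illustration of hierarchical classification, here is a minimal top-down sketch: one classifier picks the top-level category, then a per-parent classifier picks the sub-category. The two-level taxonomy, the training titles, and the `TokenScorer` class are hypothetical; `TokenScorer` is a tiny bag-of-words scorer standing in for the multiclass linear SVM or CNN named on the slide, not the model actually used.

```python
from collections import Counter, defaultdict

class TokenScorer:
    """Tiny linear bag-of-words scorer (stand-in for a linear SVM or CNN)."""

    def __init__(self):
        self.weights = defaultdict(Counter)  # label -> token -> count

    def fit(self, titles, labels):
        for title, label in zip(titles, labels):
            self.weights[label].update(title.lower().split())
        return self

    def predict(self, title):
        toks = title.lower().split()
        # sorted() makes tie-breaking deterministic.
        return max(sorted(self.weights),
                   key=lambda lab: sum(self.weights[lab][t] for t in toks))

# Hypothetical training data: title -> (top-level category, sub-category).
TRAIN = [
    ("gaming laptop 16gb ram", ("Electronics", "Laptops")),
    ("ultrabook laptop 13 inch", ("Electronics", "Laptops")),
    ("wireless bluetooth headphones", ("Electronics", "Audio")),
    ("noise cancelling headphones", ("Electronics", "Audio")),
    ("leather tote handbag", ("Clothing", "Handbags & Luggage")),
    ("hard shell luggage suitcase", ("Clothing", "Handbags & Luggage")),
    ("black stiletto high heel shoe", ("Clothing", "Shoes")),
    ("casual flat shoe", ("Clothing", "Shoes")),
]

def train_hierarchy(examples):
    # Root classifier separates top-level categories.
    root = TokenScorer().fit([t for t, _ in examples],
                             [path[0] for _, path in examples])
    # One child classifier per top-level category, trained on its subset only.
    children = {}
    for parent in {path[0] for _, path in examples}:
        subset = [(t, path[1]) for t, path in examples if path[0] == parent]
        children[parent] = TokenScorer().fit([t for t, _ in subset],
                                             [lab for _, lab in subset])
    return root, children

def classify(root, children, title):
    parent = root.predict(title)
    return parent, children[parent].predict(title)

root, children = train_hierarchy(TRAIN)
print(classify(root, children, "black leather handbag"))
```

Each child classifier only has to separate the sub-categories of one parent, which keeps every individual problem small even when the full taxonomy is very large.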
Original Data Set
Multiple Data sets: D1, D2, ..., Dn-1, Dn
Multiple Classifiers: C1, C2, ..., Cn-1, Cn
Combining Classifiers
Bootstrap Aggregating for improved performance
Clothing
Laptops
Electronics
Toys
Handbags & Luggage
Health
Beauty
Antiques
Kitchen
Miscellaneous
Personal care
Baby
Ensemble
⅀
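The bootstrap aggregating (bagging) flow above can be sketched as follows: draw n bootstrap samples D1..Dn from the original data set, train one classifier C1..Cn per sample, and combine them by majority vote. The training titles and the token-count base classifier are hypothetical stand-ins chosen to keep the sketch dependency-free; they are not the models from the talk.

```python
import random
from collections import Counter

# Hypothetical labeled product titles (the original data set).
DATA = [
    ("gaming laptop 16gb", "Laptops"),
    ("ultrabook laptop 13 inch", "Laptops"),
    ("business laptop ssd", "Laptops"),
    ("budget laptop student", "Laptops"),
    ("leather handbag tote", "Handbags & Luggage"),
    ("canvas handbag small", "Handbags & Luggage"),
    ("designer handbag red", "Handbags & Luggage"),
    ("travel handbag large", "Handbags & Luggage"),
    ("wooden toy train", "Toys"),
    ("plush toy bear", "Toys"),
    ("toy robot kit", "Toys"),
    ("toy car set", "Toys"),
]

def bootstrap_sample(data, rng):
    # Sample with replacement, same size as the original data set.
    return [rng.choice(data) for _ in data]

def train(sample):
    # Base classifier: per-label token counts.
    model = {}
    for title, label in sample:
        model.setdefault(label, Counter()).update(title.split())
    return model

def predict_one(model, title):
    toks = title.split()
    return max(sorted(model), key=lambda lab: sum(model[lab][t] for t in toks))

def bagging_fit(data, n_models, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [train(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, title):
    # Combine classifiers by majority vote.
    votes = Counter(predict_one(m, title) for m in models)
    return votes.most_common(1)[0][0]

models = bagging_fit(DATA, n_models=25)
print(bagging_predict(models, "leather handbag"))
```

Because each classifier sees a slightly different resampling of the data, the vote averages out individual mistakes, which is the "improved performance" the slide refers to.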
• Black Shoe
• Black Pointed-toe stiletto
• Black High Heel
• Black studded leather pointed-toe Christian Louboutin 6” gilded heel stiletto for night out
Category Research
Text Attributes: CNN, Sequence Labeling
Image Feature Extraction: CNN
Type: Casual
Heel Height: 0.5 Inch
Heel Type: Flat
Material: PVC
ASIN: B077BMVXLQ
Brand: Footsoul
Managed Attributes
Unmanaged Attributes
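To make the idea of sequence labeling for text attributes concrete, here is a minimal sketch that tags each title token with a BIO label and then collects the tagged spans into attributes. The tagger is a dictionary lookup against hypothetical gazetteers, used only as a stand-in for a trained sequence labeler or CNN; the title is modeled on the shoe example from this slide.

```python
import re

# Hypothetical gazetteers for illustration; a trained sequence labeler
# would learn these patterns from annotated titles instead.
GAZETTEER = {
    "Brand": [("christian", "louboutin"), ("footsoul",)],
    "Color": [("black",), ("red",)],
    "Material": [("leather",), ("pvc",)],
}

def tokenize(title):
    # Lowercase word tokens; hyphenated words such as "pointed-toe" stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())

def bio_tag(tokens):
    """Assign B-/I-/O tags: B- starts a span, I- continues it, O is outside."""
    tags = ["O"] * len(tokens)
    for label, phrases in GAZETTEER.items():
        for phrase in phrases:
            n = len(phrase)
            for i in range(len(tokens) - n + 1):
                window_free = all(t == "O" for t in tags[i:i + n])
                if tuple(tokens[i:i + n]) == phrase and window_free:
                    tags[i] = "B-" + label
                    tags[i + 1:i + n] = ["I-" + label] * (n - 1)
    return tags

def collect_attributes(tokens, tags):
    """Turn BIO-tagged tokens into {attribute: value}, keeping the first span per label."""
    attributes = {}
    span_label, span_tokens = None, []
    # The trailing ("", "O") sentinel flushes a span that ends the title.
    for token, tag in list(zip(tokens, tags)) + [("", "O")]:
        if tag.startswith("I-") and tag[2:] == span_label:
            span_tokens.append(token)
            continue
        if span_label is not None:
            attributes.setdefault(span_label, " ".join(span_tokens))
        if tag.startswith("B-"):
            span_label, span_tokens = tag[2:], [token]
        else:
            span_label, span_tokens = None, []
    return attributes

title = "Black studded leather pointed-toe Christian Louboutin stiletto"
tokens = tokenize(title)
tags = bio_tag(tokens)
print(list(zip(tokens, tags)))
print(collect_attributes(tokens, tags))
```

A learned model would replace `bio_tag` while the span-collection step stays the same, so unseen brands or materials could still be tagged from context.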
Pre-classified input image
Info bundle delivered through Image Processing API
Object identification, image clustering
• Libraries/Functions: TensorFlow, Keras (CNN), Caffe
• Data used for training: Internal product database, CIFAR-100, CIFAR-10
Foreground extraction / edge & contour
• Libraries/Functions: OpenCV, Keras (CNN)
• Data used for training: KITTI vision benchmark, GTI image database
Template matching / brand detection
• Libraries/Functions: Keras, OpenCV
• Data used for training: Internal product database, CIFAR-100, KITTI, Gait dataset
Text/Color extraction
• Libraries/Functions: TensorFlow, Tesseract, OpenCV
• Data used for training: Internal product database, CIFAR-10, CIFAR-100
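As a rough illustration of template matching for brand detection, the sketch below slides a small template over a grayscale image and keeps the position with the lowest sum of squared differences. It is written in plain Python on hypothetical toy pixel grids so it runs without dependencies; in practice a library routine such as OpenCV's `cv2.matchTemplate` with the `cv2.TM_SQDIFF` method computes this kind of score far more efficiently.

```python
def match_template(image, template):
    """Return ((row, col), score) of the best match by sum of squared differences."""
    image_h, image_w = len(image), len(image[0])
    templ_h, templ_w = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    # Try every top-left position where the template fits inside the image.
    for r in range(image_h - templ_h + 1):
        for c in range(image_w - templ_w + 1):
            score = sum(
                (image[r + dr][c + dc] - template[dr][dc]) ** 2
                for dr in range(templ_h)
                for dc in range(templ_w)
            )
            if score < best_score:  # lower means more similar
                best_pos, best_score = (r, c), score
    return best_pos, best_score

# Hypothetical 5x6 grayscale image containing a 2x2 "logo" patch.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 9, 1, 0],
    [0, 0, 0, 1, 9, 0],
    [0, 0, 0, 0, 0, 0],
]
logo = [
    [9, 1],
    [1, 9],
]

position, score = match_template(image, logo)
print(position, score)
```

A score of zero means a pixel-perfect match; real product images would need a threshold on the score, and usually some normalization for lighting and scale.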
Managed features: coverage achieved by merchandise category
• Hardline (Consumer Electronics, etc.): Brand, Color, Product; up to 95%
• Soft line (Apparel): up to 80%
Unmanaged features: coverage achieved by merchandise category
• Hardline (Consumer Electronics, etc.): MPN, UPC; up to 80%
• Soft line (Apparel): up to 70%
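The slides do not define how coverage is measured. A minimal sketch, assuming coverage means the share of products for which a non-empty value was extracted for a given attribute (the product records are hypothetical):

```python
def coverage(products, attribute):
    """Fraction of products with a non-empty value for the attribute."""
    if not products:
        return 0.0
    filled = sum(1 for product in products if product.get(attribute))
    return filled / len(products)

# Hypothetical extraction results for four products.
products = [
    {"brand": "Footsoul", "color": "black", "mpn": "FS-100"},
    {"brand": "Footsoul", "color": "", "mpn": ""},
    {"brand": "", "color": "red", "mpn": "AB-7"},
    {"brand": "Acme", "color": "blue"},
]

print(coverage(products, "brand"))  # 3 of 4 products have a brand
print(coverage(products, "mpn"))    # 2 of 4 products have an MPN
```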
Product Matching
• Exact, Similar matches
01 Getting Classification done
• Correct classification gives us right set of attributes.
02 Attribute Extraction
• Maximizing attribute coverage
• Brand, MPN/UPC, Category specific enforcer attribute
03 Compression / Clustering
• Allows us to work on scale
• Hierarchical Agglomerative Clustering
04 Associations
• Associative rule matching
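To illustrate how Hierarchical Agglomerative Clustering can shrink the matching problem, the sketch below groups listings by title similarity and then labels pairs inside each cluster as exact or similar matches. The listings, the Jaccard distance on title tokens, the average linkage, the 0.5 cut-off, and the rule "same brand and same MPN means exact" are all assumptions made for this example, not the rules used in the talk.

```python
from itertools import combinations

# Hypothetical listings from four retailers.
LISTINGS = [
    {"retailer": "A", "title": "acme x100 wireless mouse black", "brand": "acme", "mpn": "X100"},
    {"retailer": "B", "title": "acme wireless mouse x100 black", "brand": "acme", "mpn": "X100"},
    {"retailer": "C", "title": "acme x200 wireless mouse black", "brand": "acme", "mpn": "X200"},
    {"retailer": "D", "title": "zenith leather tote handbag brown", "brand": "zenith", "mpn": ""},
]

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def agglomerative_clusters(items, max_distance):
    """Average-linkage HAC: repeatedly merge the two closest clusters."""
    toks = [set(item["title"].lower().split()) for item in items]
    clusters = [[i] for i in range(len(items))]  # start with singletons

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(jaccard_distance(toks[i], toks[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        (x, y), dist = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters

def match_type(a, b):
    # Assumed rule: identical brand and non-empty identical MPN is an exact match.
    if a["brand"] == b["brand"] and a["mpn"] and a["mpn"] == b["mpn"]:
        return "exact"
    return "similar"

clusters = agglomerative_clusters(LISTINGS, max_distance=0.5)
matches = {
    (i, j): match_type(LISTINGS[i], LISTINGS[j])
    for cluster in clusters
    for i, j in combinations(sorted(cluster), 2)
}
print(clusters)
print(matches)
```

Only pairs inside the same cluster are compared, so the number of candidate comparisons drops sharply relative to comparing every listing with every other one, which is what lets matching work at scale.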
www.ugamsolutions.com
Disclaimer:
The information set out in this presentation is produced by Ugam Solutions (“the Company” or “Ugam”) and is being made available AS IS to recipients
solely for information purposes only. This presentation and its contents are strictly confidential to Ugam and may not be used, reproduced, redistributed
or transmitted, passed on or published, in whole or in part, to any other person for any purpose whatsoever.