Building Catalogues for
MSMEs
Deepak Sharma, Bhanu Pratap
Goal
• Give millions of small enterprises access to a structured catalogue
• Faster, improved search
• Aggregation
• Services such as HSN codes, tax rates, etc.
Approach
• Bootstrap the catalogue from raw product descriptions available with existing customers
  • Existing customers create “masters” for inventory management and invoicing
  • These masters are product descriptions and hierarchies, but highly specific to the customer (since Tally imposes minimal structure on this ontology)
    • GK Adv mat 10 pcs (Good Knight Advance Mat)
    • Ks - Deo on. 225 (Kamasutra Deodorant)
    • Dab Real Pin Rs99 (Dabur Real Pineapple)
  • Extremely rich data covering the breadth and depth of SKUs, including items we cannot find elsewhere
    • Guru Essence EDP 100ml
    • Lal Prive Rose Royale 100ml
Building Catalogue
Raw product descriptions:
GK Adv Fast Card mat 10 pcs
Good Night Adv mat 10 pcs
Good Knight Advance 10 pcs
Good Knight Advance mosquito mat 10 pcs
Good Knight Advance refill 45ml
GK Adv active+ liquid refill 45ml

Published Stock Item in Catalogue
Title: Good Knight Advanced Active+ Liquid Refill
Category: Mosquito Repellent > Liquid > Refill
Brand: Good Knight
Unit of Measurement: {45, ml}

Published Stock Item in Catalogue
Title: Good Knight Advanced Mosquito Mat
Category: Mosquito Repellent > Mat
Brand: Good Knight
Unit of Measurement: {10, pcs}
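A published stock item as shown above can be thought of as a small record. The following is a minimal sketch only: the `StockItem` class, the `parse_uom` helper and the list of unit spellings are hypothetical names introduced for illustration, and the slides do not say how the unit of measurement is actually resolved.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: a tiny, hand-picked list of unit spellings seen in the examples.
_UOM = re.compile(r"(\d+(?:\.\d+)?)\s*(pcs|ml|gm|g)\b", re.IGNORECASE)


@dataclass(frozen=True)
class StockItem:
    """Hypothetical record mirroring the fields on the slide."""
    title: str
    category: Tuple[str, ...]
    brand: str
    unit_of_measurement: Tuple[float, str]


def parse_uom(description: str) -> Optional[Tuple[float, str]]:
    """Return (quantity, unit) for the first quantity-unit pair found, else None."""
    match = _UOM.search(description)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


refill = StockItem(
    title="Good Knight Advanced Active+ Liquid Refill",
    category=("Mosquito Repellent", "Liquid", "Refill"),
    brand="Good Knight",
    unit_of_measurement=parse_uom("GK Adv active+ liquid refill 45ml"),
)
```

A regular expression like this only covers well-formed quantities; descriptions such as "Good Knight Advance 10 pcs" parse, while prices and pack sizes written in other ways would need separate handling.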
Problem and Challenges
• Summarizing the problem
  • Cluster/deduplicate raw input data and resolve key attributes to publish a Stock Item
    • Deduplicate raw product data
    • Resolve key attributes in the deduplicated cluster
• Challenges
  • Highly contextual product descriptions with very little common grammar
  • Uncommon abbreviations, transliterations, misspellings, etc. of attributes
  • Lack of reliable attribute information, unlike product data from e-commerce websites
  • High volume with high product variation
  • The reference data against which we need to resolve must be built simultaneously
First Iteration
• Use the dedupe algorithm from Mikhail Bilenko's PhD dissertation, "Learnable Similarity Functions and Their Application to Record Linkage and Clustering"
  • Essentially, dedupe using learnable similarity methods based on edit distance
  • The first step is to roughly group the data into similar SKUs, then learn and calculate similarity only within those groups/blocks
  • A different similarity function is learnt for each block
• We tried various mechanisms for blocking. A challenge is that no other attribute (brand, category) is explicitly present in the data that could be used for blocking
  • Token-based blocking
  • Word-vector-based clustering, with the clusters used as blocks
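Token-based blocking can be sketched as an inverted index from tokens to descriptions. This is an illustrative assumption about what "token based" means here, not the authors' implementation; `token_blocks` is a hypothetical helper, and the sample strings are taken from the slides.

```python
from collections import defaultdict


def token_blocks(descriptions):
    """Map each lower-cased token to the set of descriptions containing it."""
    blocks = defaultdict(set)
    for text in descriptions:
        for token in text.lower().split():
            blocks[token].add(text)
    return blocks


skus = [
    "Good Knight Advance 10 pcs",
    "Good Knight Advance mosquito mat 10 pcs",
    "Good Knight Advance refill 45ml",
    "Hair and Care Advance 45ml",
    "GK Adv Fast Card mat 10 pcs",
]
blocks = token_blocks(skus)

# The block keyed on "advance" mixes unrelated products and misses abbreviations.
for sku in sorted(blocks["advance"]):
    print(sku)
```

In this sketch the shared token "advance" pulls a hair-care product into the same block as the mosquito repellents, while the abbreviated "GK Adv" description is left out.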
First Pipeline
Raw SKUs → Segment Identification → Blocking/High Level Clustering → Learn Similarity Model within Each Linkage → Dedupe Output
Blocking/High Level Clustering options:
• Token Based
• Clustering using Word Vectors
Dedupe Output → Sample and Manually Label → Active Learning
Challenges with Blocking
A GOOD POTENTIAL BLOCK
GK Adv Fast Card mat 10 pcs
Good Night Adv mat 10 pcs
Good Knight Advance 10 pcs
Good Knight Advance mosquito mat 10 pcs
Good Knight Advance refill 45ml
GK Adv active+ liquid refill 45ml
Good Knight mosquito mat 10 pcs
What we get in reality
Good Knight Advance 10 pcs
Good Knight Advance mosquito mat 10 pcs
Good Knight Advance refill 45ml
Hair and Care Advance 45ml
Lexical Similarity
• Another key challenge is that a lexical pair-HMM model was insufficient, as it cannot take into account the context required to find similarity
  • Quite close in edit distance but quite different SKUs:
    Good Knight Advance refill
    Good Knight Advance mat
  • Quite different in edit distance but actually the same SKU:
    GK Adv mat
    Good Knight Advance mat
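The two pairs above can be checked with a plain Levenshtein distance. Note that this is a simpler stand-in: the first iteration used a learnable edit distance, whereas the sketch below uses fixed unit costs purely to illustrate why lexical distance alone misleads.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


different_skus = levenshtein("Good Knight Advance refill", "Good Knight Advance mat")
same_sku = levenshtein("GK Adv mat", "Good Knight Advance mat")

# The different products are lexically closer than the two spellings of one product.
print(different_skus < same_sku)
```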
A few things we learned
• Blocking needs to be more precise
  • Since attributes like category form natural groupings, we started thinking about extracting attributes
• The blocking step needs to scale. For example, the clustering we were working on uses clusters as blocks, but it requires O(N²) computations
• Learnable lexical similarity cannot model the variations in the SKUs. Similarity requires a representation of the SKU that takes context into account
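To make the O(N²) concern concrete, the sketch below counts pairwise comparisons with and without blocking. The catalogue size and block sizes are made-up numbers chosen only for the arithmetic; they do not come from the talk.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pair_count(block_sizes) -> int:
    """Pairs compared when similarity is only computed inside each block."""
    return sum(pair_count(size) for size in block_sizes)


# Hypothetical scale: one million SKUs, split into 1,000 equal blocks.
all_pairs = pair_count(1_000_000)
blocked_pairs = blocked_pair_count([1_000] * 1_000)
print(all_pairs, blocked_pairs)
```

Under these assumed sizes, blocking cuts the comparisons by roughly a factor of a thousand, which is why the blocking step itself must not require all-pairs work.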
Second Iteration – Improving blocking with attribute extraction
• Extract attributes like brand and category from the product title and use them for blocking
  • We built our own NER model, a BiLSTM-CRF, to detect attributes
    • Guineng Zheng et al., Open Attribute Value Extraction from Product Profiles, https://arxiv.org/pdf/1806.01264.pdf
  • We use word2vec representations of the attributes to cluster SKUs, and use the clusters for blocking
• Use hashing techniques to allow faster neighbourhood search, from which clusters are evolved
  • We use LSH to cluster and block
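One common form of LSH for dense vectors is random-hyperplane hashing, sketched below. This is an assumption about the flavour of LSH, since the slides do not say which scheme was used; the three-dimensional vectors are toy stand-ins for word2vec attribute embeddings, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def lsh_blocks(vectors, hyperplanes):
    """Group keys whose vectors share the same signature into one block."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(key)
    return dict(buckets)


planes = make_hyperplanes(dim=3, n_bits=8)
toy_vectors = {            # hypothetical embeddings, not real word2vec output
    "a": [1.0, 0.5, -0.2],
    "b": [2.0, 1.0, -0.4],  # same direction as "a", so same signature
    "c": [-1.0, 0.3, 0.9],
}
blocks = lsh_blocks(toy_vectors, planes)
```

Each SKU is hashed once, so candidate neighbours are found by bucket lookup rather than by comparing every pair.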
Raw SKUs:
GK Adv Fast Card mat 10 pcs
Good Night Adv mat 10 pcs
Good Knight Advance 10 pcs
Good Knight Advance mosquito mat 10 pcs
Good Knight Advance refill 45ml
GK Adv active+ liquid refill 45ml
Good Knight mosquito mat 10 pcs
Attributes tagged in each SKU (Brand, Category, measurement), for example:
GK Adv Fast Card mat 10 pcs
Good Night Adv mat 10 pcs
Good Knight Advance mosquito mat 10 pcs
Second Pipeline
Raw SKUs → Segment Identification → Attribute Extraction → Clustering/Blocking using Attributes → Learn Similarity Model within Each Linkage → Dedupe Output
Dedupe Output → Sample and Manually Label → Active Learning
Blocking with attributes
A GOOD POTENTIAL BLOCK
GK Adv Fast Card mat 10 pcs
Good Night Adv mat 10 pcs
Good Knight Advance 10 pcs
Good Knight Advance mosquito mat 10 pcs
Good Knight Advance refill 45ml
GK Adv active+ liquid refill 45ml
Good Knight mosquito mat 10 pcs
Result of blocking with attributes
Good Night Adv mat 10 pcs
Good Knight Advance 10 pcs
Good Knight Advance mosquito mat 10 pcs
Good Knight mosquito mat 10 pcs
Good Knight Advance refill 45ml
GK Adv active+ liquid refill 45ml
GK Adv Fast Card mat 10 pcs
Third Iteration – Improving similarity within a block
• Learning similarity using an affine edit distance model was not sufficient
• We combine multiple features to calculate similarity
  • Word2Vec representation
  • Attribute features
  • Lexical features
    • Soundex
    • Abbreviations of brands/categories
    • Non-vowel string match
• We are experimenting with supervised models to combine these different features
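The lexical features listed above could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, Soundex is the standard American variant, and the abbreviation test is a simple "same first letter plus in-order subsequence" rule rather than whatever rule the authors used.

```python
import re

# Standard American Soundex digit groups.
_SOUNDEX = {ch: digit
            for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                   ("l", "4"), ("mn", "5"), ("r", "6"))
            for ch in letters}


def soundex(word: str) -> str:
    """Four-character phonetic code; 'h' and 'w' do not separate equal codes."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = _SOUNDEX.get(word[0], "")
    for ch in word[1:]:
        code = _SOUNDEX.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":
            prev = code
    return (out + "000")[:4]


def consonant_skeleton(text: str) -> str:
    """Lower-case letters only, with vowels removed (non-vowel string match)."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    return re.sub(r"[aeiou]", "", letters)


def is_abbreviation(short: str, full: str) -> bool:
    """True if short starts like full and its letters appear in order in full."""
    short = re.sub(r"[^a-z]", "", short.lower())
    full = re.sub(r"[^a-z]", "", full.lower())
    if not short or not full or short[0] != full[0]:
        return False
    remaining = iter(full)
    return all(ch in remaining for ch in short)


print(is_abbreviation("GK", "Good Knight"), is_abbreviation("Adv", "Advance"))
```

Features like these would then be fed, alongside the Word2Vec and attribute features, into whichever supervised model combines them.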
Third Pipeline
Raw SKUs → Segment Identification → Attribute Extraction → Clustering/Blocking using Attributes → Similarity Calculation using Distributed and Lexical Features → Dedupe Output
Dedupe Output → Sample and Manually Label → Active Learning
['00lifebuoy soap care 125g',
'lifebuoy care soap 100gm',
'lifebuoy care soap 62g',
'lifebuoy soap care 125gm',
'100lifebuoy soap care 60g',
'lb soap care 144*59 rs.10',
'lifebuoy soap care 125g',
'lifebuoy care soap 26 mrp',
'lifebuoy care soap 125 g',
'lifebuoy care soap',
'lifebuoy total soap 62g with save 20%',
'lifebuoy soap 26rs',
'lifebuoy soap 27rs',
'lifebuoy soap ₹26/-',
'lifebuoy soap 5/-',
'lifebuoy soap 3*100',
'lifebuoy soap ₹10/-',
'lifebuoy soap 30rs',
'lifebuoy soap',
'lifebuoy soap 24rs',
'lifebuoy soap ₹66/-',
'lifebuoy soap',
'lifebuoy soap ₹27/-',
'lifebuoy soap 94rs',
'lifebuoy soap 29rs',
'lifebuoy soap 28rs',
'lifebuoy soap 25',
'lifebuoy soap care rs.10',
'lifebuoy soap 4pcs 125g',
'lifebuoy care 59g',
'lifebuoy soap nature 125g',
'lifebuoy care 125g',
'lifebuoy care 100g',
'lifebuoy nature soap 125g',
'doy care soap',
'acnelak pimple care soap 75g',
'lifebuoy soap set 18%']
Some of the final clusters found in the block above:
lifebuoy soap 94rs
lifebuoy soap 29rs
lifebuoy soap 28rs
00lifebuoy soap care 125g
lifebuoy care soap 62g
100lifebuoy soap care 60g
lifebuoy soap care 125g
lifebuoy care soap 125 g
lifebuoy care soap
lifebuoy soap ₹26/- -- sp
lifebuoy soap 5/- -- sp
lifebuoy soap
lifebuoy soap 25
Active Learning
• We use an active learning approach to improve our clustering algorithms
  • We provide samples from our clusters for manual evaluation
  • We provide paired products for labelling
    • Dissimilar products from the same cluster
    • Similar products from different clusters
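The pair selection described above might be sketched as follows. The similarity function here is `difflib.SequenceMatcher` from the Python standard library, used only as a placeholder for the learned similarity model; `pairs_to_label` and the toy cluster assignment are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; stands in for the learned model."""
    return SequenceMatcher(None, a, b).ratio()


def pairs_to_label(clusters, k=1):
    """Pick the k least similar within-cluster pairs and the k most similar
    cross-cluster pairs, since these are the likeliest clustering mistakes."""
    within, across = [], []
    ids = list(clusters)
    for cid in ids:
        for a, b in combinations(clusters[cid], 2):
            within.append((similarity(a, b), a, b))
    for c1, c2 in combinations(ids, 2):
        for a in clusters[c1]:
            for b in clusters[c2]:
                across.append((similarity(a, b), a, b))
    within.sort()              # least similar first
    across.sort(reverse=True)  # most similar first
    return ([(a, b) for _, a, b in within[:k]],
            [(a, b) for _, a, b in across[:k]])


toy_clusters = {  # hypothetical assignment of strings from the Lifebuoy block
    0: ["lifebuoy soap care 125g", "lifebuoy care soap 125 g",
        "acnelak pimple care soap 75g"],
    1: ["lifebuoy soap care 125gm"],
}
suspect_within, suspect_across = pairs_to_label(toy_clusters)
```

Labels on these pairs tell the annotator-in-the-loop whether a cluster should be split or two clusters merged.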
Challenges/Next Steps
• Attribute extraction coverage has to be improved
• The supervised learning model for calculating similarity needs improvement
