During this session we will cover best practices for implementing a product catalog with MongoDB. We will cover how to model an item properly when it can have thousands of variations and thousands of properties of interest. You'll learn how to index properly and support faceted search with millisecond response latency, and how to implement per-store, per-SKU pricing while still keeping a sane number of documents. We will also cover operational considerations, such as how to bring the data closer to users to cut down network latency.
4. The many catalogs problem
1. One department in charge of the master product data works hard at fitting data into SQL tables
2. The resulting data sits in a SQL server with a couple of replicas. It's forbidden to hit it more than 100 times / sec
3. Other departments need to access the data far more often for their own services
4. Other departments need more information that is not available, since it did not fit in that long-devised, rigid SQL schema
5. ETLs and message buses are put in place for other teams to try to figure it out themselves…
6. Data becomes inconsistent, fragmented, not up to date…
The problem is visible both internally and to customers!
5. Search – Using Solr
How many catalogs and catalog caches do you have?
6. The many catalogs problem
[Diagram: the Product Department master catalog, a message bus and ETLs, and separate catalogs for the Online Store, Marketing, Department 1, Department 3, Department 4 and Department 5.]
Dozens of catalogs!
7. Goal: Single View of Product
• Single view of a product, one central catalog service
• Flexible schema containing all useful data
• Read volume high and sustained, 100k reads / s
• Can seamlessly take write spikes during catalog updates
• Advanced indexing and querying
• Geographical distribution for HA and low latency
8. Agenda
1. MongoDB Overview
2. Catalog Service Architecture
3. Data Store Models
4. Product Search
10. MongoDB is a great fit
• Holds complex JSON structures
• Dynamic schema for agility
• Complex querying and in-place updating
• Secondary, compound and geo indexing
• Full consistency, durability, atomic operations
• HA and geo-distribution via replication
• Near-linear scaling via sharding
• Overall, MongoDB is a unique fit!
12. Build your data to fit your application
MongoDB:
{ customer_id: 1,
  name: "Mark Smith",
  city: "San Francisco",
  orders: [
    { order_number: 13,
      store_id: 10,
      date: "2014-01-03",
      products: [
        { SKU: 24578234, Qty: 3, Unit_price: 350 },
        { SKU: 98762345, Qty: 1, Unit_price: 110 }
      ]
    },
    { <...> }
  ]
}
Relational:
CustomerID  First Name  Last Name  City
0           John        Doe        New York
1           Mark        Smith      San Francisco
2           Jay         Black      Newark
3           Meagan      White      London
4           Edward      Danields   Boston

Order Number  Store ID  Product       Customer ID
10            100       Tablet        0
11            101       Smartphone    0
12            101       Dishwasher    0
13            200       Sofa          1
14            200       Coffee table  1
15            201       Suit          2
15. Architecture Overview
[Diagram: architecture overview. Labels, grouped as far as the layout can be recovered:
Information Management: Merchandising, Content, Inventory, Customer, Channel, Sales & Fulfillment, Insight, Social
Customer Channels: Amazon, Ebay, …; Stores (POS, Kiosk, …); Mobile (Smartphone, Tablet); Website; Contact Center; Social (Facebook, Twitter, …)
Application Servers, API, Data and Service Integration, Web Servers
Suppliers, Supply Chain Management System, Data Warehouse, Analytics, 3rd Party, In Network]
18. Merchandising - Architecture
[Diagram: a MongoDB data store (Items, Variants, Pricing, Promotions, Ratings & Reviews, …) and a search engine sit behind the Product Service API, which serves the Online Store, Marketing, Inventory, SCMS, Public API, …]
20. Models - Product Page
[Screenshot of a product page, annotated with: Product images, General Information, List of Variants, External Information, Localized Description.]
21. Models - Overview
• Item: the overall product info (e.g. Levi's 501)
• Variant: a specific variant of an item (e.g. in black, size 6), which typically has a specific SKU / UPC
• Price: price information may vary based on the store, the variant, etc.
• Hierarchy: the item taxonomy
• Facet: facets to search products by
• Vendors: a given SKU may be available through several vendors if the site is a marketplace
> Don't try to fit it all in the same document!
22. Models – Overview
[Image callouts: One item, hundreds of sizes, dozens of colors]
23. Models - Overview
• A single item may have thousands of variants
• Each variant can have hundreds of attributes
• Altogether a single item can represent many MBs worth of JSON text
• Don't try to fit everything into the same document!
• Use a schema that is natural and fits the API
24. Models - Item Model
{ "_id": "054VA72303012P",   // the item id
  "desc": [   // item descriptions
    { "lang": "en", "val": "Give your dressy look a lift with ..." }, ...
  ],
  "name": "Women's Kate Ivory Peep-Toe Stiletto Heel",
  "category": "/84700/80009/1282094266/1200003270",   // hierarchy
  "brand": { "id": "2483510", "img": "http://...", "name": "Metaphor" },
  "assets": {   // references to all assets
    "imgs": [
      { "img": { "width": 1900, "height": 1900, "src": "http://..." } }, ...
    ]
  },
  "shipping": { /* shipping specs */ },
  "specs": { /* item specs */ },
  "attrs": [   // list of item attributes (facets)
    { "name": "Heel Height", "value": "High (2-1/2 to 4 in.)" },
    { "name": "Toe", "value": "Open toe" }, ...
  ],
  "variants": {   // quick info on the variants
    "cnt": 9,
    "attrs": [
      { "dispType": "DROPDOWN", "name": "Color" },
      { "dispType": "DROPDOWN", "name": "Shoe Size" }, ...
    ]
  },
  "lastUpdated": 1400877254787   // keep track of updates
}
25. Models - Item Model
• Get item by id
db.definition.findOne( { _id: "301671" } )
• Get items from a list of ids (find, not findOne, so that all matches are returned)
db.definition.find( { _id: { $in: [ "301671", "301672" ] } } )
• Get items by department
db.definition.find( { category: { $regex: "^/84700/" } } )
• Get items by category prefix
db.definition.find( { category: { $regex: "^/84700/80009/" } } )
• Secondary Indices
name, category, lastUpdated
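The category queries above rely on a regex anchored with ^, which is the form that lets MongoDB walk the category index as a prefix scan. The sketch below shows one way to build such a filter safely; the helper names escapeRegex and categoryPrefixFilter are hypothetical, and the sample paths are made up so the check can run without a database.

```javascript
// Escape regex metacharacters so a category path is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the filter for "all items below this category path" (hypothetical helper).
// The trailing "/" stops "/84700/80009" from also matching "/84700/800091/...".
function categoryPrefixFilter(path) {
  const prefix = path.endsWith("/") ? path : path + "/";
  return { category: { $regex: "^" + escapeRegex(prefix) } };
}

const filter = categoryPrefixFilter("/84700/80009");
// In the mongo shell this would be used as: db.definition.find(filter)

// Local check against made-up sample paths, no database needed.
const matcher = new RegExp(filter.category.$regex);
const samplePaths = [
  "/84700/80009/1282094266/1200003270",
  "/84700/800091/555",
  "/99999/84700/80009/1",
];
const matched = samplePaths.filter((p) => matcher.test(p));
console.log(matched);
```

Note that with the trailing slash an item filed directly at "/84700/80009" is not matched; whether that case exists depends on how categories are assigned.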
26. Models – Variant Model
{ "_id": "05458452563",   // the sku
  "name": "Width:Medium,Color:Ivory,Shoe Size:6.5",
  "itemId": "054VA72303012P",   // reference to the item id
  "altIds": { "upc": "632576103580" },
  "assets": {   // list of assets specific to variant
    "imgs": [
      { "width": 1900, "height": 1900, "src": "http://..." },
      { "width": 1900, "height": 1900, "src": "http://..." }, ...
    ]
  },
  "attrs": [   // list of attributes specific to variant
    { "name": "Width", "value": "Medium" },
    { "name": "Color", "family": "White", "value": "Ivory" },
    { "name": "Size", "value": "6.5" }, ...
  ],
  "lastUpdated": 1400877254787   // keep track of updates
}
27. Models – Variant Model
• Get variant from SKU
db.variant.find( { _id: "05458452563" } )
• Get all variants for a product, sorted by SKU
db.variant.find( { itemId: "054VA72303012P" } ).sort( { _id: 1 } )
• Indices
itemId, lastUpdated
28. Models - Hierarchy
{
"_id": "1200003270", // the node id
"name": "Women's Heels & Pumps",
"count": 22305, // how many items in this category
"parents": [ // list of parents
"1282094266"
],
"facets": [ // facets that exist for this category
"Heel Height",
"Toe",
"Upper Material",
"Width",
"Shoe Size",
"Color"
]
}
29. Models – Hierarchy
• Get hierarchy node by id
db.hierarchy.find( { _id: "1200003270" } )
• Get hierarchy node from parent id
db.hierarchy.find( { parents: "1282094266" } )
• Get departments (no parent)
db.hierarchy.find( { parents: null } )
• Secondary Indices
parents
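The item model stores its category as a path string, while each hierarchy node only lists its parents. A minimal sketch of how the two could be tied together, assuming the path is obtained by walking up through the first parent of each node. The in-memory node map, the parent links of the intermediate nodes, and the categoryPath helper are all assumptions made for illustration.

```javascript
// Hypothetical in-memory copy of hierarchy nodes, keyed by _id.
// Only the last parent link comes from the slide; the others are assumed.
const nodes = new Map([
  ["84700", { _id: "84700", parents: [] }],
  ["80009", { _id: "80009", parents: ["84700"] }],
  ["1282094266", { _id: "1282094266", parents: ["80009"] }],
  ["1200003270", { _id: "1200003270", parents: ["1282094266"] }],
]);

// Walk up via the first parent until a node without parents is reached.
// The "seen" set guards against cycles in bad data.
function categoryPath(nodeId, nodesById) {
  const ids = [];
  const seen = new Set();
  let current = nodesById.get(nodeId);
  while (current && !seen.has(current._id)) {
    seen.add(current._id);
    ids.unshift(current._id);
    const parentId = (current.parents || [])[0];
    current = parentId ? nodesById.get(parentId) : undefined;
  }
  return "/" + ids.join("/");
}

const path = categoryPath("1200003270", nodes);
console.log(path);
```

A node with several parents would yield several paths; this sketch deliberately follows only the first one.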
30. Models – per Store Pricing
Per-store pricing could result in billions of documents… unless it is built in a modular way:
_id: concatenation of item and store.
Item: can be an item id or a variant id (SKU).
Store: can be a store group (e.g. online) or a store id.
{ "_id": "skuSPM8824542513_1234/store123",
  "price": 69.99,
  "sale": {
    "salePrice": 42.72,
    "saleEndDate": "2050-12-31 23:59:59"
  },
  "lastUpdated": 1374647707394
}
31. Models – per store Pricing
• Get all prices for a given item
db.prices.find( { _id: /^item301671/ } )
• Get all prices for a given sku (price could be at item level)
db.prices.find( { _id: { $in: [ /^sku730223104376/, /^item301671/ ] } } )
• Get minimum and maximum prices for a sku
db.prices.aggregate( [ { $match: { ... } },
                       { $group: { _id: 1, min: { $min: "$price" },
                                   max: { $max: "$price" } } } ] )
• Get price for a sku and store id (returns up to 4 prices)
db.prices.find( { _id: { $in: [ "sku730223104376/store1234",
                                "sku730223104376/sgroup0",
                                "item301671/store1234",
                                "item301671/sgroup0" ] } },
                { price: 1 } )
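The last lookup can return up to four price documents, so the application still has to pick one. The sketch below builds the four candidate ids and keeps the most specific match. The precedence order (sku before item, store before store group) is an assumption, not something the slides state, and the helper names and sample prices are hypothetical.

```javascript
// Candidate _ids for one sku in one store, most specific first (assumed order).
function priceCandidateIds(sku, itemId, storeId, storeGroup) {
  return [
    `sku${sku}/store${storeId}`,
    `sku${sku}/${storeGroup}`,
    `item${itemId}/store${storeId}`,
    `item${itemId}/${storeGroup}`,
  ];
}

// Pick the first candidate for which a price document came back.
function resolvePrice(candidateIds, priceDocs) {
  const byId = new Map(priceDocs.map((doc) => [doc._id, doc]));
  for (const id of candidateIds) {
    if (byId.has(id)) return byId.get(id);
  }
  return null;
}

const ids = priceCandidateIds("730223104376", "301671", "1234", "sgroup0");
// In the mongo shell: db.prices.find({ _id: { $in: ids } }, { price: 1 })

// Made-up result set: only two of the four documents exist.
const found = [
  { _id: "item301671/sgroup0", price: 69.99 },
  { _id: "sku730223104376/sgroup0", price: 59.99 },
];
const best = resolvePrice(ids, found);
console.log(best);
```

Because every candidate is an exact _id, the lookup stays on the primary index and no per-store document is needed when a store simply inherits the group or item price.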
33. Search – Browse and Search products
[Screenshot of a browse / search results page, annotated with: Browse by category, Special Lists, Filter by attributes, Lists hundreds of item summaries.]
By far the toughest page to get right and fast …
34. Search – Browse and Search products
The previous page presents many challenges:
• Response within milliseconds for hundreds of items
• Faceted search on many attributes: category, brand, …
• Efficient sorting on several attributes: price, popularity
• Pagination feature which requires deterministic ordering
> Search engines are built for this purpose!
35. Search – Traditional Architecture
[Diagram: the Product Data Store is indexed into Product Search, and its data is pre-joined into objects held in a Cache. The Application #1 obtains search result IDs, then #2 obtains objects by ID from the cache or DB.]
36. Search – Traditional Architecture
Issues with the traditional architecture:
• 3 different systems to maintain: RDBMS, search engine, caching layer
• RDBMS schema is complex and static
• Applications need to speak many languages
37. Search – Architecture with MongoDB
[Diagram: MongoDB is the Product Data Store, holding ready-to-use product documents, and is indexed into the Search Engine (Product Search). Applications #1 obtain search result IDs, then #2 obtain objects by list of IDs. With the Product API in front, the application issues a single query.]
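One detail of the two-step flow (IDs from the search engine, then documents by list of IDs): a query with $in does not return documents in the order of the ID list, so the ranking has to be restored in the application. A minimal sketch, with a hypothetical helper name and made-up documents:

```javascript
// Put fetched documents back into the order given by the search engine.
// IDs with no matching document (e.g. deleted since indexing) are skipped.
function orderBySearchRank(rankedIds, docs) {
  const byId = new Map(docs.map((doc) => [doc._id, doc]));
  return rankedIds.filter((id) => byId.has(id)).map((id) => byId.get(id));
}

// Step #1 (assumed): the search engine returned these ids, best match first.
const rankedIds = ["p3", "p1", "p2"];

// Step #2 (assumed): result of find({ _id: { $in: rankedIds } }), in storage order.
const fetched = [
  { _id: "p1", name: "Pump" },
  { _id: "p3", name: "Sandal" },
];

const ordered = orderBySearchRank(rankedIds, fetched);
console.log(ordered.map((doc) => doc._id));
```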
38. Search - Mongo-Connector
[Diagram: Mongo Connector sits between MongoDB and the Search Engine. #1 Initial dump of the collections. #2 Updates streaming via the oplog. The connector handles translation and filtering before indexing.]
39. Search - Mongo-Connector
• Open-source project at https://github.com/10gen-labs/mongo-connector
• Python app that reads from MongoDB's oplog and publishes to the target of your choice
• Supports initial sync by dumping the data
• Default connectors for Solr, Elasticsearch, and another MongoDB cluster
• Easily extensible to update other systems such as SQL
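To make the oplog-streaming idea concrete, here is a heavily simplified sketch of applying oplog-style entries to a target index. It is not mongo-connector's code (which is Python). It assumes the classic entry shape with op "i" / "u" / "d", the namespace in ns, the document or modifier in o and the update selector in o2, and it handles only top-level $set / $unset. The Map standing in for the search index and the applyOplogEntry helper are hypothetical.

```javascript
// Apply one oplog-style entry to an in-memory "index" (a Map keyed by _id).
// Returns true when the entry was applied, false when it was filtered out.
function applyOplogEntry(index, entry, namespace) {
  if (entry.ns !== namespace) return false; // filtering: other collections
  switch (entry.op) {
    case "i": // insert: o holds the full document
      index.set(entry.o._id, { ...entry.o });
      return true;
    case "u": { // update: o2 holds the selector, o the modifier or replacement
      const id = entry.o2._id;
      if (entry.o.$set || entry.o.$unset) {
        const next = { ...(index.get(id) || { _id: id }), ...(entry.o.$set || {}) };
        for (const key of Object.keys(entry.o.$unset || {})) delete next[key];
        index.set(id, next);
      } else {
        index.set(id, { ...entry.o, _id: id }); // full-document replacement
      }
      return true;
    }
    case "d": // delete: o holds the _id
      index.delete(entry.o._id);
      return true;
    default: // no-ops, commands, etc. are ignored in this sketch
      return false;
  }
}

const index = new Map();
const ns = "catalog.summary";
const applied = [
  { op: "i", ns, o: { _id: "p1", name: "Pump", price: 70 } },
  { op: "u", ns, o2: { _id: "p1" }, o: { $set: { price: 65 } } },
  { op: "i", ns: "catalog.other", o: { _id: "x1" } },
  { op: "i", ns, o: { _id: "p2", name: "Sandal" } },
  { op: "d", ns, o: { _id: "p2" } },
].map((entry) => applyOplogEntry(index, entry, ns));
console.log(applied, [...index.values()]);
```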
41. Search – More Searching
[Screenshot of search results, annotated with: Images of the matching variants are displayed, Price and Rating, Facets for variants.]
42. Search – More Searching
… more challenges:
• Attributes at the variant level: color, size, etc.
• Attributes from other docs: pricing, ratings, etc.
• Display the matching variant's image and details
• Thousands of matching variants for an item, still need to display a single item
• Challenge to properly index the data
> Need for a single summary document per item
43. Search - Architecture
[Diagram: MongoDB Data Store containing Items, Variants, Summaries, Pricing, Promotions, Ratings & Reviews.]
44. Search – Summary Model
{ "_id": "3ZZVA46759401P",   // the item id
  "name": "Women's Chic - Black Velvet Suede",
  "dep": "84700",   // useful as standalone for indexing
  "cat": "/84700/80009/1282094266/1200003270",
  "desc": { "lang": "en", "val": "This pointy toe slingback ..." },
  "img": { "width": 450, "height": 330, "src": "http://..." },
  "attrs": [   // global attributes, easily indexable by SE
    "heel height=mid (1-3/4 to 2-1/4 in.)",
    "brand=metaphor",
    "shoe size=6",
    "shoe size=6.5", ...
  ],
  "sattrs": [   // global attributes, not to be indexed
    "upper material=synthetic",
    "toe=open toe", ...
  ],
  "vars": [
    { "id": "05497884001",
      "img": [ /* images */ ],
      "attrs": [ /* list of variant attributes to index */ ],
      "sattrs": [ /* list of variant attributes not to index */ ] }, …
  ]
}
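The summary stores attributes as lowercase "name=value" strings, including values such as shoe sizes that live on the variants. A minimal sketch of how such a list could be derived from the item and variant models; the helper names and the rule of merging variant attributes into the top-level list are assumptions inferred from the sample document, not a documented procedure.

```javascript
// Turn [{ name, value }] attribute objects into lowercase "name=value" strings.
function toFacetStrings(attrs) {
  return attrs.map((attr) => `${attr.name}=${attr.value}`.toLowerCase());
}

// Merge item-level and variant-level attributes, dropping duplicates
// while keeping first-seen order (hypothetical helper).
function buildSummaryAttrs(item, variants) {
  const all = new Set(toFacetStrings(item.attrs));
  for (const variant of variants) {
    for (const facet of toFacetStrings(variant.attrs)) all.add(facet);
  }
  return [...all];
}

// Made-up sample data shaped like the item and variant models.
const item = { attrs: [{ name: "Heel Height", value: "High (2-1/2 to 4 in.)" }] };
const variants = [
  { attrs: [{ name: "Shoe Size", value: "6" }] },
  { attrs: [{ name: "Shoe Size", value: "6.5" }] },
  { attrs: [{ name: "Shoe Size", value: "6" }] },
];

const summaryAttrs = buildSummaryAttrs(item, variants);
console.log(summaryAttrs);
```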
47. Search - Using Solr
Defining the schema in schema.xml
<fields>
<!-- some of the core fields -->
<field name="_id" type="string" indexed="true" stored="true" />
<field name="name" type="text_general" indexed="true" stored="true" />
<field name="cat" type="string" indexed="true" stored="true" />
<field name="price" type="float" indexed="true" stored="true"/>
<!-- the full text to index -->
<field name="desc.0.val" type="text_general" indexed="true" stored="true"/>
<!-- dynamic attributes for faceting -->
<dynamicField name="attrs.*" type="string" indexed="true" stored="true"/>
<!-- some Solr specific fields -->
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<dynamicField name="*" type="ignored" multiValued="true"/>
</fields>
48. Search - Using Solr
Starting up the connector
> mongo-connector
-m ec2-54-80-63-229.compute-1.amazonaws.com:27017 // the mongo
-t http://localhost:8983/solr // the solr
-d mongo_connector/doc_managers/solr_doc_manager.py
-n "catalog.summary" // target summary collection
--auto-commit-interval=60 // commit every 1 min
…
> Keep it running, it will just stream the Oplog
49. Search – Using Solr
Document in Solr looks like:
{ "desc.0.val": "Our classic \"Flying Duck\" styled as a ...",
  "name": "Drake Waterfowl Duck Label SS T-Shirt Army Green",
  "attrs.1": "brand=Drake Waterfowl",
  "attrs.0": "style=t-shirts",
  "cat": "/84700/1200000239/1282094207/1200000817",
  "_id": "SPM10823491916",
  "_version_": 1479173524477182000,
  "timestamp": "2014-09-13T23:09:59.782Z"
}
Lists are flattened, which makes them difficult to use
> Must use named fields to implement facets
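The flattened field names above ("desc.0.val", "attrs.0", "attrs.1") follow a simple dotted-path scheme. The sketch below reproduces that scheme for plain nested objects and arrays, to show why a list of facets ends up as separate numbered fields. It is an illustration only, not the Solr doc manager's actual code, and the flatten helper is hypothetical.

```javascript
// Flatten nested objects and arrays into a single level of dotted keys.
// Array elements use their index as the key segment.
// Only plain objects, arrays and primitive values are handled.
function flatten(value, prefix = "", out = {}) {
  if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

const flat = flatten({
  _id: "SPM10823491916",
  desc: [{ lang: "en", val: "..." }],
  attrs: ["style=t-shirts", "brand=Drake Waterfowl"],
});
console.log(flat);
```

Each facet lands in its own field ("attrs.0", "attrs.1"), so a facet query cannot target one stable field name, which is the difficulty the slide points out.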
50. Search – Using Elastic Search
Let's use Elastic Search…
52. Search - Using Elastic Search
Elasticsearch understands the whole document right off the bat
Just need to tell ES not to tokenize the facets:
$ curl -XPOST localhost:9200/largecat3.summary -d '{
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "string" : {
      "properties" : {
        "attrs" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}'
("string" is the name of the default mapping type)
> Everything else is indexed auto-magically!
53. Search - Using Elastic Search
Starting up the connector
> mongo-connector
-m ec2-54-80-63-229.compute-1.amazonaws.com:27017 // the mongo
-t http://localhost:9200 // the ES
-d mongo_connector/doc_managers/elastic_doc_manager.py
-n "catalog.summary" // target summary collection
--auto-commit-interval=60 // commit every 1 min
…
> Keep it running, it will just stream the Oplog
55. Search – Using MongoDB Indexing
How about MongoDB's indexes and Full-Text-Search?
56. Search – Using MongoDB indexing
The summary contains:
• Department, e.g. "Shoes"
• Fields to index
– Category path, e.g. "Shoes/Women/Pumps"
– Price
– List of Item Attributes, e.g. Brand = Guess
– List of Variant Attributes, e.g. Color = red
• Fields not to index
– List of Item Secondary Attributes, e.g. Style = Designer
– List of Variant Secondary Attributes, e.g. heel height = 4.0
57. Search - Using MongoDB indexing
• Get summary from item id
db.variation.find( { _id: "p301671" } )
• Get summary's specific variation from SKU
db.variation.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )
• Get summary by department, sorted by rating
db.variation.find( { department: "Shoes" } ).sort( { rating: 1 } )
• Get summary with mix of parameters
db.variation.find( { "department": "Shoes",
                     "vars.attrs": { "color": "Gray" },
                     "category": /^\/Shoes\/Women\//,
                     "price": { "$gte": 65.99, "$lte": 180.99 } } )
58. Search – Using MongoDB indexing
• The following indices are used:
– department + attr + category + _id
– department + vars.attrs + category + _id
– department + category + _id
– department + price + _id
– department + rating + _id
• _id used for pagination
• Can take advantage of index intersection
• With several attributes specified (e.g. color=red and size=6), which one is looked up?
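The slide only says that _id is used for pagination. One common reading, assumed here, is keyset pagination: sort on the display field with _id as a tie-breaker, then ask for everything after the last row of the previous page. The nextPageQuery helper and the sample values are hypothetical.

```javascript
// Build the filter for the page that follows `last`, for a listing sorted by
// sortField descending and _id ascending. `last` is the final row of the
// previous page, or null for the first page.
function nextPageQuery(baseFilter, sortField, last) {
  if (!last) return { ...baseFilter };
  return {
    ...baseFilter,
    $or: [
      // strictly lower sort value: always after the previous page
      { [sortField]: { $lt: last.value } },
      // same sort value: continue in _id order
      { [sortField]: last.value, _id: { $gt: last._id } },
    ],
  };
}

const sortSpec = { rating: -1, _id: 1 };
const firstPage = nextPageQuery({ department: "Shoes" }, "rating", null);
const secondPage = nextPageQuery({ department: "Shoes" }, "rating", {
  value: 4.5,
  _id: "p301671",
});
// In the mongo shell: db.variation.find(secondPage).sort(sortSpec).limit(pageSize)
console.log(JSON.stringify(secondPage));
```

The _id tie-breaker gives the deterministic ordering that pagination needs, and it lines up with the department + rating + _id index listed above.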
59. Search – Using MongoDB indexing
Facet samples:
{ "_id" : "Accessory Type=Hosiery" , "count" : 14}
{ "_id" : "Ladder Material=Steel" , "count" : 2}
{ "_id" : "Gold Karat=14k" , "count" : 10138}
{ "_id" : "Stone Color=Clear" , "count" : 1648}
{ "_id" : "Metal=White gold" , "count" : 10852}
Single operation to insert / update (upsert: true, multi: false):
db.facet.update( { _id: "Accessory Type=Hosiery" },
                 { $inc: { count: 1 } }, true, false )
The facet with the lowest count is the most restrictive…
It should come first in the $all query!
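The ordering advice above can be sketched as a small helper that sorts the requested facets by their stored counts before building the $all clause. The helper name, the attrs field name, and the choice to put facets with unknown counts last are assumptions; the counts are the sample values from the slide.

```javascript
// Order requested facets from lowest to highest count, then build the filter.
// Facets with no recorded count are placed last.
function buildAttrsFilter(requested, countsById) {
  const countOf = (facet) =>
    countsById.has(facet) ? countsById.get(facet) : Number.MAX_SAFE_INTEGER;
  const ordered = [...requested].sort((a, b) => countOf(a) - countOf(b));
  return { attrs: { $all: ordered } };
}

// Counts as they would be read from the facet collection (sample values).
const facetCounts = new Map([
  ["Gold Karat=14k", 10138],
  ["Stone Color=Clear", 1648],
  ["Metal=White gold", 10852],
]);

const attrsFilter = buildAttrsFilter(
  ["Metal=White gold", "Gold Karat=14k", "Stone Color=Clear"],
  facetCounts
);
console.log(JSON.stringify(attrsFilter));
```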
60. Search – Comparing Solutions
• Search engine advantages:
– Index size (~ 10x smaller than MongoDB's)
– Indexing speed
– Read speed, integrated cache
– Support for all languages
– Built-in faceted search, which includes facet counts
• MongoDB indexing advantages:
– Built into the data store, no additional server / software needed
– Single query to get the results
– Can filter down the variant entry and save computing
> Winner here is Elasticsearch