SPEAKERS
Spark Summit 2015e
Finding Shoe Stores in more than 100k Merchants: Using Apache Spark to group all things!
Solmaz Shahalizadeh (Shopify)
ABSTRACT:
Shopify is the world’s fastest growing e-commerce platform with more than 120k active stores and surpassing $5,000,000,000 in Gross Merchandise Volume. At Shopify, we help emerging small businesses get off the ground and grow into successful companies. Individual businesses can sell a variety of products and services, both online and offline in a single store. The data team at Shopify provides business intelligence (BI) reporting as well as data-driven insights using machine learning and statistical analysis that go beyond BI reports to help merchants be more successful. I joined Shopify’s data team a year ago and what started as a personal project for me, finding stores that sell shoes, turned into a bigger problem: What types of products are sold by our stores? The question became critical when we needed to select eligible stores for a business partnership and our data warehouse kept timing out running data mining queries. During the same time we started using Spark technology and moved our operational data to HDFS. We were able to use the power of distributed computing to mine through millions of records of data: Spark was able to process 67 million records while I went to get a fresh cup of coffee. We were able to successfully categorize our stores based on their products and get the list of partnership-eligible stores. Later on, using Spark’s machine learning library, together with classification data obtained through Amazon’s Mechanical Turk, we were able to form clusters of similar stores and provide industry specific success metrics to them. The ease of use and accessibility of Spark has made me an avid fan and I’d like to share this data journey and the experience that exceeded my expectations.
Using Apache Spark to group all things: Finding Shoe Stores in more than 100k Merchants
1. Finding Shoe Stores in
>100k Merchants: Using
Spark to Group All Things
1
Solmaz Shahalizadeh (@solmaz_sh)
Shopify
2. About me
Currently:
• Finance Data @Shopify
Previous Lives:
• Playing with data in Finance/ Bioinformatics/ Cancer
Research
3. Will Talk about:
• Finding a needle in a haystack
• Trying all the wrong tools for getting insights out of
data and course correction on the way
• Having fun during the process
5. Where are the shoes?
• Started ~ 1 year ago @Shopify
• Wanted desperately to buy
something shoes from our merchants
6. A bit about Shopify stores
Merchants can sell different
kinds of products in a single store
7. More than 60M products
We can give each person in Ottawa 67 products
8. A bit about Shopify stores
There is freedom of speech in
describing a product
9. Too much text to process
Product name, description, vendor, etc.
7,400
13. Problems:
• No distributed data
• Not enough processing power
• Not very smart filters
• Not many examples of actual stores selling shoes
14. Shopify + Pinterest
“Can you create a whitelist of eligible stores for a
collaboration with Pinterest, stores not selling
ammunition, adult material, cigarettes, etc.? It keeps
timing out for me, but here is the SQL query that
needs to be run.”
15. • SELECT shops.domain FROM customers join
shopify.products products on customers.shop_id =
products.shop_id where ( description not ilike
'%ammunition%' and description not ilike
'%cigarette%' and description not ilike '%bong%' and
description not ilike '%ecstasy%' and description not
ilike '%heroin%' and description not ilike '%opium%’
and description not ilike '%cocaine%' and description
not ilike '%amphetamine%' and description not ilike
'%mdma%' and description not ilike '%ghb%' and
description not ilike '%ketamine%' and description not
ilike '%pcp%' and description not ilike '%LSD%' and
description not ilike '%steroid%' and description not
ilike '%mescaline%' and description not ilike
'%vaporizer%' and description not ilike '%hashish%'
and description not ilike '%nicotine%' and description
not ilike '%viagra%' and description not ilike '%cialis%'
and description not ilike '%THC%' and description not
17. Distribute Code and Data
• Get data in distributed file system
• Use better-than-sql tools for analysis
20. Something was still missing
We had filtered around 30k merchants: too much!!
“The Buckshots, the world’s first ammunition for your
thighs”
21. Mechanical Turk
The Amazon’s Mechanical Turk (MTurk) is a crowdsourcing
market place that enables individuals or businesses to co-
ordinate the use of human intelligence to perform the tasks
that computers are currently unable to do.
• classification
• sentiment analysis
• data cleaning
22. Lets try this!
• Show the images of the 4 top selling products of
each store to Turkers
• Allow for selection of multiple categories
29. Finding “Similar” stores
Clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a
cluster) are more similar (in some way or another) to
each other than to those in other groups (clusters).