Finding Shoe Stores in
>100k Merchants: Using
Spark to Group All Things
Solmaz Shahalizadeh (@solmaz_sh)
Shopify
About me
Currently:
• Finance Data @Shopify
Previous Lives:
• Playing with data in Finance / Bioinformatics / Cancer Research
Will Talk about:
• Finding a needle in a haystack
• Trying all the wrong tools for getting insights out of
data, and course-correcting along the way
• Having fun during the process
Using Apache Spark to group all things: Finding Shoe Stores in more than 100k Merchants
Where are the shoes?
• Started ~ 1 year ago @Shopify
• Wanted desperately to buy shoes from our merchants
A bit about Shopify stores
Merchants can sell different
kinds of products in a single store
More than 60M products
We can give each person in Ottawa 67 products
A bit about Shopify stores
There is freedom of speech in
describing a product
Too much text to process
Product name, description, vendor, etc.
7,400
Shoe Mining
It's going to be a big win and so much FUN!!!
It was a total success
It was a total failure
Problems:
• No distributed data
• Not enough processing power
• Not very smart filters
• Not many examples of actual stores selling shoes
Shopify + Pinterest
“Can you create a whitelist of eligible stores for a
collaboration with Pinterest, stores not selling
ammunition, adult material, cigarettes, etc.? It keeps
timing out for me, but here is the SQL query that
needs to be run.”
SELECT shops.domain
FROM customers
JOIN shopify.products products ON customers.shop_id = products.shop_id
WHERE (
  description NOT ILIKE '%ammunition%'
  AND description NOT ILIKE '%cigarette%'
  AND description NOT ILIKE '%bong%'
  AND description NOT ILIKE '%ecstasy%'
  AND description NOT ILIKE '%heroin%'
  AND description NOT ILIKE '%opium%'
  AND description NOT ILIKE '%cocaine%'
  AND description NOT ILIKE '%amphetamine%'
  AND description NOT ILIKE '%mdma%'
  AND description NOT ILIKE '%ghb%'
  AND description NOT ILIKE '%ketamine%'
  AND description NOT ILIKE '%pcp%'
  AND description NOT ILIKE '%LSD%'
  AND description NOT ILIKE '%steroid%'
  AND description NOT ILIKE '%mescaline%'
  AND description NOT ILIKE '%vaporizer%'
  AND description NOT ILIKE '%hashish%'
  AND description NOT ILIKE '%nicotine%'
  AND description NOT ILIKE '%viagra%'
  AND description NOT ILIKE '%cialis%'
  AND description NOT ILIKE '%THC%'
  AND description NOT
Distribute Code and Data
• Get data in distributed file system
• Use better-than-SQL tools for analysis
Spark versus The World
Something was still missing
We had filtered around 30k merchants: too many!!
“The Buckshots, the world’s first ammunition for your
thighs”
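The false positive quoted above is easy to reproduce. The following is a minimal sketch, not the code used at Shopify: the keyword list is a shortened, hypothetical subset of the SQL query's terms, and the product descriptions other than the Buckshots quote are made up for illustration. It compares a substring match (what `ILIKE '%keyword%'` does) with a stricter whole-word match.

```python
import re

# Hypothetical, shortened subset of the keywords from the SQL query.
BLOCKED_KEYWORDS = ["ammunition", "cigarette", "bong", "thc"]

def substring_flag(description: str) -> bool:
    """Mimic `description ILIKE '%keyword%'`: case-insensitive substring match."""
    text = description.lower()
    return any(keyword in text for keyword in BLOCKED_KEYWORDS)

def whole_word_flag(description: str) -> bool:
    """Stricter variant: a keyword must appear as a whole word."""
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    return any(keyword in words for keyword in BLOCKED_KEYWORDS)

# The first description is the quote from the slide; the others are invented.
descriptions = {
    "buckshots": "The Buckshots, the world's first ammunition for your thighs",
    "bongo": "Hand-made bongo drums",
    "shoes": "Leather running shoes",
}

substring_hits = {name: substring_flag(text) for name, text in descriptions.items()}
whole_word_hits = {name: whole_word_flag(text) for name, text in descriptions.items()}

print(substring_hits)
print(whole_word_hits)
```

Whole-word matching clears the bongo drums, but the Buckshots description is flagged either way because the keyword is used figuratively. Keyword filters cannot see context, which is the gap that human labeling is meant to fill.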
Mechanical Turk
Amazon Mechanical Turk (MTurk) is a crowdsourcing
marketplace that enables individuals or businesses to
coordinate the use of human intelligence to perform tasks
that computers are currently unable to do.
• classification
• sentiment analysis
• data cleaning
Let's try this!
• Show the images of the 4 top selling products of
each store to Turkers
• Allow for selection of multiple categories
Cleaning MTurk responses
• Otter Press http://www.otterpress.com.au/
Summarizing responses
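The slides do not say how the responses were summarized. One common approach, shown here purely as an assumption, is a majority vote: for each shop, keep every category that more than half of the workers selected. The shop names, worker IDs, categories and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical responses: (shop, worker, categories the worker selected).
responses = [
    ("shop-a", "w1", {"shoes", "clothing"}),
    ("shop-a", "w2", {"shoes"}),
    ("shop-a", "w3", {"shoes", "jewelry"}),
    ("shop-b", "w1", {"electronics"}),
    ("shop-b", "w2", {"toys"}),
    ("shop-b", "w3", {"electronics"}),
]

def summarize(responses, min_agreement=0.5):
    """Per shop, keep each category chosen by more than `min_agreement` of its workers."""
    votes = defaultdict(Counter)   # shop -> category -> number of workers who chose it
    workers = defaultdict(set)     # shop -> workers who answered for it
    for shop, worker, categories in responses:
        workers[shop].add(worker)
        votes[shop].update(categories)
    return {
        shop: sorted(
            category
            for category, count in counts.items()
            if count / len(workers[shop]) > min_agreement
        )
        for shop, counts in votes.items()
    }

labels = summarize(responses)
print(labels)
```

Because workers may select multiple categories, a shop can end up with several labels, or with none if the workers disagree completely.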
From 30k to 10k shops
Finding “Similar” stores
Clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a
cluster) are more similar (in some way or another) to
each other than to those in other groups (clusters).
Clustering with Spark MLlib
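The slides do not name the clustering algorithm. As an illustration of the definition above, here is a minimal k-means sketch (k-means is one of the algorithms MLlib provides) written in plain Python rather than against the Spark API, so it runs without a cluster. The shops, the feature vectors (share of a shop's products in three categories) and the initial centroids are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return assignments, centroids

# Hypothetical features: (share of shoes, share of clothing, share of electronics).
shops = {
    "shop-a": (0.9, 0.1, 0.0),
    "shop-b": (0.8, 0.2, 0.0),
    "shop-c": (0.0, 0.1, 0.9),
    "shop-d": (0.1, 0.0, 0.9),
}

points = list(shops.values())
assignments, centroids = kmeans(points, centroids=[points[0], points[2]])
clusters = dict(zip(shops, assignments))
print(clusters)
```

With these toy vectors the two shoe-heavy shops land in one cluster and the two electronics-heavy shops in the other. On real data the choice of features and of the number of clusters matters far more than the algorithm itself.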
Some cool clusters
Comparison with Peers
Finally bought some shoes
Lessons Learned
• Need data: use Mechanical Turk
• Need processing power: Spark is easy to get started with
• Happy Hacking!
Questions?
Thank you for listening
Feel free to ping me @solmaz_sh
