Analytics @ Wish
Powered by Fluentd & MongoDB
Hi
I’m Adam.
Wish ♥︎ MongoDB
• Primary database since 2011
• 67x mongod
• AWS → bare metal (SSDs
ftw!)
What’s Wish?
• Mobile eCommerce
• 30M+ users
worldwide
• Top 10 iOS & Android
Experiment
‘cause otherwise you’re just
guessing…
Hypothesis
“Billing Zip” is confusing outside
America
Data
Compare checkout conversions
for international, Android users
Conclusion
~7% boost in mobile sales
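The comparison behind this conclusion can be sketched in a few lines. This is only an illustration of the arithmetic, not Wish's reporting code: the helper names and every count below are hypothetical, picked so the numbers land near the figure on the slide.

```python
def conversion_rate(purchases: int, users: int) -> float:
    """Fraction of users in a bucket who completed checkout."""
    if users == 0:
        return 0.0
    return purchases / users


def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Relative change of treatment over control (0.07 means +7%)."""
    if control_rate == 0:
        raise ValueError("control rate is zero; lift is undefined")
    return (treatment_rate - control_rate) / control_rate


# Hypothetical counts for international Android users in each bucket.
control = conversion_rate(purchases=1000, users=50000)
treatment = conversion_rate(purchases=1070, users=50000)
lift = relative_lift(control, treatment)
print(f"lift: {lift:.1%}")
```

Guarding the zero cases up front keeps an empty bucket from turning into a division-by-zero error halfway through a report.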
Goal
Frictionless analytics to everyone
{“solution”:
[“logging”,
“aggregation”,
“analysis”,
“serving”]
}
Request Logs = Source of Truth
{'contest_impressions':'53060fbd34067e4d6cee70f4,535ad13a7360465e2ca799f8,528b714df689996fdb574800,52597
6a71c23882ab3b73ecb,5285df6db5baba737f459037,5208ae7d3deaf74a6cc65da4,5209e5c31c238861a1ab91cc,5285df6db
5baba735f459061,51f7778f3ba3770a514a5431,527be1fc227d210d2bcdeac5,532fcfe3796f6832713b5c3a,527be203227d2
10dd5cdeaac,52d3ef2806ea960dde85cb97,527bc781227d210d8acdea47,527bc793227d210d4fcdea48,
5208ad653deaf74a4bc65d41,5208acdd1c238846f9ab9028,5182fc1273c67621e507591b,5311ae6c796f68283f8f86c3,
52de2bf4ab980a2d00da786a,5208a9c53deaf74a75c65c6b,52eca45a717951350382e4be,52d3ef73bb5aa51ccf866c01,
533d6fae5aefb0427771f346,5285df6db5baba734d45901b,51c27d8d5ffe8f0b0b9b0359,52d0e002a30fb227725b6e06,
52f71bd89f5ef741d8f34698,52d3ef71bb5aa53135866d76, 5308bc467360464265101ed9,52d3ef27bb5aa5024d866c09,
52c399d60599170e49fd866e,5209be541c23886177ab91db,5208b15e1c2388615fab91b7', '_country_code': u'CA',
'_lang': u'en', '_fb_uid': 500406911, '_device_id': None, '_uid': '4eb346049b120f09f60007c0', '_tid': 2,
'_host': 'adam.corp.contextlogic.com', '_last_id': u'cc3aa96b2b3c45bca11009edc049f2f6',
'_experiment_tags': ['mobile_commerce_home_v4_female_ignore', 'mobile_large_cart_cell_ignore',
'hannibal_cohort_firsttime_buyer_ignore', 'localize_product_names__fr_ignore',
'mobile_cart_guarantee_view_ignore', 'mobile_related_tags_v2_ignore', 'shipping_price_us_ignore',
'stripe_settle_on_ship_control', 'related_super_feed_iphone_show-v4',
'mobile_commerce_home_v3_male_i18n_show', 'braintree_settle_on_ship_control',
'mobile_show_tabbed_billing_page_i18n_ignore', 'mobile_new_guarantee_text_ios_ignore',
'mobile_use_category_signup_flow_i18n_ignore', 'male_curated_first_ipad_ignore',
'mobile_commerce_home_v4_female_i18n_ignore', 'commerce_product_page_show',
'mobile_use_category_signup_flow_v3_ignore', 'mobile_save_for_price_us_female_relaunch_2_ignore',
'web_stripe_checkout_ignore', 'mobile_show_tabbed_billing_page_us_ignore', 'stripe_checkout_show',
'shipping_price_i18n_fixed-price-promo', 'chukou1_pilot_experiment_ignore',
'mobile_implicit_ratings_v1_show', 'feed_commerce_2_control', 'mobile_commerce_home_v3_male_ignore',
'swap_out_male_feed_show-weight-deep', 'related_super_feed_ipad_ignore',
'female_curated_first_iphone_ignore', 'mobile_psuedo_localized_currency_show',
'hannibal_cohort_repeat_buyer_ignore', 'web_boleto_checkout_ignore', 'exploration_v2_control',
'female_curated_first_android_ignore', 'male_curated_first_android_ignore',
'related_super_feed_android_show-v4', 'curated_feed_female_shopping_ignore',
'mobile_localized_currency_control', 'male_curated_first_iphone_ignore',
'mobile_show_required_shipping_fields_ignore', 'mobile_ct2_variable_shipping_price_showcountry',
'mobile_c2c_ignore', 'localize_product_names__es_ignore', 'related_products_v2_control',
'female_curated_first_ipad_ignore', 'mobile_categories_v1_ignore', 'related_super_feed_show',
'mobile_baby_category_signup_flow_ignore', 'mobile_checkout_offer_v2_control',
'mobile_minimum_notification_interval_ignore', 'mobile_show_tabbed_feed_existing_user_ignore',
'mobile_cart_fake_only_x_left_show', 'late_shipment_apology_v2_ignore',
'mobile_show_tabbed_feed_new_user_ignore'], '_app_type': 0, 'impression_feed_category': None, '_client':
'web', '_refer_url': None, 'sort': 'recommended', '_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36', '_arguments': {},
'_currency': 'CAD', '_protocol': 'http', 'offset': 0, '_method': 'GET', 'count': 40, '_locale': 'en',
'_timestamp': 1401996333, '_bsid': '979b5fbcad4f4fdbb1477ae7ba8ed123', '_is_cached': False, '_version':
None, '_response_status': 200, 'filter': 'all', '_response_time': 0.2887430191040039, '_uri': '/',
'_remote_ip': None, '_is_user_pending': False, '_id': '1e6135e3d2eb4214afdbd99456d71183'}
A feed request…
{
'products_shown': '...',
'feed_category': null,
'sort': 'recommended',
'filter': 'all',
'offset': 0,
'count': 40,
'_uid': '4eb34609ff60007c0',
'_client': 'web',
'_country_code': 'CA',
'_id': '1e6135e3d9456d7183',
'_last_id': 'cc39edc49f2f6',
'_experiment_tags': [...],
'_uri': '/',
'_refer_url': null,
'_arguments': {},
'_method': 'GET',
'_locale': 'en',
'_response_status': 200
}
One problem
Searching all requests ever is slow
Transaction Log
{'txn_id': '5390c295e9b9bbe68b2',
'user_id': '4eb346049b9f60007c0',
'total': 18.0,
'shipping': 2.0,
'items': [{
'product_id': '537b42379b9e3f55f',
'qty': 1,
'price': 16.0 }]
}
{“solution”:
[“logging”,
“aggregation”,
“analysis”,
“serving”]
}
Centralize Logs
• Synchronously?
• Fire & forget?
• fluentd!
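The trade-off in these bullets can be sketched as follows: a synchronous logger makes every request wait for the log write, while a fire-and-forget logger hands the event to a buffer and returns at once, accepting that events may be dropped when the buffer is full. The sketch uses only the Python standard library. `FireAndForgetLogger` and its `sink` callable are hypothetical names; `sink` stands in for whatever actually ships events (for example, a local Fluentd agent), and none of this is Fluentd's API.

```python
import queue
import threading


class FireAndForgetLogger:
    """Buffers events in memory and ships them from a background thread."""

    def __init__(self, sink, max_buffered=1000):
        self._sink = sink  # callable(tag, record) that ships one event
        self._queue = queue.Queue(maxsize=max_buffered)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, tag, record):
        """Never blocks the caller; drops the event if the buffer is full."""
        try:
            self._queue.put_nowait((tag, record))
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: stop the worker
                break
            try:
                self._sink(*item)
            except Exception:
                pass  # shipping errors never reach the application

    def close(self):
        """Ship whatever is still buffered, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


# Usage with an in-memory sink standing in for the real transport.
shipped = []
log = FireAndForgetLogger(sink=lambda tag, record: shipped.append((tag, record)))
log.emit("web.request", {"_uri": "/", "_response_status": 200})
log.close()
```

The design choice is that `emit` never raises and never waits, so a slow or broken log pipeline cannot slow down request handling; the cost is that `dropped` can grow under load.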
Architecture
App server (Wish + fluentd) → Aggregation servers (fluentd) → Hadoop/Hive
{“solution”:
[“logging”,
“aggregation”,
“analysis”,
“serving”]
}
Hadoop & Hive
• Great for log analysis
• Arbitrary queries
• No schema design constraints
Hadoop & Hive
• Running a Hadoop cluster sucks
– TreasureData’s managed Hive solution rocks!
{“solution”:
[“logging”,
“aggregation”,
“analysis”,
“serving”]
}
MongoDB!
• Analysis results → MongoDB
• Store all combinations
– Unsexy, but fast
– 2 TB total
Schema
{"_id": ObjectId(…),
"click_id": 2,
"source_page_id": 1000,
"count": 20171,
"timestamp": 20140601,
"gender": "Male",
"client": "Android",
"country": "CA",
"experiment_tag": "zip_help_text-show"}
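One reading of "store all combinations" is that a count is precomputed for every combination of dimension values, so serving a report is a plain lookup rather than a scan. A minimal sketch of that idea follows. In the pipeline described here the analysis step runs in Hadoop/Hive; plain Python is used only for illustration. `DIMENSIONS` mirrors the field names in the schema, while the `aggregate` helper and the sample events are hypothetical.

```python
from collections import Counter

DIMENSIONS = ("click_id", "source_page_id", "timestamp",
              "gender", "client", "country", "experiment_tag")


def aggregate(events):
    """Count events per unique combination of dimension values."""
    counts = Counter(tuple(e[d] for d in DIMENSIONS) for e in events)
    # One document per combination, shaped like the schema above.
    return [dict(zip(DIMENSIONS, key), count=n) for key, n in counts.items()]


# Three identical hypothetical events collapse into one document.
events = [
    {"click_id": 2, "source_page_id": 1000, "timestamp": 20140601,
     "gender": "Male", "client": "Android", "country": "CA",
     "experiment_tag": "zip_help_text-show"},
] * 3

docs = aggregate(events)
```

The number of documents grows with the product of distinct values per dimension, which is the storage cost the slide trades for fast reads.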
Let’s Review
Logs (app servers) → Fluentd → Hadoop/Hive → MongoDB
Tools
Who doesn’t love nifty graphs?
Dashy
• Graphing dashboard
Perimeter
• A/B test reports
– Summary tables,
detailed CSVs
– See trade-offs
Analytics = faster iteration
More growth, more revenue
Analytics = faster iteration
Powered by Fluentd & MongoDB
Happy Analyzing!
adam@wish.com
{“subtitle”:”Why Fluentd?”}
http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
Acquire Data (or
so you think)
WUT!? Invalid
UTF8?
Fix the encoding
issue…
Yell at the
engineers
Some columns
are missing!?
Run the
script…DIVISION
BY ZERO!!!
Hmm…
Logging.priority
=> :not_super_high
Analytics.priority
=> :very_high
Analytics.needs? :logs
=> true
{“subtitle”: ”Overview”,
“has_code”: true,
“has_example”: true}
127.0.0.1 - - [05/Feb/2012:17:11:55
+0000] "GET / HTTP/1.1" 200 140 "-"
"Mozilla/5.0 (Windows NT 6.1; WOW64)
AppleWebKit/535.19 (KHTML, like Gecko)
Chrome/18.0.1025.5 Safari/535.19"
{
"host": "127.0.0.1",
"user": "-",
"method": "GET",
"path": "/",
"code": "200",
"size": "140",
"referer": "-",
"agent": "Mozilla/5.0 (Windows…"
}
Parse as JSON!
?
["05/Feb/2012:17:11:55","web.access",{
"host": "127.0.0.1",
"user": "-",
"method": "GET",
"path": "/",
"code": "200",
"size": "140",
"referer": "-",
"agent": "Mozilla/5.0 (Windows…"
}]
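The step from a raw access-log line to a tagged event can be sketched as below. This is a rough illustration under stated assumptions: the regular expression is a simplified pattern written for this example, not the one Fluentd's `apache2` format uses, and `to_event` is a hypothetical helper. The time is kept as the raw string from the log line, as the slide shows it.

```python
import re

# Simplified pattern for the combined log format (an assumption, not Fluentd's regex).
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def to_event(line, tag):
    """Turn one access-log line into a [time, tag, record] triple."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("line does not look like a combined-format entry")
    record = match.groupdict()
    time = record.pop("time")  # time travels next to the record, not inside it
    return [time, tag, record]


line = ('127.0.0.1 - - [05/Feb/2012:17:11:55 +0000] "GET / HTTP/1.1" 200 140 "-" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) '
        'Chrome/18.0.1025.5 Safari/535.19"')
event = to_event(line, "web.access")
```

Keeping the tag outside the record is what lets later stages route the event without inspecting its contents.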
?
web.mongodb
web.file
web.hdfs
web.s3
web.mysql
<source>
type tail
path /var/log/apache/access.log
tag web.access
format apache2
</source>
Apache log
Fluentd
<source>
type tail
path /var/log/apache/access.log
tag web.access
format apache2
</source>
<match web.access>
type mongo
user kiyoto
password heartbleed
database web
collection access
… # host, port, etc.
</match>
Apache log
Fluentd
MongoDB
<match web.access>
type copy
<store>
type mongo
user kiyoto
password heartbleed
database web
collection access
… # host, port, etc.
</store>
<store>
type s3
… # aws secret, bucket, etc.
</store>
</match>
Apache log
Fluentd
MongoDB S3
{“subtitle”: ”scalability”}
• Automate
monitoring!
• App and System
metrics
• JSON
everywhere
• 2000+ node
• ~1B events/day
• Forwarder-
Aggregator
{“subtitle”: ”Demo”,
“need”: “Demo Karma”}
<source>
type mongostat
uri "172.17.0.2"
</source>
<match mongostat.*.*>
type mongo
user kiyoto
password heartbleed
database web
collection access
… # host, port, etc.
</match>
Fluentd
MongoDB
MongoDB
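The `<match mongostat.*.*>` block selects events by tag. In Fluentd match patterns, `*` matches exactly one dot-separated tag part, and that rule can be sketched as follows. The sketch is deliberately partial: it handles only literal parts and `*`, and leaves out the other pattern forms such as `**`. `tag_matches` is a hypothetical helper and the example tags are made up.

```python
def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if every dot-separated part matches; '*' matches any one part."""
    pattern_parts = pattern.split(".")
    tag_parts = tag.split(".")
    if len(pattern_parts) != len(tag_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, tag_parts))


# Hypothetical tags: three parts match the pattern, two parts do not.
three_parts = tag_matches("mongostat.*.*", "mongostat.host1.insert")
two_parts = tag_matches("mongostat.*.*", "mongostat.host1")
```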
Build your own *MS!
{
“install”: “gem install fluentd”,
“website”: “www.fluentd.org”,
“github” : “fluent/fluentd”,
“twitter”: “@fluentd”
}

Real Time Data Analytics with MongoDB and Fluentd at Wish

Editor's Notes

  1. Hey everyone! Today I’m going to talk about how we use Fluentd and MongoDB to power analytics @ Wish. After I’m done, Kiyoto, developer evangelist at Treasure Data and a maintainer of Fluentd, is going to talk about how easy it is to fit into your architectures.
  2. So, I’m Adam. I run infrastructure & operations at Wish and I’ve been responsible for our MongoDB deployment since day 1. Like most of you, my background is in development. Back in our cool, pre-launch startup days, I was a backend developer. Then we launched. Suddenly someone had to run production. I volunteered. Now we’re 3 years & 30M users later and I know way too much about MongoDB. So that’s my story. Let’s talk about Wish.
  3. MongoDB has been our primary database since day 1 back in 2011. And I’m here, 30 million users later. So, apparently it’s going pretty well so far. A little bit about our infrastructure: we run 67 mongods and recently moved the DB from AWS to an AWS/bare metal hybrid, mostly because SSDs rock.
  4. Wish is a mobile eCommerce platform. We use personalization technology to give users a feed of relevant products and help them discover cool products at great prices. We have a top 10 app on both iOS and Android and do around in revenue with over 2 million products for sale. [ITERATE, mention 100s of M GMV]
  5. So, how did we get there? Just about every product change at Wish starts life as an experiment. This does 2 things: it lets us understand the impact of our decisions more rigorously, and it helps us build better intuition about what to do next. Let’s take an example…
  6. Here’s our billing info page in Android. Pretty standard. Give us your CC# and billing zip. Well, we looked at the data and international users were dropping off here at a strangely higher rate than American ones. How can we improve that?
  7. We had a hypothesis: maybe “billing zip” is confusing to our non-American users.
  8. So, here’s that page again.
  9. We made this change as an A/B test. 50% of Android users saw this version with 1 sentence explaining billing zip. Trivial change.
  10. Now – how do we know if our clever hypothesis was actually true? Well, we need data. Specifically, checkout conversions for international, Android users in that experiment. And we need a way to get it easily. Let’s see what our systems come up with.
  11. So, this is a report from one of the tools I’ll talk about later. It shows the impact of this A/B test over core actions. Most changes are tradeoffs, so it’s important to understand those dynamics. Mostly green, a little red. Number of users buying went up. Profile views went down a bit. I guess users were so busy buying things they forgot to look at profiles?
  12. Thanks to this data, we know we got a 7% boost in sales. For 15 minutes of work. That’s pretty awesome. And doing this hundreds of times is how we grew.
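A "boost" like this is a relative lift: the treatment bucket's conversion rate compared with the control bucket's. Here is a minimal sketch of that arithmetic. The function names and the user and purchase counts are invented for illustration; they are not figures from the experiment.

```python
def conversion_rate(purchases: int, users: int) -> float:
    """Fraction of users in a bucket who completed checkout."""
    if users == 0:
        raise ValueError("bucket has no users")
    return purchases / users


def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Relative change of treatment over control (0.07 means +7%)."""
    if control_rate == 0:
        raise ValueError("control rate is zero")
    return (treatment_rate - control_rate) / control_rate


# Made-up counts, for illustration only.
control = conversion_rate(purchases=1000, users=50000)
treatment = conversion_rate(purchases=1070, users=50000)
lift = relative_lift(control, treatment)
print(f"lift: {lift:.1%}")
```

A real report would also check whether the difference is statistically significant before calling it a win; that step is left out here.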
  13. Wish is a data-driven company; virtually all decisions come from data. A lot of companies say they’re data-driven, but to do this rigorously is hard. Often, analytics are clunky, one-off, inflexible, annoying. Show of hands: how many of you have seen this? Analytics that are clunky won’t get used. It has to be frictionless.
  14. So: now we know what we want. Let’s shift gears to how. At a high level, it’s a pretty standard setup.
  15. Take application logs.
  16. Aggregate them and send to Hadoop.
  17. Use MapReduce to analyze.
  18. And store the result in MongoDB to serve user-facing apps.
  19. So, first: logging.
  20. A key idea of our analytics system is that our request logs are the main source of truth. If you log all the details of every HTTP request, you can pretty much reconstruct everything that’s happened in your app. We think of them sorta like the oplog. As proof of this: back in early 2012, a bad migration destroyed a bunch of data. We had to restore from backup, but those logs & some cobbled together tools let us replay basically everything we lost. Really powerful. That power lets us answer any question over any time range, even without knowing the question in advance. So, as our business and product evolve, we’re confident that we have all the data we need. Let’s take a look at these logs…
  21. This is the log of a request to get a product feed. Let’s zoom in…
  22. Here we have properties specific to that page. What products were shown? What category, what sort, what filters? What was count/offset in the feed?
  23. Now we have general app-level properties. User ID, web or mobile, country code. ID of the request, ID of their previous request. What experiment buckets they’re in. These properties let us really drill down into our analytics and get different views of data.
  24. Last, we have the HTTP things you’d expect in a request log. URI, referrer, arguments, method, locale, and response code. Request logs are great. They’re an easy guarantee that you’re not missing anything. But, one problem…
  25. We have around 500 billion entries in our request log. Compressed, it’s almost 20 TB. With Hadoop, we can get through it, but it’s needlessly expensive for common queries. Here, an important principle of schema design in MongoDB applies. Denormalize for performance. If we have something we know we’ll have to search a lot, we log it separately. Let’s take an example…
  26. Since we’re in eCommerce, transactions are pretty important and we do a lot of analytics on them. So, every transaction gets logged separately. This is the abridged version of a transaction log. Transaction ID, user ID. Total price, shipping cost. And a list of items the user bought. With this information, we can write cleaner, faster queries instead of trying to parse the information out of the HTTP request that actually completed the purchase.
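To make the two kinds of records concrete, here is a rough sketch of a request log entry next to a separately logged transaction. The underscore-prefixed field names and their values come from the sample log shown earlier; everything else (`build_request_log`, `build_transaction_log`, the HTTP field names, the `/api/feed` URI, the item layout, prices in cents) is an assumption made for illustration, not the actual Wish schema.

```python
import json

# App-level properties; these underscore-prefixed names appear in the sample log.
APP_FIELDS = ("_uid", "_country_code", "_lang", "_tid", "_last_id")


def build_request_log(page_props, app_props, http_props):
    """Flatten page-specific, app-level, and HTTP properties into one record."""
    missing = [f for f in APP_FIELDS if f not in app_props]
    if missing:
        raise ValueError(f"missing app-level fields: {missing}")
    return {**page_props, **app_props, **http_props}


def build_transaction_log(transaction_id, user_id, shipping_cents, items):
    """Separate, smaller record logged once per completed purchase."""
    subtotal = sum(i["price_cents"] * i["quantity"] for i in items)
    return {
        "transaction_id": transaction_id,
        "user_id": user_id,
        "shipping_cents": shipping_cents,
        "total_cents": subtotal + shipping_cents,
        "items": items,
    }


request_log = build_request_log(
    # Page-specific: what was shown, plus hypothetical paging fields.
    page_props={"contest_impressions": "53060fbd34067e4d6cee70f4", "offset": 0, "count": 30},
    app_props={
        "_uid": "4eb346049b120f09f60007c0",
        "_country_code": "CA",
        "_lang": "en",
        "_tid": 2,
        "_last_id": "cc3aa96b2b3c45bca11009edc049f2f6",
    },
    # Hypothetical HTTP field names and values.
    http_props={"uri": "/api/feed", "method": "GET", "status": 200},
)

transaction_log = build_transaction_log(
    transaction_id="t-1",  # hypothetical ID
    user_id="4eb346049b120f09f60007c0",
    shipping_cents=300,
    items=[{"product_id": "p-1", "quantity": 2, "price_cents": 1000}],
)

print(json.dumps(transaction_log, indent=2))
```

The transaction record duplicates information that could in principle be recovered from the request log. That duplication is the denormalization: common queries read a small, clean record instead of parsing the HTTP request that completed the purchase.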
  27. Now that we have thorough logs all over the place, let’s talk about how we get them to Hadoop for analysis.
  28. What are our options here? We could send them synchronously over the network to Hadoop. But, even if everything is fine, that costs RTT. And, what if the destination is down or slow or erroring out? Then we’d start impacting users and possibly dropping logs. Logging shouldn’t cause user impact. Next option: We could fire & forget with UDP. It’s fast, but unreliable and you lose logs if Hadoop goes down. Can’t drive business decisions on fundamentally unreliable data. Thankfully: fluentd solves both of these problems. It’s a fast, reliable buffer with flexible inputs & outputs. It scales linearly and is dead-simple to run. We’ve been using it since day 1 and have had no serious issues. What does this look like?
  29. We start in the app which generates the logs. The app synchronously logs to a fluentd running on the same host. There’s no network latency and the load on each local fluentd is trivial, so we’ve never had problems with these getting slow or crashing. The local fluentd buffers logs on disk for reliability. Periodically, it flushes those buffers to a host in our fluentd aggregation tier. These run active/active so scaling them is a breeze. It gives us an easy to monitor & manage conduit for our logs to flow through without imposing costs on the app. The aggregation servers periodically flush into Hadoop. As an added bonus, they also flush into S3 for backup. At every step we have a reliable buffer, so temporary problems at one stage won’t ripple up through the system.
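The buffer-then-flush idea at each stage can be sketched in a few lines. This is a toy model of the pattern, not Fluentd's implementation or API: `DiskBuffer`, `flaky_send`, and the record fields are hypothetical, and the sketch ignores concurrent writers.

```python
import json
import os
import tempfile


class DiskBuffer:
    """Append-only on-disk buffer that is flushed downstream in batches."""

    def __init__(self, path, send):
        self.path = path
        self.send = send  # callable taking a list of records; raises on failure

    def append(self, record):
        # A local file write: no network round trip on the request path.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def flush(self):
        """Try to send everything buffered; keep it on disk if sending fails."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        if not records:
            return 0
        try:
            self.send(records)
        except Exception:
            return 0  # downstream is down or slow: retry on the next flush
        os.remove(self.path)
        return len(records)


received = []
attempts = {"n": 0}


def flaky_send(records):
    """Stand-in for the aggregation tier: fails once, then accepts."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("aggregator unavailable")
    received.extend(records)


with tempfile.TemporaryDirectory() as tmp:
    buf = DiskBuffer(os.path.join(tmp, "buffer.log"), flaky_send)
    buf.append({"_uid": "u1", "uri": "/feed"})
    buf.append({"_uid": "u2", "uri": "/cart"})
    first = buf.flush()   # fails, records stay on disk
    second = buf.flush()  # succeeds, buffer is cleared
```

The app only ever pays for the local append. A failed flush leaves the records on disk, so an outage downstream delays delivery instead of dropping logs or slowing down requests.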
  30. So now that the logs are in Hadoop, let’s talk about what we do with them. But first - quick show of hands – how many people here have heard of Hadoop? Ok, and how many people have heard of Hive? In a nutshell, Hive is a layer on top of Hadoop that lets you write SQL queries instead of MapReduce jobs to analyze data. Makes writing & maintaining complex jobs much easier.
  31. Why do we use Hive for analysis? Hive means not needing to know the questions in advance. We don’t need to worry about optimizing schemas for specific queries, so we can just log everything and figure out the questions later. This is really important, especially for startups. Our product and business have changed a lot, meaning we need to ask different questions. 18 months ago, we didn’t even do eCommerce. Shifting from product discovery to commerce didn’t take a rewrite of our analytics platform… just incremental changes to add new metrics.
  32. But, one big downside of this setup is that Hadoop is notoriously hard to manage. It needs constant attention to keep things scaling nicely. We really didn’t want to deal with that. Our friends at Treasure Data offer a scalable Hive as a service that we’ve been using since launch and it’s been great. Lets me focus on being a MongoDB expert without worrying about Hadoop.
  33. With all the logging and analysis done, now we need to make those results easily available to everyone in the company. This is where MongoDB comes in…
  34. Since we want to serve in real-time, we store the results of our analysis in MongoDB. I’ll show you the schemas in a minute, but, at a high level, we store 1 document for every segmentation we care about. For example, we store a count of daily impressions for each page and every combination of gender, country, iOS vs Android vs web, and experiment buckets. We have similar collections for clicks, sessions, transactions, and normalized metrics like clicks per session. This lets us really quickly read the results we need at a cost of lots of writes & storage. It’s a bit unsexy, but it works really well. In total, it uses about 2 TB of storage and takes a few hours to import overnight, which is easily manageable.
  35. So this is the first half of the schema for a document that describes number of clicks on something from a certain page. In this case, click type is ID 2 and the page ID is 1000. Apparently that happened about 20,000 times on whatever timestamp that is.
  36. And in the other half of the schema, we have the segmentations. So, those 20,000 clicks were from male users on Android in Canada that saw the billing zip help text experiment we talked about. There’d be another document for female, Canadian iPhone users that saw the help text. And female, Canadian Android users, etc, etc.
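Here is a small sketch of how one document per segmentation could be produced from raw click events. The field names (`day`, `click_type`, `page_id`, `gender`, `country`, `platform`, `experiment`) and the sample events are assumptions for illustration; the real collections may be laid out differently, and in the pipeline described here the counting happens in Hive, not in Python.

```python
from collections import Counter

# Hypothetical segmentation dimensions.
SEGMENT_FIELDS = ("gender", "country", "platform", "experiment")


def aggregate_clicks(events):
    """Count events per (day, click type, page, segmentation) combination."""
    counts = Counter()
    for e in events:
        key = (e["day"], e["click_type"], e["page_id"])
        key += tuple(e[f] for f in SEGMENT_FIELDS)
        counts[key] += 1

    docs = []
    for key, count in sorted(counts.items()):
        day, click_type, page_id, *segments = key
        doc = {"day": day, "click_type": click_type, "page_id": page_id, "count": count}
        doc.update(zip(SEGMENT_FIELDS, segments))
        docs.append(doc)
    return docs


# Made-up events: two identical segmentations and one that differs by platform.
events = [
    {"day": "2014-06-01", "click_type": 2, "page_id": 1000,
     "gender": "male", "country": "CA", "platform": "android", "experiment": "zip_help"},
    {"day": "2014-06-01", "click_type": 2, "page_id": 1000,
     "gender": "male", "country": "CA", "platform": "android", "experiment": "zip_help"},
    {"day": "2014-06-01", "click_type": 2, "page_id": 1000,
     "gender": "male", "country": "CA", "platform": "ios", "experiment": "zip_help"},
]

docs = aggregate_clicks(events)
for doc in docs:
    print(doc)
```

Each resulting dict would become one stored document, so a read for a given segmentation is a simple lookup. The price is one document per combination, which is where the write and storage cost comes from.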
  37. With all that work out of the way, we have all the data we need available in real-time for our users. On top of the collections we just talked about, there’s an API in Python that returns time series data for whatever metric & segmentation you need. With that, developers can run wild building whatever tools they need. Let’s take a look at 2 of the most powerful ones.
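A time-series call over those per-segmentation documents might look roughly like this. It is a sketch over an in-memory list, not the actual API: `time_series`, the field names, and the sample counts are hypothetical, and a real version would query MongoDB instead of looping over dicts.

```python
def time_series(docs, metric, **segments):
    """Return [(day, count), ...] for one metric, filtered by segmentation.

    Dimensions that are not passed in are summed over.
    """
    totals = {}
    for doc in docs:
        if doc["metric"] != metric:
            continue
        if any(doc.get(k) != v for k, v in segments.items()):
            continue
        totals[doc["day"]] = totals.get(doc["day"], 0) + doc["count"]
    return sorted(totals.items())


# Made-up precomputed documents, one per day and segmentation.
docs = [
    {"metric": "clicks", "day": "2014-06-01", "country": "CA", "platform": "android", "count": 5},
    {"metric": "clicks", "day": "2014-06-01", "country": "CA", "platform": "ios", "count": 3},
    {"metric": "clicks", "day": "2014-06-02", "country": "CA", "platform": "android", "count": 7},
    {"metric": "sessions", "day": "2014-06-01", "country": "CA", "platform": "android", "count": 2},
]

android_clicks = time_series(docs, "clicks", country="CA", platform="android")
all_ca_clicks = time_series(docs, "clicks", country="CA")
print(android_clicks)
print(all_ca_clicks)
```

Because the heavy lifting was done at import time, a dashboard can call something like this once per line on a graph and stay fast.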
  38. The first big tool we built is called Dashy. It’s a graph dashboard that lets you drill down into metrics over time. The graph you can see there shows number of a certain type of click per logged in user, broken down by what device the user was on.
  39. The other tool I want to share today is called Perimeter. It shows us the impact of A/B tests across many metrics. Most experiments are trade-offs: they move some metrics up & some down. Making these trade-offs clear helps us make better product & business decisions.
  40. To cap it all off, analytics empowers faster iteration. Iteration drives growth, engagement, and revenue.
  41. Without these tools, supported by all the infrastructure we just talked about, we wouldn’t be where we are today.
  42. Thanks for listening. If you have any questions, we’ll do Q&A at the end or feel free to shoot me an e-mail. Now I’m gonna hand the mic off to Kiyoto from Treasure Data to talk about how fluentd works and how easy it is to use.