Piano Media - approach to data gathering and processing
Nearly three years of continuous changes of approach to data gathering and processing
(Martin Strycek, Juraj Sottnik)
@rubyslava 2014
We'd better get it right the first time!
Starting point
● we had two developers
● we had one live server
● we had one cold backup
● we couldn’t store all the data
● we couldn’t process all the data
Batch processing - the downsides
● batch every 3 hours
○ delete old data

● updating counters
○ you need to define them upfront

● throwing away old data
○ developer point of view
■ you have no way to correct your mistake

○ business
■ you lose your data
Batch processing - the benefits
● you will learn
○ profiler is your best friend
○ optimizing can be hard and can take time

● what good access logs are good for
○ reconstruct your deleted data
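A minimal sketch of how deleted counters might be rebuilt from access logs. The log layout (common log format), the regular expression and the function name `rebuild_daily_pageviews` are assumptions made for illustration; the talk does not show the actual log format or tooling.

```python
import re
from collections import Counter

# Hypothetical access-log layout (common log format); the real logs may differ.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)

def rebuild_daily_pageviews(lines):
    """Recount successful GET requests per (day, path) from raw log lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        if m["method"] == "GET" and m["status"] == "200":
            counts[(m["day"], m["path"])] += 1
    return counts

sample = [
    '10.0.0.1 - alice [12/Mar/2014:10:00:01 +0100] "GET /article/1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Mar/2014:10:00:05 +0100] "GET /article/1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Mar/2014:10:00:09 +0100] "GET /missing HTTP/1.1" 404 0',
    'garbage line',
]
counts = rebuild_daily_pageviews(sample)
```

Only what the logs recorded can be recovered this way, so the more each log line carries (user, section, timestamp), the more of the lost data can be rebuilt.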
Business says:
save all data
Big Data
● It’s not only about the volume
● What are we going to do with it?
○ We had NO idea!

● We rented more servers.
○ We needed a place to store the data
Big Data
● We went the NoSQL way
○ MongoDB
■ easy replication, possible sharding
■ upsert
■ rich document-based queries - we still had one foot in the SQL world
■ fast prototype

● We were still doing batch processing
● ~15M impressions per day, resulting in ~5 GB of raw data per day
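For a sense of scale, the two figures above can be turned into per-impression and per-second numbers. This is back-of-the-envelope arithmetic only; it assumes decimal gigabytes and an even spread of traffic over the day, neither of which the slide states.

```python
IMPRESSIONS_PER_DAY = 15_000_000   # "~15m impressions per day" from the slide
RAW_BYTES_PER_DAY = 5 * 10**9      # "~5GB", taken as decimal gigabytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_impression = RAW_BYTES_PER_DAY / IMPRESSIONS_PER_DAY
impressions_per_second = IMPRESSIONS_PER_DAY / SECONDS_PER_DAY  # assumes even traffic
raw_gb_per_30_days = 30 * RAW_BYTES_PER_DAY / 10**9

print(round(bytes_per_impression))    # 333 bytes per impression on average
print(round(impressions_per_second))  # 174 impressions per second on average
print(round(raw_gb_per_30_days))      # 150 GB of raw data per 30 days
```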
Big Data
● each day as a collection
○ easy for batch processing

● each impression as a document
● adding processed parameters over time
● pulling data from 30 collections
○ server is not responding
○ virtual memory is low
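A small sketch of why a monthly question ends up touching every daily collection. The naming scheme `impressions_YYYYMMDD`, the `user` field and the in-memory `store` dictionary are stand-ins invented for illustration; the real setup used MongoDB collections whose names are not given in the talk.

```python
from datetime import date, timedelta

def collection_names(first_day, days):
    """Hypothetical naming scheme: one collection per day, e.g. impressions_20140301."""
    return [f"impressions_{first_day + timedelta(days=i):%Y%m%d}" for i in range(days)]

# Stand-in for the database: collection name -> list of impression documents.
store = {name: [] for name in collection_names(date(2014, 3, 1), 3)}
store["impressions_20140301"] += [{"user": "u1"}, {"user": "u2"}]
store["impressions_20140302"] += [{"user": "u1"}]
store["impressions_20140303"] += [{"user": "u3"}, {"user": "u1"}]

def unique_users(store, names):
    """Uniques over a period need every day's documents in one pass."""
    seen = set()
    for name in names:
        seen.update(doc["user"] for doc in store[name])
    return len(seen)

names = collection_names(date(2014, 3, 1), 3)
daily = [unique_users(store, [n]) for n in names]   # uniques per single day
period = unique_users(store, names)                 # uniques over all three days
```

Here the daily uniques add up to 5 while the period has only 3 unique users, so per-day results cannot simply be summed; the set of seen users has to be held for the whole period, which is where memory pressure comes from.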
Big Data - analytics
● Visitor counts per website/section
○ active - with subscription
○ inactive - without subscription
○ anonymous

● Content consumption
○ how many pageviews
■ active
■ inactive
■ anonymous

● and others
Business asks:
how many UNIQUE users
did … in a month
What we really need
● COUNT(* || DISTINCT ...) GROUP BY
○ entities
○ date periods (day, week, month)
○ combination of entities and date periods [and some other flags]

● Special demands from the analytics team
○ Not too hard to implement with SQL magic

● As fast as possible
○ At minimum, as fast as the data comes in

● Still store all historical raw data
○ Ideally compressed
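The query shape named on this slide, sketched with the standard-library `sqlite3` module. The table name `impressions`, its columns and the sample rows are assumptions made for illustration, not the schema used in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (entity_type TEXT, entity_id INTEGER, day TEXT, user_id TEXT)"
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?, ?)",
    [
        ("section", 7, "2014-03-01", "u1"),
        ("section", 7, "2014-03-01", "u1"),
        ("section", 7, "2014-03-01", "u2"),
        ("section", 7, "2014-03-02", "u1"),
        ("website", 1, "2014-03-01", "u3"),
    ],
)

def counts_by(period_expr):
    """Pageviews and unique users per entity and date period."""
    sql = f"""
        SELECT entity_type, entity_id, {period_expr} AS period,
               COUNT(*) AS pageviews, COUNT(DISTINCT user_id) AS uniques
        FROM impressions
        GROUP BY entity_type, entity_id, period
        ORDER BY entity_type, entity_id, period
    """
    return conn.execute(sql).fetchall()

per_day = counts_by("day")
per_month = counts_by("strftime('%Y-%m', day)")
```

With these rows, section 7 has 4 pageviews but only 2 unique users in the month, which is the distinction the business question is about.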
What to do
● Processing raw data?
○ Uses a lot of space before you get a result
■ We need to store historical data anyway
■ You can store compressed files (LZO) in Hadoop

● Sharding
○ For how long?
○ How to properly determine the sharding key(s)?

● Do you really have a big amount of data?
● Do you have the hardware for running Hadoop? Really?
● What does overnight batch processing really mean?
Naive solution
● Separate counter for each needed combination, updated for each impression, maybe touching the DB
○ Fast to generate a unique key for a combination
■ md5([entityType, entityId, day, dayId].join("|"))

○ Really fast to get a value
■ Always by primary key
■ Multiget

○ Need to define all GROUP BY combinations at the beginning
○ Failure while processing one impression
■ Need to increment counters in a transaction
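The naive counter scheme can be sketched as follows. The key layout follows the md5 expression on the slide; the dictionary `counters` stands in for a key-value store, and the helper names `counter_key`, `record_impression` and `multiget` as well as the week and month periods are hypothetical.

```python
import hashlib

def counter_key(entity_type, entity_id, period, period_id):
    """Fixed-length key for one GROUP BY combination, as on the slide."""
    raw = "|".join(str(part) for part in (entity_type, entity_id, period, period_id))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

counters = {}  # stand-in for a key-value store addressed by primary key

def record_impression(entity_type, entity_id, day_id, week_id, month_id):
    """Bump every predefined combination for one impression."""
    for period, period_id in (("day", day_id), ("week", week_id), ("month", month_id)):
        key = counter_key(entity_type, entity_id, period, period_id)
        counters[key] = counters.get(key, 0) + 1

def multiget(keys):
    """Fetch several counters at once; missing counters read as zero."""
    return [counters.get(k, 0) for k in keys]

record_impression("section", 7, "2014-03-01", "2014-W09", "2014-03")
record_impression("section", 7, "2014-03-02", "2014-W09", "2014-03")

values = multiget([
    counter_key("section", 7, "day", "2014-03-01"),
    counter_key("section", 7, "week", "2014-W09"),
    counter_key("section", 7, "month", "2014-03"),
])
```

The loop in `record_impression` is not atomic: a crash after the day counter but before the month counter leaves them inconsistent, which is why the slide asks for a transaction. A combination that was not listed up front has no counter at all.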
Real world solution
● Kafka
○ Buffering incoming data
○ Web workers as producers

● Storm / Trident
○ Consuming data from Kafka
○ Processing incoming data
○ Using Cassandra as the storage backend

● Cassandra
○ Holding counters and helper information to determine uniqueness
Storm
● Real-time processing of unbounded streams of data
○ Processing data as it comes
○ You still need to have computing power
○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything into steps of counter updates
○ Java, but bolts can be written in different languages
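A plain-Python sketch of what turning COUNT(*) and COUNT(DISTINCT ...) GROUP BY into per-impression counter updates can look like. It is not Storm code; the class `StreamingCounts`, the function `groups_for` and the impression fields are assumptions made for illustration, and the in-memory `seen` set stands in for the helper data kept in storage.

```python
from collections import defaultdict

class StreamingCounts:
    """Incremental equivalent of COUNT(*) and COUNT(DISTINCT user) per group."""

    def __init__(self):
        self.pageviews = defaultdict(int)
        self.uniques = defaultdict(int)
        self.seen = set()  # helper data: (group, user) pairs already counted

    def update(self, group, user):
        self.pageviews[group] += 1
        if (group, user) not in self.seen:
            self.seen.add((group, user))
            self.uniques[group] += 1

def groups_for(impression):
    """Every group one impression contributes to (entity x date period)."""
    entity = (impression["entity_type"], impression["entity_id"])
    day = impression["day"]
    yield entity + ("day", day)
    yield entity + ("month", day[:7])

stream = [
    {"entity_type": "section", "entity_id": 7, "day": "2014-03-01", "user": "u1"},
    {"entity_type": "section", "entity_id": 7, "day": "2014-03-01", "user": "u2"},
    {"entity_type": "section", "entity_id": 7, "day": "2014-03-02", "user": "u1"},
]

counts = StreamingCounts()
for impression in stream:
    for group in groups_for(impression):
        counts.update(group, impression["user"])

month = ("section", 7, "month", "2014-03")
```

Each impression is handled once as it arrives, and reading a result is a single lookup such as `counts.uniques[month]`. The price is the `seen` helper data, which grows with the number of distinct (group, user) pairs.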
Storm
● Spouts
● Bolts
Trident
● High-level abstraction over Storm
○ Joins
○ Aggregations
○ Grouping
○ Filtering
○ Functions
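To make the listed operations concrete, here is a plain-Python analogue of a filter, a function, and a grouped aggregation chained over a stream of tuples. This is not the Trident API; the helper names `each_filter`, `each_function` and `group_and_count` and the impression fields are invented for illustration.

```python
from collections import Counter

def each_filter(stream, predicate):
    """Filtering: keep only tuples for which the predicate holds."""
    return (t for t in stream if predicate(t))

def each_function(stream, fn):
    """Function: emit each tuple extended with the extra fields fn returns."""
    return ({**t, **fn(t)} for t in stream)

def group_and_count(stream, *fields):
    """Grouping plus aggregation: count tuples per combination of fields."""
    return Counter(tuple(t[f] for f in fields) for t in stream)

impressions = [
    {"user": "u1", "section": "sport", "ts": "2014-03-01T10:00:00", "bot": False},
    {"user": "u2", "section": "sport", "ts": "2014-03-01T11:00:00", "bot": False},
    {"user": "x9", "section": "sport", "ts": "2014-03-01T11:30:00", "bot": True},
    {"user": "u1", "section": "news", "ts": "2014-03-02T09:00:00", "bot": False},
]

humans = each_filter(impressions, lambda t: not t["bot"])
with_day = each_function(humans, lambda t: {"day": t["ts"][:10]})
counts = group_and_count(with_day, "section", "day")
```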
Trident
● Operating in transactions
● Persistent aggregation
○ “Memcached”
○ Cassandra

● DRPC calls
○ No need to touch Cassandra

● Local cluster for development
● Easy to learn basics
● Hard to discover advanced stuff
○ Lack of documentation
○ Need to tune configuration
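A sketch of the idea behind transactional persistent aggregation: if the transaction id of the last applied batch is stored next to each value, a replayed batch can be recognised and skipped. This is a simplified Python illustration, not Trident's implementation; the class `TransactionalCounter` is hypothetical, and it assumes batches are applied in order and that a replayed batch has the same id and contents.

```python
class TransactionalCounter:
    """Keeps (last_txid, count) per key so a replayed batch is applied at most once."""

    def __init__(self):
        self.state = {}  # key -> (txid of the last applied batch, count)

    def apply_batch(self, txid, batch_counts):
        for key, delta in batch_counts.items():
            last_txid, value = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue  # this batch already updated the key; skip the replay
            self.state[key] = (txid, value + delta)

    def get(self, key):
        return self.state.get(key, (None, 0))[1]

counter = TransactionalCounter()
counter.apply_batch(1, {"section:7": 3})
counter.apply_batch(1, {"section:7": 3})  # replay after a failure, ignored
counter.apply_batch(2, {"section:7": 2})
total = counter.get("section:7")
```

Because the check is per key, a batch that failed halfway can be replayed in full: keys it already updated are skipped and the rest are applied.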
Trident
● Functions
○ You can do everything you want
■ Touch DB, read emails, …

○ Stay with Java
■ No dependency problems
■ No performance penalty

● Topology
○ Good to define at the beginning
■ Spend time on a detailed diagram
■ Saves you during implementation and future updates

○ Don’t make it too complex
■ Problems with loading it
Trident
Cassandra
● Already in our production on a different project
● No SPOF
● Multi-master
● Scalable
● More good stuff
● Lots of new features in 2.x
○ Lightweight transactions
○ Lots of fixes
■ Good old times on 0.8
■ Our bug report from 2011 - double load of the commit log on node start :)
Kafka
● A high-throughput distributed messaging system
● Something like a distributed commit log
○ You can set retention
○ You can move the reading offset back
■ Used by Trident transactions

● Cluster
● Ideal for use with Trident
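The commit-log behaviour described above (retention plus a reader-controlled offset) can be modelled in a few lines. This is an in-memory toy, not a Kafka client; the classes `CommitLog` and `Consumer` are hypothetical, and retention here is a record count, whereas Kafka configures retention by time or size.

```python
class CommitLog:
    """Append-only log that drops its oldest records beyond a retention limit."""

    def __init__(self, retention):
        self.retention = retention  # max records kept (count-based, an assumption)
        self.records = []
        self.base_offset = 0        # offset of records[0]

    def append(self, record):
        self.records.append(record)
        overflow = len(self.records) - self.retention
        if overflow > 0:
            del self.records[:overflow]
            self.base_offset += overflow
        return self.base_offset + len(self.records) - 1

    def read(self, offset, max_records):
        if offset < self.base_offset:
            raise LookupError("offset already removed by retention")
        start = offset - self.base_offset
        return self.records[start:start + max_records]

class Consumer:
    """The reader, not the log, owns the offset, so it can move it back."""

    def __init__(self, log):
        self.log = log
        self.offset = log.base_offset

    def poll(self, max_records=2):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset

log = CommitLog(retention=4)
for record in ["a", "b", "c", "d", "e"]:
    log.append(record)   # "a" falls out of retention once "e" arrives

consumer = Consumer(log)
first = consumer.poll()
consumer.seek(1)         # processing failed: rewind and read the same batch again
replayed = consumer.poll()
```

Rewinding returns exactly the same batch, which is what lets a failed batch be reprocessed; anything older than the retention window can no longer be read.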
Business asks:
are you ready for ~250M impressions per day?
Thank you.

 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Piano Media - approach to data gathering and processing

  • 1. nearly three years of continuous changes of approach to data gathering and processing (Martin Strycek, Juraj Sottnik) @rubyslava 2014
  • 2. We better get it right first time!
  • 3. Starting point ● we had two developers ● we had one live server ● we had one cold backup ● we can’t store all the data ● we can’t process all the data
  • 4. Batch processing - the downsides ● batch every 3 hours ○ delete old data ● updating counters ○ you need to define them upfront ● throwing away old data ○ developer point of view ■ you have no way to correct your mistake ○ business ■ you lose your data
  • 5. Batch processing - the benefits ● you will learn ○ profiler is your best friend ○ optimizing can be hard and can take time ● what access logs are good for ○ reconstructing your deleted data
  • 7. Big Data ● It’s not only about the volume ● What are we going to do with it? ○ We had NO idea! ● We rented more servers. ○ We needed a place to store the data
  • 8. Big Data ● We went the NoSQL way ○ MongoDB ■ easy replication, possible sharding ■ upsert ■ rich document-based queries - we still had one foot in the SQL world ■ fast prototyping ● We were still doing batch processing ● ~15m impressions per day, adding up to ~5GB of raw data per day
  • 9. Big Data ● each day as collection ○ easy for batch processing ● each impression as a document ● adding processed parameters over time ● pulling data from 30 collections ○ server is not responding ○ virtual memory is low
  • 10. Big Data - analytics ● Visitor counts on website/section ○ active - with subscription ○ inactive - without subscription ○ anonymous ● Content consumption ○ how many pageviews ■ active ■ inactive ■ anonymous ● and others
  • 11. Business asks: how many UNIQUE users did … in month
  • 12. What we really need ● COUNT(* || DISTINCT ...) GROUP BY ○ entities ○ date periods (day, week, month) ○ combination of entities and date periods [and some other flags] ● Special demands from analytics team ○ Not too hard to implement with SQL magic ● As fast as possible ○ At least as fast as the data comes in ● Still store all historical raw data ○ Ideally compressed
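To make the requirement on slide 12 concrete, here is a minimal sketch of the kind of query the business question boils down to, run against an in-memory SQLite database. The table name, column names, and sample rows are assumptions made for illustration only; the slides do not describe the actual schema.

```python
import sqlite3

# Hypothetical schema: one row per impression (not taken from the slides).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE impressions (user_id TEXT, section TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [
        ("u1", "news", "2014-01-03"),
        ("u1", "news", "2014-01-17"),
        ("u2", "news", "2014-01-17"),
        ("u1", "sport", "2014-02-01"),
    ],
)

# COUNT(*) and COUNT(DISTINCT ...) grouped by entity and date period (month).
report = conn.execute(
    """
    SELECT section,
           substr(day, 1, 7) AS month,
           COUNT(*) AS pageviews,
           COUNT(DISTINCT user_id) AS unique_users
    FROM impressions
    GROUP BY section, month
    ORDER BY section, month
    """
).fetchall()

for row in report:
    print(row)
```

A query like this is easy to write but has to scan every raw impression in the period, which is the cost the later slides try to avoid.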
  • 13. What to do ● Processing raw data? ○ Uses a lot of space before you get a result ■ We need to store historical data anyway ■ You can store compressed files (LZO) in Hadoop ● Sharding ○ For how long? ○ How to properly determine sharding key(s)? ● Do you really have a big amount of data? ● Do you have the hardware for running Hadoop? Really? ● What does overnight batch processing really mean?
  • 14. Naive solution ● Separate counter for each needed combination, updated for each impression, possibly touching the DB each time ○ Fast to generate a unique key for a combination ■ md5([entityType, entityId, day, dayId].join("|")) ○ Really fast to get a value ■ Always by primary key ■ Multiget ○ Need to define all GROUP BY combinations at the beginning ○ Failure while processing one impression ■ Need to increment counters in a transaction
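The naive counter scheme from slide 14 can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: the field names, the set of combinations, and the in-memory dictionary standing in for the database are all hypothetical.

```python
import hashlib
from collections import defaultdict


def counter_key(entity_type, entity_id, period, period_id):
    """Build the primary key of one counter, like md5([...].join("|"))."""
    raw = "|".join(str(part) for part in (entity_type, entity_id, period, period_id))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Every GROUP BY combination has to be listed up front (hypothetical set).
ENTITY_FIELDS = ("website", "section")
PERIOD_FIELDS = ("day", "week", "month")

# Stand-in for the key-value store; a real store would need a transaction
# around the increments that belong to one impression.
counters = defaultdict(int)


def keys_for(impression):
    """All counter keys touched by a single impression."""
    return [
        counter_key(entity, impression[entity], period, impression[period])
        for entity in ENTITY_FIELDS
        for period in PERIOD_FIELDS
    ]


def record_impression(impression):
    for key in keys_for(impression):
        counters[key] += 1


def multiget(keys):
    """Read many counters at once; missing counters read as zero."""
    return [counters.get(key, 0) for key in keys]


record_impression({"website": 1, "section": 10, "day": "2014-01-03",
                   "week": "2014-W01", "month": "2014-01"})
record_impression({"website": 1, "section": 11, "day": "2014-01-04",
                   "week": "2014-W01", "month": "2014-01"})

print(multiget([counter_key("website", 1, "month", "2014-01")]))
```

Reads are cheap because every lookup is by primary key, but a combination that was not defined before the impression arrived can never be counted afterwards.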
  • 15. Real world solution ● Kafka ○ Buffering incoming data ○ Web workers as producers ● Storm / Trident ○ Consuming data from Kafka ○ Processing incoming data ○ Using Cassandra as storage backend ● Cassandra ○ Holding counters and helper information to determine uniqueness
  • 16. Storm ● Real time processing of unbounded streams of data ○ Processing data as they come ○ You still need to have computing power ○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything to steps of updates of counters ○ Java, but bolts can be written in different languages
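Slide 16 says that COUNT(* || DISTINCT ...) GROUP BY has to be turned into a sequence of counter updates. One way to picture this, as a plain in-memory sketch rather than Storm code, is to keep a "seen" marker per group and user next to the counters. The class and attribute names below are assumptions; the slides only say that helper information for uniqueness is kept alongside the counters.

```python
from collections import defaultdict


class StreamingGroupCounter:
    """Incrementally maintains COUNT(*) and COUNT(DISTINCT user) per group."""

    def __init__(self):
        self.totals = defaultdict(int)   # COUNT(*) per group
        self.uniques = defaultdict(int)  # COUNT(DISTINCT user) per group
        self.seen = set()                # helper data: (group, user) markers

    def update(self, group, user_id):
        """Process one impression as it arrives."""
        self.totals[group] += 1
        marker = (group, user_id)
        if marker not in self.seen:
            # First time this user shows up in this group.
            self.seen.add(marker)
            self.uniques[group] += 1


counter = StreamingGroupCounter()
stream = [
    (("news", "2014-01"), "u1"),
    (("news", "2014-01"), "u1"),
    (("news", "2014-01"), "u2"),
    (("sport", "2014-01"), "u1"),
]
for group, user_id in stream:
    counter.update(group, user_id)

print(dict(counter.totals))
print(dict(counter.uniques))
```

The price of answering "how many unique users" without rescanning raw data is the extra storage for the seen markers, which grows with the number of distinct (group, user) pairs.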
  • 18. Trident ● High level abstraction over Storm ○ Joins ○ Aggregations ○ Grouping ○ Filtering ○ Functions
  • 19. Trident ● Operating in transactions ● Persistent aggregation ○ “Memcached” ○ Cassandra ● DRPC calls ○ No need to touch Cassandra ● Local cluster for development ● Easy to learn basics ● Hard to discover advanced stuff ■ Lack of documentation ■ Need to tune configuration
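The combination of transactions and persistent aggregation on slide 19 matters because a failed batch may be replayed, and a replay must not increment counters twice. The sketch below shows the general idea of storing a transaction id next to each value so that a repeated batch is skipped. It is a simplified illustration with hypothetical names, not Trident's API, and it assumes that a replayed batch carries the same id and the same content as the original.

```python
class TransactionalCounterStore:
    """Counters that ignore a batch already applied under the same txid."""

    def __init__(self):
        # key -> (txid of the last batch applied to this key, current count)
        self.state = {}

    def apply_batch(self, txid, increments):
        for key, delta in increments.items():
            last_txid, count = self.state.get(key, (None, 0))
            if last_txid == txid:
                # This batch already updated this key; a replay changes nothing.
                continue
            self.state[key] = (txid, count + delta)

    def get(self, key):
        return self.state.get(key, (None, 0))[1]


store = TransactionalCounterStore()
store.apply_batch(1, {"news": 3, "sport": 1})
store.apply_batch(1, {"news": 3, "sport": 1})  # replay after a failure
store.apply_batch(2, {"news": 2})

print(store.get("news"), store.get("sport"))
```

Because the check is per key, a batch that failed halfway can be replayed safely: keys it already updated are skipped and the rest are applied.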
  • 20. Trident ● Functions ○ You can do everything you want ■ Touch DB, read emails, … ○ Stay with Java ■ No dependency problems ■ No performance penalty ● Topology ○ Good to define at the beginning ■ Spend time on a detailed diagram ■ It saves you during implementation and future updates ○ Don’t make it too complex ■ Problems with loading it
  • 22. Cassandra ● Already in our production on a different project ● No SPOF ● Multi Master ● Scalable ● More good stuff ● Lots of new features in 2.x ○ Lightweight transactions ○ Lots of fixes ■ Good old times on 0.8 ■ Our bug report from 2011 - Double load of commit log on node start :)
  • 23. Kafka ● A high-throughput distributed messaging system ● Something like a distributed commit log ○ You can set retention ○ You can move the reading offset back ■ Used by Trident transactions ● Cluster ● Ideal for use with Trident
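The "commit log" description on slide 23 can be illustrated with a toy in-memory log. This is only a model of the two properties the slide names, retention and a reader-controlled offset; the class and method names are hypothetical and do not correspond to the Kafka client API.

```python
class CommitLog:
    """Append-only log that keeps only the newest `retention` records."""

    def __init__(self, retention):
        self.retention = retention
        self.base_offset = 0  # offset of the oldest record still retained
        self.records = []

    def append(self, record):
        """Append a record and return its offset."""
        self.records.append(record)
        overflow = len(self.records) - self.retention
        if overflow > 0:
            # Retention: drop the oldest records, offsets keep growing.
            self.records = self.records[overflow:]
            self.base_offset += overflow
        return self.base_offset + len(self.records) - 1

    def read(self, offset, max_records):
        """Read from any retained offset; the reader decides where to start."""
        if offset < self.base_offset:
            raise LookupError("offset %d is past retention" % offset)
        start = offset - self.base_offset
        return self.records[start:start + max_records]


log = CommitLog(retention=3)
offsets = [log.append(item) for item in ["a", "b", "c", "d", "e"]]

position = 2
first_read = log.read(position, 2)   # consume two records
position += len(first_read)
replayed = log.read(position - 2, 2)  # move the offset back and read again

print(offsets, first_read, replayed)
```

Because the log itself does not track what a reader has consumed, a consumer that failed mid-batch can move its offset back and reprocess, as long as the records are still within retention.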
  • 24. Business asks: are you ready for ~250m impressions per day?